Open pinkwah opened 9 months ago
Emphasizing the importance of (at least some) pre-validation I've noticed error messages in the logs of the following type:
Exception in scheduler task job-8_task: Command .... failed after 10 retries with exit code 255, output: "<empty>", and error: "mr7: No such queue. Job not submitted
where the logs can easily get cluttered, since this naturally fails on every realization.
Referenced in this one: #8116
@sondreso do you think it is fine to close this one, since there is no reasonable and easy way to check it?
Why is the way using `bqueues` as outlined in the issue not an option? 🤔
> Why is the way using `bqueues` as outlined in the issue not an option? 🤔

`bqueues` is flaky and can still fail, i.e. it is not a reliable source of working queues. Additionally, this would prolong the validation step substantially. @berland was there anything else we discussed?
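One way to live with a flaky `bqueues` is to treat its answer as three-valued and to bound how long it may take. The sketch below is only an illustration: `QueueStatus` and `probe_queue` are hypothetical names, the timeout value is arbitrary, and it assumes that an unknown queue is reported with the text "No such queue", as in the log line quoted above. Anything else (timeout, missing binary, other failures) is reported as unknown rather than as an invalid queue.

```python
import subprocess
from enum import Enum
from typing import Sequence


class QueueStatus(Enum):
    VALID = "valid"      # the command accepted the queue name
    INVALID = "invalid"  # the command explicitly said the queue does not exist
    UNKNOWN = "unknown"  # the check itself failed or timed out


def probe_queue(
    queue_name: str,
    cmd: Sequence[str] = ("bqueues",),  # command prefix; replaceable for testing
    timeout: float = 10.0,              # arbitrary bound on validation time
) -> QueueStatus:
    try:
        result = subprocess.run(
            [*cmd, queue_name],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary or a hanging command: inconclusive, not "invalid".
        return QueueStatus.UNKNOWN
    if result.returncode == 0:
        return QueueStatus.VALID
    # Assumption: the error text matches the message seen in the logs.
    if "No such queue" in result.stdout + result.stderr:
        return QueueStatus.INVALID
    return QueueStatus.UNKNOWN
```

With this shape, only `INVALID` would justify stopping the user early; `UNKNOWN` could be logged as a warning and the run allowed to continue.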
Fixing this issue has merely been downprioritized; it is not impossible to do. The PoC used `qsub` directly, which could give the same kind of information that `bqueues` could (well, it cannot list the allowed queue names though).
The upside is not clear. It would reduce the log output to the screen for those running the GUI, provided we are willing to wait for the result of the checks.
> Fixing this issue has merely been downprioritized, it is not impossible to do.

This was my impression as well, and then I don't think the issue should be closed.
The intent of this issue, to give early and precise feedback to the user in case of problems with the queue system, is something we should strive for. (This issue is perhaps focused on the technical side of the problem, but the issue that was closed as a duplicate of this one is more focused on the user experience: https://github.com/equinor/ert/issues/8116). While we might not be able to do this in the suggester due to performance reasons, there is still a lot of room for accurate error messages in the case something goes wrong when submitting jobs.
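As an illustration of more accurate errors at submit time, a submit loop could stop retrying as soon as the scheduler reports something that a retry cannot fix. This is a minimal sketch, not how the scheduler is actually implemented: `submit_with_retries`, `SubmitError` and `FATAL_MARKERS` are hypothetical names, and the only marker used is the "No such queue" text from the log line quoted at the top of this issue.

```python
from typing import Callable, Tuple

# Assumption: error texts that retrying cannot fix. Only the message seen
# in the quoted log is listed here.
FATAL_MARKERS = ("No such queue",)


class SubmitError(RuntimeError):
    """Raised when a job could not be submitted."""


def submit_with_retries(
    submit: Callable[[], Tuple[int, str]],  # returns (exit_code, stderr)
    retries: int = 10,
) -> int:
    """Call submit() until it succeeds; return the number of attempts used."""
    last_error = ""
    for attempt in range(1, retries + 1):
        exit_code, stderr = submit()
        if exit_code == 0:
            return attempt
        last_error = stderr.strip()
        if any(marker in stderr for marker in FATAL_MARKERS):
            # Fail fast with one precise message instead of retrying.
            raise SubmitError(f"Submission rejected, not retrying: {last_error}")
    raise SubmitError(f"Submission failed after {retries} attempts: {last_error}")
```

With a check like this, a misspelled queue name would produce a single message per realization instead of ten failed attempts each.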
Also, if we close issues due to technical reasons or implementation difficulty, we should document why in the issue. That makes it a lot easier to re-assess the issue in the future if the assumptions change.
We can make the following assumptions:

- `qsub /usr/bin/true` will execute on some queue even though it's not specified.
- The `LocalDriver` is an exception, but we may pretend it has a queue called `local`.

We may check whether the queue is valid and exit early if there is an issue with the user's chosen queue. To do this, we should extend `Driver` with a method that behaves as follows:

- For `LocalDriver` this function does nothing.
- `LSFDriver` may run `bqueues` to verify that the user's queue exists, and raise `ValueError` (or an appropriate exception) if it doesn't.

This makes it possible to check that the queue seems okay long before we submit any jobs. Maybe we can have a `@classmethod` function `check_queue` which is run when the GUI starts up, so we can show an error message to the user.
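The method definition itself did not survive in this text, so the following is only a sketch of what the description above suggests, not the actual ert code. The class bodies, the signature of `check_queue`, and the injectable `run` parameter (added here purely so the check can be tested without an LSF installation) are all assumptions. It also assumes that `bqueues <queue_name>` exits with a non-zero code when the queue does not exist.

```python
import subprocess
from abc import ABC, abstractmethod
from typing import Callable, Optional, Sequence

# Hypothetical type for a replaceable command runner.
Runner = Callable[[Sequence[str]], subprocess.CompletedProcess]


def _run(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    # The timeout keeps a slow bqueues from stalling startup; it may raise
    # subprocess.TimeoutExpired, which callers can handle separately.
    return subprocess.run(
        cmd, capture_output=True, text=True, timeout=30, check=False
    )


class Driver(ABC):
    @classmethod
    @abstractmethod
    def check_queue(cls, queue_name: Optional[str], run: Runner = _run) -> None:
        """Raise ValueError if queue_name is known to be unusable."""


class LocalDriver(Driver):
    @classmethod
    def check_queue(cls, queue_name: Optional[str] = None, run: Runner = _run) -> None:
        return None  # nothing to validate when running locally


class LSFDriver(Driver):
    @classmethod
    def check_queue(cls, queue_name: Optional[str] = None, run: Runner = _run) -> None:
        if queue_name is None:
            return None  # no queue chosen; the scheduler picks one
        result = run(["bqueues", queue_name])
        if result.returncode != 0:
            raise ValueError(
                f"LSF queue {queue_name!r} was rejected by bqueues: "
                f"{result.stderr.strip()}"
            )
```

At GUI startup this could be called as `LSFDriver.check_queue(configured_queue)` inside a `try`/`except ValueError` block, where `configured_queue` stands for whatever queue name the user configured, and the exception text shown to the user.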