equinor / ert

ERT - Ensemble based Reservoir Tool - is designed for running ensembles of dynamical models such as reservoir models, in order to do sensitivity analysis and data assimilation. ERT supports data assimilation using the Ensemble Smoother (ES), Ensemble Smoother with Multiple Data Assimilation (ES-MDA) and Iterative Ensemble Smoother (IES).
https://ert.readthedocs.io/en/latest/
GNU General Public License v3.0

Handling of `QUEUE_OPTION [..] QUEUE` in Scheduler #7112

Open pinkwah opened 9 months ago

pinkwah commented 9 months ago

We can make the following assumptions:

  1. Each HPC system has a way to choose a named queue.
  2. Each HPC system has a default queue it chooses. I.e., `qsub /usr/bin/true` will execute on some queue even though none is specified.
  3. The user may enter their preferred queue, which the driver must attempt to use.
  4. The user may enter incorrect information.

The LocalDriver is an exception, but we may pretend it has a queue called local.

We may check whether the queue is valid and exit early if there is an issue with their chosen queue. To do this, we should extend Driver with the following method:

    async def use_queue(self, queue_name: str) -> None:
        """
        Submit jobs to `queue_name` queue.

        Raises:
            ValueError if there is an issue with the queue
        """

For LocalDriver this function does nothing. LSFDriver may run bqueues to verify that the user's queue exists, and raise ValueError (or an appropriate exception) if it doesn't.

This makes it possible to check that the queue seems okay long before we submit any jobs. Maybe we can have a @classmethod function check_queue which is run when the GUI starts up, so we can show an error message to the user.
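A minimal sketch of the proposal above, under stated assumptions. The `Driver`, `LocalDriver` and `LSFDriver` classes here are simplified stand-ins, not ERT's actual classes, and the `run` parameter and `run_command` helper are hypothetical names introduced so the check can be exercised without an LSF installation. It also assumes that `bqueues <queue_name>` exits with a non-zero code and writes a message to stderr when the queue is unknown.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence, Tuple

# Hypothetical helper type: runs a command and returns (exit_code, stdout, stderr).
RunCommand = Callable[[Sequence[str]], Awaitable[Tuple[int, str, str]]]


async def run_command(cmd: Sequence[str]) -> Tuple[int, str, str]:
    """Run `cmd` as a subprocess and capture its exit code and output."""
    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await process.communicate()
    return (
        process.returncode or 0,
        stdout.decode(errors="replace"),
        stderr.decode(errors="replace"),
    )


class Driver:
    """Simplified stand-in for the scheduler's driver base class."""

    async def use_queue(self, queue_name: str) -> None:
        raise NotImplementedError


class LocalDriver(Driver):
    async def use_queue(self, queue_name: str) -> None:
        # There is no real queue to validate for local execution.
        return None


class LSFDriver(Driver):
    def __init__(self, run: RunCommand = run_command) -> None:
        self._run = run
        self._queue_name: Optional[str] = None

    async def use_queue(self, queue_name: str) -> None:
        # Assumption: bqueues fails with a non-zero exit code for unknown queues.
        exit_code, _, stderr = await self._run(["bqueues", queue_name])
        if exit_code != 0:
            raise ValueError(
                f"Queue {queue_name!r} is not usable: {stderr.strip() or '<empty>'}"
            )
        self._queue_name = queue_name
```

A startup check could then call `await driver.use_queue(name)` once, catch `ValueError`, and show its message to the user instead of letting every realization fail at submit time.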

xjules commented 5 months ago

To emphasize the importance of (at least some) pre-validation: I've noticed error messages of the following type in the logs:

Exception in scheduler task job-8_task: Command .... failed after 10 retries with exit code 255, output: "<empty>", and error: "mr7: No such queue. Job not submitted

Naturally, the logs easily get cluttered when this fails on every realization.
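One possible way to limit this clutter, sketched here as an assumption rather than a description of how the scheduler currently behaves: treat stderr messages such as `No such queue` as non-retryable, and log each distinct error only once. The names `NON_RETRYABLE_MARKERS`, `is_retryable` and `OncePerMessage` are hypothetical.

```python
import logging
from typing import Set

logger = logging.getLogger(__name__)

# Assumed list of stderr fragments that retrying will not fix.
NON_RETRYABLE_MARKERS = ("No such queue",)


def is_retryable(stderr: str) -> bool:
    """Return False when stderr shows an error that a retry cannot resolve."""
    return not any(marker in stderr for marker in NON_RETRYABLE_MARKERS)


class OncePerMessage:
    """Log each distinct error message a single time across all realizations."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def error(self, message: str) -> bool:
        """Log `message` if it is new; return True if it was logged."""
        if message in self._seen:
            return False
        self._seen.add(message)
        logger.error(message)
        return True
```

With this, a submit failure like the one quoted above would stop after the first attempt and appear once in the log, instead of once per realization after 10 retries each.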

xjules commented 3 months ago

Referenced in this one: #8116

xjules commented 3 months ago

@sondreso do you think it is fine to close this one as there is not a reasonable and easy way to check it?

sondreso commented 3 months ago

Why is the way using bqueues as outlined in the issue not an option? 🤔

xjules commented 2 months ago

> Why is the way using bqueues as outlined in the issue not an option? 🤔

bqueues is flaky and can still fail, i.e. it is not a reliable source of working queues. Additionally, this would prolong the validation step substantially. @berland was there anything else we discussed?

berland commented 2 months ago

Fixing this issue has merely been downprioritized, it is not impossible to do. The PoC used qsub directly, which can give the same kind of information as bqueues (although it cannot list the allowed queue names).

The upside is not clear: it would reduce the log output to the screen for those running the GUI, provided we are willing to wait for the checks to report their status.

sondreso commented 2 months ago

> Fixing this issue has merely been downprioritized, it is not impossible to do.

This was my impression as well, and then I don't think the issue should be closed.

The intent of this issue, to give early and precise feedback to the user in case of problems with the queue system, is something we should strive for. (This issue is perhaps focused on the technical side of the problem, while the issue that was closed as a duplicate of this one is more focused on the user experience: https://github.com/equinor/ert/issues/8116). While we might not be able to do this in the suggester for performance reasons, there is still a lot of room for accurate error messages when something goes wrong while submitting jobs.

Also, if we close issues due to technical reasons or implementation difficulty, we should document why in the issue. That makes it a lot easier to re-assess the issue in the future if assumptions change.