equinor / ert

ERT - Ensemble based Reservoir Tool - is designed for running ensembles of dynamical models such as reservoir models, in order to do sensitivity analysis and data assimilation. ERT supports data assimilation using the Ensemble Smoother (ES), Ensemble Smoother with Multiple Data Assimilation (ES-MDA) and Iterative Ensemble Smoother (IES).
https://ert.readthedocs.io/en/latest/
GNU General Public License v3.0

Verify that the selected queue type can be used #8116

Open lars-petter-hauge opened 3 weeks ago

lars-petter-hauge commented 3 weeks ago

Describe the bug

ert prints a long, unhelpful traceback when the cluster runner is misconfigured. It would be nice if ert could check that the selected driver can be used before trying to submit jobs.
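
A minimal sketch of what such a pre-flight check might look like. This is not ert's API: `check_driver_usable` and its `probe_cmd` parameter are hypothetical names, and the idea is simply to run one cheap queue command up front and turn a failure into a single short message instead of one traceback per realisation.

```python
import subprocess
from typing import List, Optional


def check_driver_usable(probe_cmd: List[str], timeout: float = 10.0) -> Optional[str]:
    """Run a cheap probe command; return None on success, else a short message.

    Hypothetical helper: which probe command suits a given queue system is
    left to the caller and is an assumption, not something ert defines.
    """
    try:
        result = subprocess.run(
            probe_cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # Python 3.8 compatible spelling of text=True
            timeout=timeout,
        )
    except FileNotFoundError:
        return f"Queue command not found: {probe_cmd[0]}"
    except subprocess.TimeoutExpired:
        return f"Queue command timed out after {timeout} s: {' '.join(probe_cmd)}"

    if result.returncode != 0:
        # Collapse multi-line stderr into one line for a compact message.
        detail = " ".join(result.stderr.split()) or "<empty>"
        return f"Queue system is not usable (exit code {result.returncode}): {detail}"
    return None
```

For the OpenPBS driver the probe could presumably be a server status query such as `["/opt/pbs/bin/qstat", "-B"]`, but that choice is an assumption and would need checking against the cluster setup. If the function returns a message, the experiment could be stopped before any realisation is submitted.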

To reproduce

Steps to reproduce the behaviour:

  1. Connect to an Equinor Azure node
  2. ert gui my_config.ert
  3. Run experiment (IES/Smoother/ESMDA/Test)

Expected behaviour

A better error message

Screenshots

The following is printed in the terminal once for every qsub call we make (so at least once for each realisation):

Command "/opt/pbs/bin/qsub -rn -Nstress.ert-1 -q short -o /dev/null -e /dev/null -l select=1:ncpus=1" failed with exit code 160, output: "<empty>", and error: "Unknown Host.
qsub: cannot connect to server Please (errno=15008)"
Exception in scheduler task job-1_task: Command "/opt/pbs/bin/qsub -rn -Nstress.ert-1 -q short -o /dev/null -e /dev/null -l select=1:ncpus=1" failed with exit code 160, output: "<empty>", and error: "Unknown Host.
qsub: cannot connect to server Please (errno=15008)"
Traceback: Traceback (most recent call last):
  File "/usr/lib64/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/prog/komodo/2024.06.rc1-py38-rhel8/root/lib64/python3.8/site-packages/_ert/async_utils.py", line 53, in _done_callback
    raise exc
  File "/prog/komodo/2024.06.rc1-py38-rhel8/root/lib64/python3.8/site-packages/ert/scheduler/job.py", line 131, in run
    await self._submit_and_run_once(sem)
  File "/prog/komodo/2024.06.rc1-py38-rhel8/root/lib64/python3.8/site-packages/ert/scheduler/job.py", line 99, in _submit_and_run_once
    await self.driver.submit(
  File "/prog/komodo/2024.06.rc1-py38-rhel8/root/lib64/python3.8/site-packages/ert/scheduler/openpbs_driver.py", line 214, in submit
    raise RuntimeError(process_message)
RuntimeError: Command "/opt/pbs/bin/qsub -rn -Nstress.ert-1 -q short -o /dev/null -e /dev/null -l select=1:ncpus=1" failed with exit code 160, output: "<empty>", and error: "Unknown Host.
qsub: cannot connect to server Please (errno=15008)"

Environment

Additional context

The reason is that the default PBS server cannot be used; the user is expected to set the server themselves:

$ cat /etc/pbs.conf
PBS_EXEC=/opt/pbs
PBS_SERVER=Please set SERVER_NAME in your environment
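
A rough sketch of how this particular misconfiguration could be caught before submitting. All names here (`parse_pbs_conf`, `pbs_server_problem`) are hypothetical, and the sketch assumes that a `PBS_SERVER` environment variable takes precedence over the value in `/etc/pbs.conf`; that precedence should be verified for the installation in question.

```python
import os
import socket
from typing import Callable, Dict, Mapping, Optional


def parse_pbs_conf(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines from pbs.conf content, skipping blanks and comments."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def pbs_server_problem(
    conf_text: str,
    environ: Optional[Mapping[str, str]] = None,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Optional[str]:
    """Return a short message if the configured PBS server looks unusable, else None."""
    environ = os.environ if environ is None else environ
    # Assumption: the environment variable overrides the config file value.
    server = environ.get("PBS_SERVER") or parse_pbs_conf(conf_text).get("PBS_SERVER")
    if not server:
        return "No PBS server is configured; set PBS_SERVER in your environment."
    host = server.split(":")[0]  # drop an optional ":port" suffix
    try:
        resolve(host)
    except (OSError, UnicodeError):
        # socket.gaierror is an OSError; UnicodeError covers names that
        # cannot be encoded as a hostname at all.
        return (
            f"PBS server {server!r} cannot be resolved; "
            "set PBS_SERVER in your environment."
        )
    return None
```

With the `pbs.conf` shown above and no `PBS_SERVER` in the environment, the placeholder text is treated as the server name, fails to resolve, and the function returns a single message naming the bad value. The `resolve` parameter exists only so the lookup can be replaced in tests.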

larsevj commented 2 weeks ago

Relates somewhat to #7112