equinor / ert

ERT - Ensemble based Reservoir Tool - is designed for running ensembles of dynamical models such as reservoir models, in order to do sensitivity analysis and data assimilation. ERT supports data assimilation using the Ensemble Smoother (ES), Ensemble Smoother with Multiple Data Assimilation (ES-MDA) and Iterative Ensemble Smoother (IES).
https://ert.readthedocs.io/en/latest/
GNU General Public License v3.0

Terminating jobs fails using jobqueue #8101

Closed (andreas-el closed this issue 2 months ago)

andreas-el commented 3 months ago

Using 2024.06.02 with scheduler set to False on RGS.

Also reproducible on the 2024.02, 2024.04 and 2024.05 releases, although these do not show the error dialog.

QUEUE_SYSTEM LSF

Running the poly-example Ensemble Experiment and terminating the jobs halfway through.

Screenshot 2024-06-07 at 09 42 39

Job <651004>: Job has already finished
Job <651007>: Job has already finished
Job <651009>: Job has already finished
Job <651011>: Job has already finished
Job <651012>: Job has already finished
Job <651014>: Job has already finished
Job <651013>: Job has already finished
Job <651015>: Job has already finished
Job <651017>: Job has already finished
Job <651018>: Job has already finished
Job <651020>: Job has already finished
Job <651019>: Job has already finished
Job <651021>: Job has already finished
Job <651022>: Job has already finished
Job <651023>: Job has already finished
Job <651024>: Job has already finished
Job <651025>: Job has already finished
Job <651026>: Job has already finished
Job <651027>: Job has already finished
Job <651028>: Job has already finished
Job <651030>: Job has already finished
Job <651031>: Job has already finished
Job <651033>: Job has already finished
Job <651036>: Job has already finished
Job <651039>: Job has already finished
Job <651040>: Job has already finished
Job <651041>: Job has already finished
Job <651044>: Job has already finished
Job <651043>: Job has already finished
Job <651046>: Job has already finished
Job <651047>: Job has already finished
Job <651050>: Job has already finished
Job <651051>: Job has already finished
Job <651052>: Job has already finished
Job <651054>: Job has already finished
Job <651053>: Job has already finished
Job <651055>: Job has already finished
Job <651056>: Job has already finished
Job <651057>: Job has already finished
Job <651058>: Job has already finished
Job <651059>: Job has already finished
Job <651060>: Job has already finished
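
The lines above look like the per-job messages that LSF's `bkill` prints when asked to kill a job it considers finished. As a minimal sketch for triaging such output (the function name `parse_kill_messages` and the regular expression are illustrative assumptions, not part of ERT), the job ids can be extracted and grouped by message like this:

```python
import re

# Matches lines such as "Job <651004>: Job has already finished".
# The pattern is an assumption based on the output pasted in this issue.
KILL_MSG = re.compile(r"^Job <(\d+)>: (.+)$")


def parse_kill_messages(text: str) -> dict[int, str]:
    """Map each job id to the message reported for it (hypothetical helper)."""
    results: dict[int, str] = {}
    for line in text.splitlines():
        match = KILL_MSG.match(line.strip())
        if match:
            results[int(match.group(1))] = match.group(2)
    return results


sample = """\
Job <651004>: Job has already finished
Job <651007>: Job has already finished
Job <651009>: Job has already finished
"""
messages = parse_kill_messages(sample)
already_finished = sorted(
    job_id for job_id, msg in messages.items() if msg == "Job has already finished"
)
print(already_finished)  # [651004, 651007, 651009]
```

Comparing these ids with the jobs that are still listed by `bjobs` would show whether the kill requests targeted jobs that had completed rather than the ones still running.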

andreas-el commented 3 months ago

The jobqueue does not terminate the jobs (seemingly) in earlier versions of komodo either.

2024.05.04: UI blocked, jobs continue to run, stdout contains 'Job has already finished'
2024.05.00: UI blocked, jobs continue to run, stdout contains 'Job has already finished'
2024.04.10: UI blocked, jobs continue to run, stdout contains 'Job has already finished'
⚠️ 2024.03: not released
2024.02.12: UI blocked, jobs continue to run, stdout contains 'Job has already finished'

What differs is that the "Experiment failed" dialog does not appear in these versions.

I terminated the run when roughly 10 jobs were done, but jobs still keep running:

[andrli@st-linrgs176 ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
656396  andrli  RUN   mr         st-vgrid02  st-rsv17-17 *ly.ert-83 Jun  7 12:34
656401  andrli  RUN   mr         st-vgrid02  st-rsv17-17 *ly.ert-88 Jun  7 12:34
656402  andrli  RUN   mr         st-vgrid02  st-rsv17-17 *ly.ert-89 Jun  7 12:34
656403  andrli  RUN   mr         st-vgrid02  st-rsv17-17 *ly.ert-90 Jun  7 12:34
656404  andrli  PEND  mr         st-vgrid02              *ly.ert-91 Jun  7 12:34
656405  andrli  PEND  mr         st-vgrid02              *ly.ert-92 Jun  7 12:34
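
To check quickly how many jobs survived a termination request, the `bjobs` listing above can be tallied by its STAT column. This is a minimal sketch that assumes the default `bjobs` column layout shown here; `count_job_states` is a hypothetical helper, not an ERT or LSF function.

```python
from collections import Counter


def count_job_states(bjobs_output: str) -> Counter:
    """Count jobs per STAT value in default-format `bjobs` output."""
    states: Counter = Counter()
    for line in bjobs_output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything that is not a job row.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        # JOBID, USER and STAT are the first three columns; later columns
        # such as EXEC_HOST may be empty, so only the first three are indexed.
        states[fields[2]] += 1
    return states


sample = """\
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
656396  andrli  RUN   mr         st-vgrid02  st-rsv17-17 *ly.ert-83 Jun  7 12:34
656401  andrli  RUN   mr         st-vgrid02  st-rsv17-17 *ly.ert-88 Jun  7 12:34
656402  andrli  RUN   mr         st-vgrid02  st-rsv17-17 *ly.ert-89 Jun  7 12:34
656403  andrli  RUN   mr         st-vgrid02  st-rsv17-17 *ly.ert-90 Jun  7 12:34
656404  andrli  PEND  mr         st-vgrid02              *ly.ert-91 Jun  7 12:34
656405  andrli  PEND  mr         st-vgrid02              *ly.ert-92 Jun  7 12:34
"""
states = count_job_states(sample)
still_alive = states["RUN"] + states["PEND"]
print(dict(states), still_alive)  # {'RUN': 4, 'PEND': 2} 6
```

A non-zero count of RUN or PEND jobs after the experiment has been terminated is the symptom reported in this issue.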

andreas-el commented 3 months ago

This might be tied to the MAX_RUNNING keyword. Testing was performed using the default value 2 (?), whereas we expect most users to set a rather high value, which results in all jobs being submitted immediately.

andreas-el commented 3 months ago

We should also keep in mind that LSF10 was introduced not long ago.

xjules commented 2 months ago

Let's see if the same behaviour occurs with the scheduler. Update: it does not, i.e. the jobs are terminated correctly.

We can keep this open until the scheduler is enabled by default.

sondreso commented 2 months ago

This has been a problem in the jobqueue for a long time, and given that it is fixed in the scheduler, we will not prioritize fixing this.