equinor / ert

ERT - Ensemble based Reservoir Tool - is designed for running ensembles of dynamical models such as reservoir models, in order to do sensitivity analysis and data assimilation. ERT supports data assimilation using the Ensemble Smoother (ES), Ensemble Smoother with Multiple Data Assimilation (ES-MDA) and Iterative Ensemble Smoother (IES).
https://ert.readthedocs.io/en/latest/
GNU General Public License v3.0

Local driver will not handle an exception from running jobs #8220

Closed berland closed 3 months ago

berland commented 3 months ago

If an exception (other than async cancellation) occurs while the realization/job is running through subprocess, the LocalDriver will send a FinishedEvent saying that the return code is zero. Ert will then hang, and there is no trace of the exception to be found in the logs.

This has not occurred in real life so far, but it might. For now, the behaviour can be triggered with the following patch:

[havb@be-lx139213:/data/projects/ert/test-data/poly_example] main$ git diff
diff --git a/src/ert/scheduler/local_driver.py b/src/ert/scheduler/local_driver.py
index 8d07d2eed..3c0a6cddc 100644
--- a/src/ert/scheduler/local_driver.py
+++ b/src/ert/scheduler/local_driver.py
@@ -105,7 +105,7 @@ async def _init(iens: int, executable: str, /, *args: str) -> Process:

     @staticmethod
     async def _wait(proc: Process) -> int:
-        return await proc.wait()
+        raise ValueError

     @staticmethod
     async def _kill(proc: Process) -> int:

If ert ensemble_experiment poly.ert is run with this patch applied, ert will run the realizations and eventually claim that 100/100 of the realizations are Finished, and then it will just hang.

Expected behaviour is 100/100 Failed realizations, with the exception visible somewhere.
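
To illustrate the expected behaviour, here is a minimal, self-contained sketch. It does not use the actual LocalDriver code. The FinishedEvent dataclass, run_realization and _wait_raises are hypothetical stand-ins invented for this example, and the fallback return code of 1 is an assumption. The idea is to let cancellation propagate, log any other exception with its traceback, and report a non-zero return code instead of zero:

```python
import asyncio
import logging
from dataclasses import dataclass

logger = logging.getLogger("local_driver_sketch")


@dataclass
class FinishedEvent:
    # Hypothetical stand-in for the driver's event type; the fields are assumptions.
    iens: int
    returncode: int


async def _wait_raises() -> int:
    # Mimics the patched _wait() from the diff above.
    raise ValueError("simulated failure while waiting")


async def run_realization(iens: int, wait, queue: asyncio.Queue) -> None:
    try:
        returncode = await wait()
    except asyncio.CancelledError:
        # Cancellation is handled elsewhere, so let it propagate.
        raise
    except Exception:
        # Log the traceback and report failure instead of a zero return code.
        logger.exception("Realization %d failed while running", iens)
        returncode = 1  # assumed failure code
    await queue.put(FinishedEvent(iens=iens, returncode=returncode))


async def demo() -> FinishedEvent:
    queue: asyncio.Queue = asyncio.Queue()
    await run_realization(0, _wait_raises, queue)
    return queue.get_nowait()


event = asyncio.run(demo())
```

With a handler along these lines, the event emitted for the realization carries a non-zero return code, so it could be marked Failed, and the traceback ends up in the log.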