equinor / ert

ERT - Ensemble based Reservoir Tool - is designed for running ensembles of dynamical models such as reservoir models, in order to do sensitivity analysis and data assimilation. ERT supports data assimilation using the Ensemble Smoother (ES), Ensemble Smoother with Multiple Data Assimilation (ES-MDA) and Iterative Ensemble Smoother (IES).
https://ert.readthedocs.io/en/latest/
GNU General Public License v3.0

Terminate Experiment often causes crash due to missing netCDF file #6564

Closed larsevj closed 1 month ago

larsevj commented 1 year ago

Describe the bug

When running an ES-MDA experiment on either the Poly example or Snake Oil, pressing Terminate Experiment just before an iteration completes causes the GUI to freeze, and ERT continues running the entire experiment to the end in the background. With the Poly example, if you press Terminate Experiment after some, but not all, realizations have finished, ERT seems to continue with another iteration in the background and eventually stops with a KeyError printed to the console.

To reproduce

Steps to reproduce the behaviour:

  1. pip install ert
  2. ert gui snake_oil.ert
  3. Run an ES-MDA experiment
  4. Press Terminate Experiment when all jobs are finished but the border is still yellow.

Expected behaviour

ERT should stop all processes and not continue with new iterations.
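
To make the expected behaviour concrete, here is a minimal sketch of cooperative cancellation, assuming a loop that checks a `threading.Event` before starting each new iteration. The names `run_iterations`, `cancel_event`, and `terminate_during_second` are hypothetical and are not part of ERT's run models; this only illustrates the idea that a terminate request should prevent any further iteration from starting.

```python
import threading


def run_iterations(num_iterations, cancel_event, on_iteration=None):
    """Run up to num_iterations, stopping once cancel_event is set.

    Hypothetical stand-in for an experiment loop; not ERT's run model.
    Returns the list of iterations that were started and finished.
    """
    completed = []
    for iteration in range(num_iterations):
        # Check before starting new work, so a terminate request
        # never triggers a further iteration.
        if cancel_event.is_set():
            break
        if on_iteration is not None:
            on_iteration(iteration)
        completed.append(iteration)
    return completed


cancel = threading.Event()


def terminate_during_second(iteration):
    # Simulate the user pressing "Terminate Experiment" during iteration 1.
    if iteration == 1:
        cancel.set()


completed = run_iterations(4, cancel, on_iteration=terminate_during_second)
print(completed)
```

In this sketch the iteration that is already in progress finishes, but iterations 2 and 3 are never started. A real implementation would also need to stop the running realizations, which is not shown here.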

Screenshots

Image

Environment

andreas-el commented 12 months ago

Tested on 2023.11.rc6

Ran ES-MDA and terminated early (when the first job was complete, but most were not).

Adding trace for completeness.

(2023.11.rc6-py38) [andrli@st-linrgs306 ert]$ ert gui test-data/poly_example/poly.ert 
Exception in thread ert_gui_simulation_thread:
Traceback (most recent call last):
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/file_manager.py", line 209, in _acquire_with_cache_info
    file = self._cache[self._key]
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/lru_cache.py", line 55, in __getitem__
    value = self._cache[key]
KeyError: [<function _open_scipy_netcdf at 0x7fa5dc60f790>, ('/private/andrli/project/ert/test-data/poly_example/storage/ensembles/ddb4ee97-d968-4284-b517-82abf188047a/realization-54/COEFFS.nc',), 'r', (('mmap', None), ('version', 2)), 'f1045a87-8599-4285-9c93-ec6d23547cb3']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/storage/local_ensemble.py", line 169, in _load_single_dataset
    return xr.open_dataset(
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/api.py", line 541, in open_dataset
    backend_ds = backend.open_dataset(
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/scipy_.py", line 307, in open_dataset
    ds = store_entrypoint.open_dataset(
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/store.py", line 32, in open_dataset
    vars, attrs = store.load()
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/common.py", line 128, in load
    (_decode_variable_name(k), v) for k, v in self.get_variables().items()
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/scipy_.py", line 174, in get_variables
    (k, self.open_store_variable(k, v)) for k, v in self.ds.variables.items()
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/scipy_.py", line 163, in ds
    return self._manager.acquire()
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/file_manager.py", line 191, in acquire
    file, _ = self._acquire_with_cache_info(needs_lock)
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/file_manager.py", line 215, in _acquire_with_cache_info
    file = self._opener(*self._args, **kwargs)
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/scipy_.py", line 102, in _open_scipy_netcdf
    return scipy.io.netcdf_file(filename, mode=mode, mmap=mmap, version=version)
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/scipy/io/_netcdf.py", line 246, in __init__
    self.fp = open(self.filename, '%sb' % omode)
FileNotFoundError: [Errno 2] No such file or directory: '/private/andrli/project/ert/test-data/poly_example/storage/ensembles/ddb4ee97-d968-4284-b517-82abf188047a/realization-54/COEFFS.nc'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/gui/simulation/run_dialog.py", line 273, in run
    self._run_model.startSimulations(
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/run_models/base_run_model.py", line 235, in startSimulations
    run_context = self.run_experiment(
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/run_models/multiple_data_assimilation.py", line 180, in run_experiment
    self._evaluate_and_postprocess(posterior_context, evaluator_server_config)
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/run_models/base_run_model.py", line 474, in _evaluate_and_postprocess
    create_run_path(run_context, self.substitution_list, self.ert_config)
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/enkf_main.py", line 260, in create_run_path
    _generate_parameter_files(
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/enkf_main.py", line 126, in _generate_parameter_files
    export_values = node.write_to_runpath(Path(run_path), iens, fs)
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/config/gen_kw_config.py", line 229, in write_to_runpath
    array = ensemble.load_parameters(self.name, real_nr, var="transformed_values")
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/storage/local_ensemble.py", line 208, in load_parameters
    return self._load_dataset(group, realizations)[var]
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/storage/local_ensemble.py", line 184, in _load_dataset
    return self._load_single_dataset(group, realizations).isel(
  File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/storage/local_ensemble.py", line 174, in _load_single_dataset
    raise KeyError(
KeyError: "No dataset 'COEFFS' in storage for realization 54"
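
A note on reading the trace: the line "The above exception was the direct cause of the following exception" is what Python prints when an exception is raised with `raise ... from`. The sketch below reproduces that pattern in isolation, assuming a loader that turns a missing file into a KeyError. The function `load_parameter_file` and its directory layout are hypothetical and only mimic the paths in the trace; this is not the code in `local_ensemble.py`.

```python
import tempfile
from pathlib import Path


def load_parameter_file(storage_dir, group, realization):
    """Hypothetical loader: read <storage_dir>/realization-<n>/<group>.nc."""
    path = Path(storage_dir) / f"realization-{realization}" / f"{group}.nc"
    try:
        return path.read_bytes()
    except FileNotFoundError as e:
        # "raise ... from e" sets __cause__, which produces the
        # "direct cause" line when the traceback is printed.
        raise KeyError(
            f"No dataset '{group}' in storage for realization {realization}"
        ) from e


# An empty temporary directory stands in for a storage folder where
# the parameter file for realization 54 was never written.
with tempfile.TemporaryDirectory() as storage_dir:
    try:
        load_parameter_file(storage_dir, "COEFFS", 54)
    except KeyError as err:
        caught = err

print(type(caught.__cause__).__name__)
print(caught.args[0])
```

The KeyError at the bottom of the trace is therefore a wrapper, and the underlying problem is the FileNotFoundError for `COEFFS.nc` in the `realization-54` directory.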

dafeda commented 11 months ago

Trying to reproduce locally. I first set MAX_RUNNING to 1 (QUEUE_OPTION LOCAL MAX_RUNNING 1), let ERT run for two iterations, and then hit Terminate Experiment. This seems to work:

Image

I then set MAX_RUNNING to 2 and did the same, which also seems to work:

Image

Do you expect this to fail, @larsevj?

larsevj commented 11 months ago

I still see the netCDF4 warning printed to the terminal, and it still seems to continue to the next update step.

Image

This was with MAX_RUNNING 5, running poly.ert.

dafeda commented 11 months ago

I am able to reproduce this using poly.ert, but not using snake_oil.ert.

oyvindeide commented 11 months ago

Marking this as blocked because we might want to refactor the tracker and the propagation of messages. There is also a lot of refactoring of the queue in progress, which might impact this. Let's revisit in a while and see if we can reproduce it; if not, just close it.

sondreso commented 1 month ago

We should check the logs for this error. If there are none, this issue can be closed since we have not been able to reproduce it in more than 6 months.

oyvindeide commented 1 month ago

No sign of this in the logs, so closing this issue.