Closed larsevj closed 1 month ago
Tested on 2023.11.rc6
Running ES-MDA and terminating early (when the first job is complete, but most are not).
Adding trace for completeness.
(2023.11.rc6-py38) [andrli@st-linrgs306 ert]$ ert gui test-data/poly_example/poly.ert
Exception in thread ert_gui_simulation_thread:
Traceback (most recent call last):
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/file_manager.py", line 209, in _acquire_with_cache_info
file = self._cache[self._key]
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/lru_cache.py", line 55, in __getitem__
value = self._cache[key]
KeyError: [<function _open_scipy_netcdf at 0x7fa5dc60f790>, ('/private/andrli/project/ert/test-data/poly_example/storage/ensembles/ddb4ee97-d968-4284-b517-82abf188047a/realization-54/COEFFS.nc',), 'r', (('mmap', None), ('version', 2)), 'f1045a87-8599-4285-9c93-ec6d23547cb3']
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/storage/local_ensemble.py", line 169, in _load_single_dataset
return xr.open_dataset(
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/api.py", line 541, in open_dataset
backend_ds = backend.open_dataset(
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/scipy_.py", line 307, in open_dataset
ds = store_entrypoint.open_dataset(
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/store.py", line 32, in open_dataset
vars, attrs = store.load()
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/common.py", line 128, in load
(_decode_variable_name(k), v) for k, v in self.get_variables().items()
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/scipy_.py", line 174, in get_variables
(k, self.open_store_variable(k, v)) for k, v in self.ds.variables.items()
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/scipy_.py", line 163, in ds
return self._manager.acquire()
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/file_manager.py", line 191, in acquire
file, _ = self._acquire_with_cache_info(needs_lock)
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/file_manager.py", line 215, in _acquire_with_cache_info
file = self._opener(*self._args, **kwargs)
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib/python3.8/site-packages/xarray/backends/scipy_.py", line 102, in _open_scipy_netcdf
return scipy.io.netcdf_file(filename, mode=mode, mmap=mmap, version=version)
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/scipy/io/_netcdf.py", line 246, in __init__
self.fp = open(self.filename, '%sb' % omode)
FileNotFoundError: [Errno 2] No such file or directory: '/private/andrli/project/ert/test-data/poly_example/storage/ensembles/ddb4ee97-d968-4284-b517-82abf188047a/realization-54/COEFFS.nc'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/rh/rh-python38/root/usr/lib64/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/opt/rh/rh-python38/root/usr/lib64/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/gui/simulation/run_dialog.py", line 273, in run
self._run_model.startSimulations(
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/run_models/base_run_model.py", line 235, in startSimulations
run_context = self.run_experiment(
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/run_models/multiple_data_assimilation.py", line 180, in run_experiment
self._evaluate_and_postprocess(posterior_context, evaluator_server_config)
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/run_models/base_run_model.py", line 474, in _evaluate_and_postprocess
create_run_path(run_context, self.substitution_list, self.ert_config)
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/enkf_main.py", line 260, in create_run_path
_generate_parameter_files(
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/enkf_main.py", line 126, in _generate_parameter_files
export_values = node.write_to_runpath(Path(run_path), iens, fs)
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/config/gen_kw_config.py", line 229, in write_to_runpath
array = ensemble.load_parameters(self.name, real_nr, var="transformed_values")
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/storage/local_ensemble.py", line 208, in load_parameters
return self._load_dataset(group, realizations)[var]
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/storage/local_ensemble.py", line 184, in _load_dataset
return self._load_single_dataset(group, realizations).isel(
File "/prog/res/komodo/2023.11.rc6-py38-rhel7/root/lib64/python3.8/site-packages/ert/storage/local_ensemble.py", line 174, in _load_single_dataset
raise KeyError(
KeyError: "No dataset 'COEFFS' in storage for realization 54"
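The chained traceback above ("The above exception was the direct cause of the following exception") is consistent with a loader that catches the low-level FileNotFoundError and re-raises it as a KeyError. Below is a minimal sketch of that pattern, not ERT's actual code: `load_single_dataset` is a hypothetical helper, and plain file I/O stands in for the `xr.open_dataset` call seen in the trace.

```python
import tempfile
from pathlib import Path


def load_single_dataset(storage: Path, group: str, realization: int) -> bytes:
    """Hypothetical stand-in for a per-realization parameter loader."""
    path = storage / f"realization-{realization}" / f"{group}.nc"
    try:
        # The trace shows xr.open_dataset here; reading bytes keeps the sketch self-contained.
        return path.read_bytes()
    except FileNotFoundError as err:
        # "raise ... from err" is what produces the "direct cause" line in a traceback.
        raise KeyError(
            f"No dataset '{group}' in storage for realization {realization}"
        ) from err


# An empty storage directory mimics a realization that never wrote its parameters.
with tempfile.TemporaryDirectory() as tmp:
    try:
        load_single_dataset(Path(tmp), "COEFFS", 54)
    except KeyError as err:
        message = err.args[0]
        cause = err.__cause__
```

Read this way, the KeyError is only a symptom: the file for realization 54 was never written, presumably because that realization had not finished when the next iteration tried to read its parameters.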
Trying to reproduce locally.
I first set MAX_RUNNING to 1 (QUEUE_OPTION LOCAL MAX_RUNNING 1), let ert run for two iterations and then hit Terminate Experiment. This seems to work:
I then set MAX_RUNNING to 2 and did the same, which also seems to work:
Do you expect this to fail, @larsevj?
I still see the netCDF4 warning printed to the terminal, and it still seems to continue to the next update step.
This was with MAX_RUNNING 5, running poly.ert.
I am able to reproduce this using poly.ert, but not using snake_oil.ert.
Marking this as blocked because we might want to refactor the tracker and the propagation of messages. There is also a lot of refactoring in the queue which might impact this. Let's revisit and see if we can reproduce it in a while; if not, just close it.
We should check the logs for this error. If there are none, this issue can be closed since we have not been able to reproduce it in more than 6 months.
No sign of this in the logs, so closing this issue.
Describe the bug
When running an ES-MDA experiment on either the Poly example or Snake oil, pressing Terminate Experiment just before an iteration is about to complete causes the GUI to freeze, and ERT continues to run the entire experiment to the end in the background.
When running the Poly example, if you press Terminate Experiment after some, but not all, realizations have finished, ERT seems to continue with another iteration in the background and eventually stops with a KeyError printed to the console.
To reproduce
Steps to reproduce the behaviour:
pip install ert
ert gui snake_oil.ert
Expected behaviour
Expect ERT to stop all processes and not continue with new iterations.
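The expected behaviour can be illustrated with a cooperative-cancellation sketch. This is an assumption about one possible approach, not how ERT's run models are implemented: `run_iterations`, `run_realization` and the `cancel` event are all hypothetical names. The point is that the cancel flag is checked before each realization, so neither the update step nor the next iteration starts after termination is requested.

```python
import threading


def run_iterations(num_iterations, realizations, run_realization, cancel):
    """Hypothetical iteration loop; returns the iterations that ran to completion."""
    completed = []
    for iteration in range(num_iterations):
        for realization in realizations:
            if cancel.is_set():
                # Stop here: no update step and no further iterations.
                return completed
            run_realization(iteration, realization)
        # Only reached when every realization in this iteration has run.
        completed.append(iteration)
    return completed


cancel = threading.Event()
started = []


def run_realization(iteration, realization):
    """Placeholder forward model that records what was started."""
    started.append((iteration, realization))
    if (iteration, realization) == (1, 2):
        # Simulates the user pressing Terminate Experiment mid-iteration.
        cancel.set()


completed = run_iterations(4, range(5), run_realization, cancel)
```

In this sketch only iteration 0 is reported as completed, and nothing is started after realization 2 of iteration 1, which is the behaviour the issue asks for.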
Screenshots
Environment