equinor / ert

ERT - Ensemble based Reservoir Tool - is designed for running ensembles of dynamical models such as reservoir models, in order to do sensitivity analysis and data assimilation. ERT supports data assimilation using the Ensemble Smoother (ES), Ensemble Smoother with Multiple Data Assimilation (ES-MDA) and Iterative Ensemble Smoother (IES).
https://ert.readthedocs.io/en/latest/
GNU General Public License v3.0

Unhandled oserrors in scheduler from `create_subprocess_exec` #8976

Open eivindjahren opened 1 month ago

eivindjahren commented 1 month ago
Unexpected exception in ensemble:
File "/prog/res.../ert/ensemble_evaluator/_ensemble.py", line 270, in _evaluate_inner
result = await self._scheduler.execute(min_required_realizations)
Unexpected exception in ensemble:
File "/prog/res.../ert/scheduler/scheduler.py", line 288, in execute
await self._monitor_and_handle_tasks(scheduling_tasks)
Unexpected exception in ensemble:
File "/prog/res.../ert/scheduler/scheduler.py", line 240, in _monitor_and_handle_tasks
raise task_exception
Unexpected exception in ensemble:
File "/prog/res.../ert/scheduler/lsf_driver.py", line 414, in poll
process = await asyncio.create_subprocess_exec(
Unexpected exception in ensemble:
File "/usr/lib64/python3.8/asyncio/subprocess.py", line 236, in create_subprocess_exec
transport, protocol = await loop.subprocess_exec(
Unexpected exception in ensemble:
File "/usr/lib64/python3.8/asyncio/base_events.py", line 1630, in subprocess_exec
transport = await self._make_subprocess_transport(
Unexpected exception in ensemble:
File "/usr/lib64/python3.8/asyncio/unix_events.py", line 197, in _make_subprocess_transport
transp = _UnixSubprocessTransport(self, protocol, args, shell,
Unexpected exception in ensemble:
File "/usr/lib64/python3.8/asyncio/base_subprocess.py", line 36, in __init__
self._start(args=args, shell=shell, stdin=stdin, stdout=stdout,
Unexpected exception in ensemble:
File "/usr/lib64/python3.8/asyncio/unix_events.py", line 789, in _start
self._proc = subprocess.Popen(
Unexpected exception in ensemble:
File "/usr/lib64/python3.8/subprocess.py", line 858, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
Unexpected exception in ensemble:
File "/usr/lib64/python3.8/subprocess.py", line 1655, in _execute_child
self.pid = _posixsubprocess.fork_exec(
Unexpected exception in ensemble:
OSError: [Errno 12] Cannot allocate memory

Currently this results in an ENSEMBLE_FAILED, but potentially we want some other behavior here.

JHolba commented 1 month ago

I disagree with only failing one realization. If we are hitting out of memory errors, then all bets are off. We could maybe try to shut down in a nicer way, but even that can be difficult unless you preallocate everything you need for the shutdown.

berland commented 1 month ago

Since this out-of-memory error happens during lsf_driver.poll() (which runs every 2 seconds), I think it makes sense to keep calm and carry on: ignore the OSError and either let the entire node go down, or hope the out-of-memory situation resolves by other means (OOM-killer to the rescue, betting that it kills something other than Ert). The polling will recover on subsequent polls.

jonathan-eq commented 2 weeks ago

> Since this out-of-memory error happens during lsf_driver.poll() (which runs every 2 seconds), I think it makes sense to keep calm and carry on: ignore the OSError and either let the entire node go down, or hope the out-of-memory situation resolves by other means (OOM-killer to the rescue, betting that it kills something other than Ert). The polling will recover on subsequent polls.

Is this for all OSErrors from asyncio.create_subprocess_exec, or just the ones containing Cannot allocate memory?

berland commented 2 weeks ago

> Is this for all OSErrors from asyncio.create_subprocess_exec, or just the ones containing Cannot allocate memory?

Only for an out-of-memory situation happening inside poll().
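A minimal sketch of what "only for out-of-memory" could look like, assuming the check is done on `errno` rather than on the message text. The function name `poll_once` and the logging setup are hypothetical and not taken from the ert code base; any other OSError is re-raised unchanged.

```python
import asyncio
import errno
import logging
from typing import Optional

logger = logging.getLogger(__name__)


async def poll_once(*cmd: str) -> Optional[bytes]:
    """Run one status command (hypothetical helper, not ert's actual poll()).

    Returns the command's stdout, or None if the poll was skipped because
    the process could not be started due to ENOMEM.
    """
    try:
        process = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
    except OSError as err:
        if err.errno == errno.ENOMEM:
            # Errno 12: skip this round and let the next poll try again.
            logger.warning("Skipping poll, could not start subprocess: %s", err)
            return None
        # Everything else (missing executable, permissions, ...) still propagates.
        raise
    stdout, _ = await process.communicate()
    return stdout
```

Comparing against `errno.ENOMEM` avoids depending on the wording of the error string, which can vary between platforms.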

eivindjahren commented 1 week ago

I think we need to investigate more closely what is going on rather than apply a blanket ignore. There seem to be some Python-specific shenanigans going on (see https://stackoverflow.com/questions/1367373/python-subprocess-popen-oserror-errno-12-cannot-allocate-memory). The solution might be to back off our polling, do the subprocess call differently, or otherwise ensure that the subprocess call does not eat up too much memory.
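To make the "back off our polling" option concrete, here is a rough sketch of exponential backoff around a poll loop. The 2-second base comes from the polling period mentioned earlier in the thread; the 60-second cap, the names `next_poll_interval` and `poll_with_backoff`, and the convention that `poll()` returns `True` on success are all assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Optional

BASE_INTERVAL = 2.0  # seconds; the polling period mentioned in this thread
MAX_INTERVAL = 60.0  # hypothetical cap, chosen only for illustration


def next_poll_interval(current: float, poll_failed: bool) -> float:
    """Double the interval after a failed poll, reset it after a good one."""
    if poll_failed:
        return min(current * 2, MAX_INTERVAL)
    return BASE_INTERVAL


async def poll_with_backoff(
    poll: Callable[[], Awaitable[bool]],
    max_polls: Optional[int] = None,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> None:
    """Call poll() repeatedly, waiting longer between calls while it fails."""
    interval = BASE_INTERVAL
    done = 0
    while max_polls is None or done < max_polls:
        ok = await poll()
        interval = next_poll_interval(interval, poll_failed=not ok)
        await sleep(interval)
        done += 1
```

With this shape, a run of failed polls waits 4, 8, 16, ... seconds up to the cap, and a single successful poll returns to the 2-second period.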

I think we in any case need to try to reproduce a potential memory leak or usage spike in ert due to polling. Note that this cannot be observed with the local queue, as it doesn't use asyncio.create_subprocess_exec. It could be that simply by running a long ert experiment on LSF, you will see rising memory usage when bjobs takes a long time to complete.
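As a starting point for such a reproduction, one could spawn a command repeatedly through `asyncio.create_subprocess_exec` and record the parent's peak resident set size after each round. This is only a local stand-in under stated assumptions: the helper name `sample_peak_rss` is hypothetical, the example command is a trivial Python invocation rather than `bjobs`, and `ru_maxrss` is a high-water mark (kilobytes on Linux), so it can show growth but never a decrease.

```python
import asyncio
import resource
import sys
from typing import List, Sequence


async def sample_peak_rss(cmd: Sequence[str], rounds: int) -> List[int]:
    """Run cmd `rounds` times and return the parent's peak RSS after each run."""
    samples: List[int] = []
    for _ in range(rounds):
        process = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        await process.communicate()
        # Peak RSS of this (parent) process so far; unit is platform dependent.
        samples.append(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    return samples


if __name__ == "__main__":
    # Placeholder command; on an LSF system one would substitute the real call.
    command = [sys.executable, "-c", "pass"]
    print(asyncio.run(sample_peak_rss(command, rounds=20)))
```

A steadily climbing series would hint at memory being retained across polls, while a flat series would point towards a transient spike or an external cause.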