Unhandled oserrors in scheduler from `create_subprocess_exec`

eivindjahren commented 1 month ago

Unexpected exception in ensemble:
  File "/prog/res.../ert/ensemble_evaluator/_ensemble.py", line 270, in _evaluate_inner
    result = await self._scheduler.execute(min_required_realizations)
  File "/prog/res.../ert/scheduler/scheduler.py", line 288, in execute
    await self._monitor_and_handle_tasks(scheduling_tasks)
  File "/prog/res.../ert/scheduler/scheduler.py", line 240, in _monitor_and_handle_tasks
    raise task_exception
  File "/prog/res.../ert/scheduler/lsf_driver.py", line 414, in poll
    process = await asyncio.create_subprocess_exec(
  File "/usr/lib64/python3.8/asyncio/subprocess.py", line 236, in create_subprocess_exec
    transport, protocol = await loop.subprocess_exec(
  File "/usr/lib64/python3.8/asyncio/base_events.py", line 1630, in subprocess_exec
    transport = await self._make_subprocess_transport(
  File "/usr/lib64/python3.8/asyncio/unix_events.py", line 197, in _make_subprocess_transport
    transp = _UnixSubprocessTransport(self, protocol, args, shell,
  File "/usr/lib64/python3.8/asyncio/base_subprocess.py", line 36, in __init__
    self._start(args=args, shell=shell, stdin=stdin, stdout=stdout,
  File "/usr/lib64/python3.8/asyncio/unix_events.py", line 789, in _start
    self._proc = subprocess.Popen(
  File "/usr/lib64/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib64/python3.8/subprocess.py", line 1655, in _execute_child
    self.pid = _posixsubprocess.fork_exec(
OSError: [Errno 12] Cannot allocate memory

Currently this results in an ENSEMBLE_FAILED, but potentially we want some other behavior here.

JHolba commented 1 month ago

I disagree with only failing one realization. If we are hitting out-of-memory errors, then all bets are off. We could maybe try to shut down in a nicer way, but even that can be difficult unless you preallocate everything you need for the shutdown.

berland commented 1 month ago

Since this out-of-memory error happens during lsf_driver.poll() (which runs every 2 seconds), I think it makes sense to keep calm and carry on: ignore the OSError and either let the entire node go down, or hope the out-of-memory situation resolves by other means (OOM-killer to the rescue, and bet it kills something other than Ert). The polling will recover on subsequent polls.

jonathan-eq commented 2 weeks ago

> Since this out-of-memory error happens during lsf_driver.poll() (which runs every 2 seconds), I think it makes sense to keep calm and carry on: ignore the OSError and either let the entire node go down, or hope the out-of-memory situation resolves by other means (OOM-killer to the rescue, and bet it kills something other than Ert). The polling will recover on subsequent polls.

Is this for all OSErrors from asyncio.create_subprocess_exec, or just the ones containing Cannot allocate memory?

berland commented 2 weeks ago

> Is this for all OSErrors from asyncio.create_subprocess_exec, or just the ones containing Cannot allocate memory?

Only for an out-of-memory situation happening inside poll().
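
A minimal sketch of that narrower handling, not the actual lsf_driver code: it swallows only `OSError` with `errno.ENOMEM` raised while spawning the polling subprocess and re-raises everything else. The function name `run_poll_command` and the logger are hypothetical; checking `errno` rather than matching the message text is an assumption about how the "Cannot allocate memory" case would be identified.

```python
import asyncio
import errno
import logging
from typing import Optional, Sequence

logger = logging.getLogger(__name__)


async def run_poll_command(cmd: Sequence[str]) -> Optional[bytes]:
    """Run one polling command (hypothetical helper).

    Returns stdout, or None if the subprocess could not be spawned
    because the fork failed with ENOMEM. Any other OSError propagates.
    """
    try:
        process = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
    except OSError as err:
        if err.errno == errno.ENOMEM:
            # Skip this poll; the caller is expected to try again next cycle.
            logger.warning("Poll skipped, could not spawn subprocess: %s", err)
            return None
        raise
    stdout, _ = await process.communicate()
    return stdout
```

A caller would treat `None` as "no new information this cycle" and keep the previously known job states until the next poll succeeds.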

eivindjahren commented 1 week ago

I think we need to investigate more closely what is going on rather than apply a blanket ignore. It seems there are some Python-specific shenanigans going on, see https://stackoverflow.com/questions/1367373/python-subprocess-popen-oserror-errno-12-cannot-allocate-memory. The solution might be to back off our polling, do the subprocess call differently, or otherwise ensure that the subprocess call does not eat up too much memory.
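
To make the "back off our polling" option concrete, here is a rough sketch of an exponential backoff for the poll period. The class `PollBackoff`, the growth factor, and the 60-second cap are all hypothetical; only the 2-second base period comes from this thread.

```python
from dataclasses import dataclass, field


@dataclass
class PollBackoff:
    """Tracks consecutive poll failures and derives the next sleep time."""

    base: float = 2.0      # normal poll period in seconds (from the thread)
    factor: float = 2.0    # hypothetical growth factor per failure
    maximum: float = 60.0  # hypothetical upper bound on the delay
    _failures: int = field(default=0, init=False, repr=False)

    def record_success(self) -> None:
        # A successful poll resets the delay to the base period.
        self._failures = 0

    def record_failure(self) -> None:
        # Each consecutive failure lengthens the next delay.
        self._failures += 1

    @property
    def delay(self) -> float:
        return min(self.base * self.factor**self._failures, self.maximum)
```

The poll loop would call `record_failure()` when spawning the subprocess fails, `record_success()` otherwise, and sleep for `delay` seconds between polls.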

In any case, I think we need to try to reproduce a potential memory leak or usage spike in ert due to polling. Note that this cannot be observed with the local queue, as it doesn't use asyncio.create_subprocess_exec. It could be that by simply running a long ert experiment on LSF, you will see rising memory usage when bjobs takes a long time to complete.
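
As a starting point for such a reproduction outside ert, something like the following could sample the parent process's peak memory while it repeatedly spawns a subprocess. This is only a sketch: `sample_peak_rss` is a hypothetical helper, the command is a stand-in for bjobs, and it polls sequentially, so it would not capture overlapping slow polls. It relies on the Unix-only `resource` module; `ru_maxrss` is reported in kilobytes on Linux.

```python
import asyncio
import resource
import sys
from typing import List, Sequence


async def sample_peak_rss(cmd: Sequence[str], polls: int) -> List[int]:
    """Spawn cmd `polls` times and record the parent's peak RSS after each run."""
    samples = []
    for _ in range(polls):
        process = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        await process.communicate()
        # ru_maxrss is a high-water mark for this process, so it never decreases.
        samples.append(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    return samples


if __name__ == "__main__":
    # Stand-in command; replace with the real polling command on a cluster.
    peaks = asyncio.run(sample_peak_rss([sys.executable, "-c", "pass"], polls=20))
    print(peaks[0], peaks[-1])
```

A steadily rising peak over many polls would point at growth in the polling path, while a flat series would suggest the memory pressure comes from elsewhere on the node.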

equinor / ert

Unhandled oserrors in scheduler from `create_subprocess_exec` #8976