Open eivindjahren opened 1 month ago
I disagree with only failing one realization. If we are hitting out of memory errors, then all bets are off. We could maybe try to shut down in a nicer way, but even that can be difficult unless you preallocate everything you need for the shutdown.
Since this out-of-memory error happens during lsf_driver.poll() (which runs every 2 seconds), I think it makes sense to keep calm and carry on: ignore the OSError and either let the entire node go down, or hope the out-of-memory situation resolves by other means (OOM-killer to the rescue, betting that it kills something other than Ert). The polling will recover on subsequent polls.
Is this for all OSErrors from asyncio.create_subprocess_exec, or just the ones containing "Cannot allocate memory"?
Only for an out-of-memory situation happening inside poll().
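To make the distinction concrete, here is a minimal sketch of what "only ignore out-of-memory inside poll()" could look like. It is not Ert's actual driver code: `poll_once` and `is_out_of_memory` are hypothetical names, and it assumes the error surfaces as an OSError with errno ENOMEM raised by asyncio.create_subprocess_exec. All other OSErrors are re-raised.

```python
import asyncio
import errno
from typing import Optional, Sequence


def is_out_of_memory(err: OSError) -> bool:
    """True only for ENOMEM ("Cannot allocate memory"), not for other OSErrors."""
    return err.errno == errno.ENOMEM


async def poll_once(cmd: Sequence[str]) -> Optional[str]:
    """Run one poll command (hypothetical helper, not Ert's API).

    Returns the command's stdout, or None if this poll was skipped
    because the subprocess could not be spawned due to ENOMEM.
    """
    try:
        proc = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
    except OSError as err:
        if is_out_of_memory(err):
            # Skip this round; the next poll gets another chance.
            return None
        # Anything else (missing binary, permissions, ...) is still an error.
        raise
    stdout, _ = await proc.communicate()
    return stdout.decode()
```

Checking the errno rather than matching the message text avoids depending on how the error string is worded.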
I think we need to investigate more closely what is going on rather than blanket-ignoring the error. There seem to be some Python-specific shenanigans going on, see https://stackoverflow.com/questions/1367373/python-subprocess-popen-oserror-errno-12-cannot-allocate-memory. The solution might be to back off our polling, do the subprocess call differently, or otherwise ensure that the subprocess call does not eat up too much memory.
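As an illustration of the "back off our polling" option, a rough sketch of an exponential backoff on the poll interval could look like the following. The function name, the growth factor, and the 60-second cap are assumptions for the example; only the 2-second base interval comes from this thread.

```python
def next_poll_interval(
    current: float,
    failed: bool,
    base: float = 2.0,      # normal poll period mentioned in this thread
    factor: float = 2.0,    # assumed growth factor
    maximum: float = 60.0,  # assumed upper bound
) -> float:
    """Return how long to wait before the next poll.

    After a failed poll the interval grows geometrically up to `maximum`;
    after a successful poll it resets to `base`.
    """
    if not failed:
        return base
    return min(current * factor, maximum)
```

With these defaults, repeated failures would stretch the interval from 2 seconds to 4, 8, 16, 32, and then stay at 60 until a poll succeeds.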
I think we in any case need to try to reproduce a potential memory leak or usage spike in ert due to polling. Note that this cannot be observed with the local queue, as it doesn't use asyncio.create_subprocess_exec. It could be that by simply running a long ert experiment on lsf, you will see rising memory usage when bjobs takes a long time to complete.
Currently this results in an ENSEMBLE_FAILED, but potentially we want some other behavior here.