Related reading: https://stackoverflow.com/questions/25172425/create-zombie-process
We should make dummy scripts/jobs that make the scripts fail in many different ways, e.g. allocate too large disk volumes, exhaust memory, write to wrong locations, ....
This relates to job_dispatch execution. We should then check, for instance, `ps aux | grep job_dispatch`.
Adding hints from @kwinkunks about how to fail nicely in Python:
In case we need it, this produces a SIGSEGV:
```python
import ctypes

# Take a pointer to a single char and write past it until the process
# touches memory it does not own and receives SIGSEGV.
i = ctypes.c_char(b'a')
j = ctypes.pointer(i)
c = 0
while True:
    j[c] = b'a'
    c += 1
```
:skull_and_crossbones: This produces a SIGKILL
```python
import sys

# Raise the recursion limit far beyond what the stack can hold, then recurse
# forever until the process is killed.
sys.setrecursionlimit(1 << 30)
f = lambda f: f(f)
f(f)
```
One hypothesis is that killing children belonging to the same process group does not work due to this: https://stackoverflow.com/questions/4789837/how-to-terminate-a-python-subprocess-launched-with-shell-true/4791612#4791612
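A minimal sketch of the approach from the linked answer, using the modern `start_new_session` flag instead of `preexec_fn=os.setsid`: start the child in its own process group and signal the whole group, so that grandchildren spawned via `shell=True` are also terminated. The command below is a made-up stand-in:

```python
import os
import signal
import subprocess

# The child becomes the leader of a new session/process group.
proc = subprocess.Popen(
    "sleep 1000 & sleep 1000",
    shell=True,
    start_new_session=True,
)

# Later: signal the whole process group, not just the immediate shell.
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
```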
According to the docs, it looks like the zombie process is the forward model script (which has completed), while job_dispatch froze in `process.wait` when polling for the exit code. We still need to emulate the zombie processes, though.
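In line with the related reading linked above, a minimal sketch (not ERT code) for emulating a zombie: fork a child that exits immediately while the parent deliberately never calls `wait()`, so the child shows up as `<defunct>` in `ps`:

```python
import os
import time

pid = os.fork()
if pid == 0:
    # Child: exit right away; its exit status now waits to be reaped.
    os._exit(0)
else:
    # Parent: never wait() for the child, leaving it as a zombie.
    print(f"child {pid} should now be a zombie: ps -o stat= -p {pid}")
    time.sleep(60)
```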
The following deliberate bug in `job.py` is at least able to reproduce the symptom:
```diff
diff --git a/src/_ert_job_runner/job.py b/src/_ert_job_runner/job.py
index 719832a6f..9bfc7a509 100644
--- a/src/_ert_job_runner/job.py
+++ b/src/_ert_job_runner/job.py
@@ -107,6 +107,9 @@ class Job:
             yield Running(self, max_memory_usage, memory)
+            while True:
+                time.sleep(1)
+
             try:
                 exit_code = process.wait(timeout=self.MEMORY_POLL_PERIOD)
             except TimeoutExpired:
```
When running the poly case with this patch, one quickly gets:
```
$ ps f -f
berland 609066 433119 4 19:06 pts/4 Sl+ 0:08 \_ /home/berland/venv/newer/bin/python3 /home/berland/venv/newer/bin/ert test_run poly.ert
berland 609110 609066 0 19:06 pts/4 SNl 0:00     \_ /home/berland/venv/newer/bin/python3 /home/berland/venv/newer/bin/job_dispatch.py /home/berland/projects/ert/test-data/poly_example/poly_out/realization-0/iter-0
berland 609113 609110 0 19:06 pts/4 ZN  0:00         \_ [python] <defunct>
```
With the deliberate bug above,

```
$ ert test_run poly.ert &
$ killall ert; killall ert
```

will leave a zombie process. When the mothership ert process is killed (SIGTERM requires two shots, SIGKILL only one), the job_dispatch subprocess is never killed, and job_dispatch's own subprocess remains a zombie. Killing the job_dispatch.py process leaves no zombie, and ERT exits (with failure).
(Using `ensemble_experiment` with the local queue gives the same behaviour as with `test_run`.)
This problem seems reproducible given a /lustre1 disk on Azure, and is probably independent of the queue system in use. If RUNPATH is on /private or similar, you will not be able to do `rm -rf my_runpath`, because `.nfsxxxxx` lock files prevent it, and we will then not end up in the zombie situation.
This scenario might describe the current zombie processes:
```
[havb@s034-a0455cc04:~]$ ps f -fu f_scout_ci | tail -n 3
f_scout+ 22971 20288 0 Aug16 ? S   0:00 \_ /bin/sh /var/spool/pbs/mom_priv/jobs/71507.s034-lcam.SC
f_scout+ 22977 22971 0 Aug16 ? SNl 0:01     \_ /prog/komodo/bleeding-py38-rhel7/root/bin/python /prog/komodo/bleeding-py38-rhel7/root/bin/job_dispatch.py /lustre1/users/f_scout_ci/hm-tutorial-runs/tmp.t9SiMc1ANY/sens_analysis/f_scout_ci/sens_analysis/realization-20/iter-0
f_scout+ 29029 22977 0 Aug16 ? ZN  0:00         \_ [python] <defunct>
```
where we can verify that the mentioned RUNPATH is indeed missing:
```
[f_scout_ci@s034-a0455cc04 sens_analysis]$ ls -ld /lustre1/users/f_scout_ci/hm-tutorial-runs/tmp.t9SiMc1ANY/sens_analysis/f_scout_ci/sens_analysis/realization-20/
ls: cannot access /lustre1/users/f_scout_ci/hm-tutorial-runs/tmp.t9SiMc1ANY/sens_analysis/f_scout_ci/sens_analysis/realization-20/: No such file or directory
```
It is possible that the line https://github.com/equinor/komodo-releases/blob/main/.github/workflows/run_reek_hm.yml#L243 runs too early. In regular scenarios, this command is executed shortly after the ert main process has finished, and the job_dispatch subprocesses may need some extra time before their runpaths can be wiped; one way to wait for them is sketched below.
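A hedged sketch of such a wait step, assuming `pgrep` is available on the node and that matching on the job_dispatch.py command line is good enough (the function name and timeouts are made up for illustration):

```python
import subprocess
import time


def wait_for_job_dispatch(timeout: float = 300.0, poll: float = 5.0) -> bool:
    """Return True once no job_dispatch.py process remains, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # pgrep exits with a non-zero return code when nothing matches.
        result = subprocess.run(
            ["pgrep", "-f", "job_dispatch.py"], capture_output=True
        )
        if result.returncode != 0:
            return True
        time.sleep(poll)
    return False
```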
Some findings, tl;dr: a more specific way to obtain the zombie process is to do `chmod 000 <runpath>/status.json`. This makes the job_dispatch.py process trigger an OSError and become unable to do its cleanup, leaving its child as a zombie. This is similar to removing the runpath, but this command isolates the problem to the job_dispatch.py code.
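An illustrative sketch of the suspected mechanism (not the actual job_dispatch code): the runner starts the forward model, then crashes on the OSError raised while writing status, so it never reaches `process.wait()` and the finished child is left as a zombie. The file name and call order below are assumptions for illustration:

```python
import subprocess

# Stand-in for the forward model; in the real case this is e.g. poly_eval.py.
process = subprocess.Popen(["sleep", "1"])

# Status reporting: with `chmod 000 status.json` this open() raises
# PermissionError (a subclass of OSError) and propagates out of the runner.
with open("status.json", "a") as status_file:
    status_file.write("...")

process.wait()  # never reached when the open() above fails
```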
Scenarios:

- `rm -rf /lustre1/users/<username>/poly_out` is issued after all jobs are started. poly_eval.py (modified to depend on the filesystem being writable every second for a minute) will fail, but that error is captured. The ERT console gets error messages like "FileNotFoundError: No such file or directory: status.json", coming from `_ert_job_runner/reporting/file.py`. poly_eval.py will also fail, and becomes a zombie process when it is "done". This should be because job_dispatch.py has crashed, and will never `wait()` for its subprocess (see the sketch after this list).
- `chmod 000 ..../poly_out/realization-*/iter-*/status.json` will also make the job_dispatch.py process fail, but it will not trip the subprocess running poly_eval.py. This leaves zombie processes when poly_eval.py finishes.
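A hedged sketch of one possible mitigation along these lines (not the actual ERT fix): make sure the child is always reaped, even when status reporting raises. The function and parameter names are made up:

```python
import subprocess


def run_forward_model(args, report_status):
    """Hypothetical runner: start the forward model and always reap it."""
    process = subprocess.Popen(args)
    try:
        # Reporting may raise OSError if the runpath has disappeared or
        # status.json is unwritable.
        report_status(process)
    finally:
        # Reap the child unconditionally so it can never linger as a zombie.
        process.wait()
```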
Describe the bug: The job_dispatch process has been observed to remain on Azure compute nodes as a zombie process. This means the compute node is kept open with costs accruing.
For details, see https://app.slack.com/client/T02JL00JU/C02GLHN886R/thread/C02GLHN886R-1686054776.013559 (internal link).
To reproduce: Not known. Look for "still running" jobs on the Azure compute cluster during quiet periods.
Expected behaviour: job_dispatch should always end.