equinor / ert

ERT - Ensemble based Reservoir Tool - is designed for running ensembles of dynamical models such as reservoir models, in order to do sensitivity analysis and data assimilation. ERT supports data assimilation using the Ensemble Smoother (ES), Ensemble Smoother with Multiple Data Assimilation (ES-MDA) and Iterative Ensemble Smoother (IES).
https://ert.readthedocs.io/en/latest/
GNU General Public License v3.0

job_dispatch not exiting reliably on compute cluster nodes #5549

Closed berland closed 1 year ago

berland commented 1 year ago

Describe the bug: The job_dispatch process has been observed to remain on Azure compute nodes, leaving a zombie (defunct) child process behind. This means the compute node is kept open while costs keep running.

For details, see https://app.slack.com/client/T02JL00JU/C02GLHN886R/thread/C02GLHN886R-1686054776.013559 (internal link)

[hrbu@s034-lcam ~]$ qstat -anw11

s034-lcam:
                                                                                                   Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
45489.s034-lcam                f_scout_ci      permanent       DROGON-0           20098    1     1    --    --  R 07:56 s034-a074b225f/0
45490.s034-lcam                f_scout_ci      permanent       DROGON-1           20100    1     1    --    --  R 07:56 s034-a074b225f/1
[hrbu@s034-lcam ~]$ ssh s034-a074b225f ps f  -fu f_scout_ci
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
f_scout+ 20100 14171  0 12:32 ?        Ss     0:00 -csh
f_scout+ 20160 20100  0 12:32 ?        S      0:00  \_ /bin/sh /var/spool/pbs/mom_priv/jobs/45490.s034-lcam.SC
f_scout+ 20162 20160  0 12:32 ?        SNl    0:01      \_ /prog/komodo/2023.06.rc4-py38-rhel7/root/bin/python /prog/komodo/2023.06.rc4-py38-rhel7/root/bin/job_dispatch.py /lustre1/users/f_scout_ci/hm-tutorial-runs/tmp.mbeQpOT3cm/fmu-drogon/ert/model/scratch/f_scout_ci/01_drogon_design/realization-1/iter-0
f_scout+ 22200 20162  0 12:39 ?        ZN     0:00          \_ [python] <defunct>
f_scout+ 20098 14171  0 12:32 ?        Ss     0:00 -csh
f_scout+ 20161 20098  0 12:32 ?        S      0:00  \_ /bin/sh /var/spool/pbs/mom_priv/jobs/45489.s034-lcam.SC
f_scout+ 20163 20161  0 12:32 ?        SNl    0:01      \_ /prog/komodo/2023.06.rc4-py38-rhel7/root/bin/python /prog/komodo/2023.06.rc4-py38-rhel7/root/bin/job_dispatch.py /lustre1/users/f_scout_ci/hm-tutorial-runs/tmp.mbeQpOT3cm/fmu-drogon/ert/model/scratch/f_scout_ci/01_drogon_design/realization-0/iter-0
f_scout+ 22201 20163  0 12:39 ?        ZN     0:00          \_ [python] <defunct>

To reproduce: Not known. Look for "still running" jobs on the Azure compute cluster during quiet periods.

Expected behaviour: job_dispatch should always exit.


berland commented 1 year ago

Related reading: https://stackoverflow.com/questions/25172425/create-zombie-process

xjules commented 1 year ago

We should make dummy scripts/jobs that fail in many different ways, e.g. allocate too much disk space or memory, write to wrong locations, and so on. This relates to job_dispatch execution. Then we should check ps aux | grep job_dispatch, for instance.
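
A minimal sketch of such a dummy job, assuming a standalone Python script wired in as a forward model step (the script name and failure modes are made up for illustration):

#!/usr/bin/env python
"""Deliberately failing dummy job for exercising job_dispatch cleanup."""
import sys

mode = sys.argv[1] if len(sys.argv) > 1 else "exitcode"

if mode == "memory":
    # Keep allocating until the OOM killer (or a memory limit) stops us
    hog = []
    while True:
        hog.append(bytearray(100 * 1024 * 1024))
elif mode == "badwrite":
    # Write to a location we should not have permission to touch
    with open("/forbidden_location.txt", "w") as fh:
        fh.write("boom")
else:
    # Plain non-zero exit code
    sys.exit(1)

While (and after) such jobs run, ps aux | grep job_dispatch on the compute node should show whether job_dispatch exits cleanly or leaves defunct children behind.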

xjules commented 1 year ago

Adding hints from @kwinkunks about how to fail nicely in Python:


In case we need it, this produces a SIGSEGV:

import ctypes

# Write past the single allocated byte until the kernel sends SIGSEGV
i = ctypes.c_char(b'a')
j = ctypes.pointer(i)
c = 0
while True:
    j[c] = b'a'
    c += 1

☠️ This produces a SIGKILL:

import sys

# Recurse without bound once the recursion limit no longer stops it
sys.setrecursionlimit(1 << 30)
f = lambda f: f(f)
f(f)

xjules commented 1 year ago

One hypothesis is that killing children belonging to the same process group does not work due to this: https://stackoverflow.com/questions/4789837/how-to-terminate-a-python-subprocess-launched-with-shell-true/4791612#4791612
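
For reference, a minimal sketch of the process-group approach from that answer (the command string is illustrative, not taken from the ert code):

import os
import signal
import subprocess

# Run the command in its own session (and thus its own process group), so the
# whole group can be signalled even though shell=True means proc.pid is the
# pid of the shell, not of the actual command.
proc = subprocess.Popen("run_forward_model.sh", shell=True, start_new_session=True)

# Later, terminate the shell and everything it spawned:
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)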

According to the docs, it looks like the zombie process is the forward model script (which has completed), while job_dispatch froze in process.wait when polling for the exit code. We still need to emulate the zombie processes, though.
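
A zombie on its own is easy to emulate: it only takes a parent that never reaps its finished child. A minimal sketch (illustrative, not ert code):

import subprocess
import time

# The child exits immediately, but we never call wait()/poll()/communicate(),
# so it stays as a <defunct> entry in ps output until the parent reaps it or
# exits itself.
child = subprocess.Popen(["true"])
time.sleep(300)  # inspect with `ps f -fu $USER` in another terminal meanwhile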

berland commented 1 year ago

The following deliberate bug in job.py is at least able to reproduce the symptom:

diff --git a/src/_ert_job_runner/job.py b/src/_ert_job_runner/job.py
index 719832a6f..9bfc7a509 100644
--- a/src/_ert_job_runner/job.py
+++ b/src/_ert_job_runner/job.py
@@ -107,6 +107,9 @@ class Job:

             yield Running(self, max_memory_usage, memory)

+            while True:
+                time.sleep(1)
+
             try:
                 exit_code = process.wait(timeout=self.MEMORY_POLL_PERIOD)
             except TimeoutExpired:

When running the poly case with this, one quickly gets this:

$ ps f -f
berland   609066  433119  4 19:06 pts/4    Sl+    0:08  \_ /home/berland/venv/newer/bin/python3 /home/berland/venv/newer/bin/ert test_run poly.ert
berland   609110  609066  0 19:06 pts/4    SNl    0:00      \_ /home/berland/venv/newer/bin/python3 /home/berland/venv/newer/bin/job_dispatch.py /home/berland/projects/ert/test-data/poly_example/poly_out/realization-0/iter-0
berland   609113  609110  0 19:06 pts/4    ZN     0:00          \_ [python] <defunct>

berland commented 1 year ago

With the deliberate bug above, then

$ ert test_run poly.ert &
$ killall ert; killall ert

will leave a zombie process. When the mothership ert is killed (SIGTERM requires two shots, SIGKILL only one), the job_dispatch process is never killed and job_dispatch's own child remains a zombie.

Killing the job_dispatch.py process leaves no zombie (the orphaned defunct child is reparented to init and reaped), and ERT exits with a failure.

(using ensemble_experiment with local queue gives the same behaviour as with test_run)

berland commented 1 year ago

This problem seems reproducible given:

Probably independent of the queue system in use.

If RUNPATH is on /private or similar, you will not be able to do rm -rf my_runpath because .nfsxxxxx lock files prevent it, and then we do not end up in the zombie situation.

berland commented 1 year ago

This scenario might describe the current zombie processes:

[havb@s034-a0455cc04:~]$ ps f -fu f_scout_ci | tail -n 3
f_scout+  22971  20288  0 Aug16 ?        S      0:00  \_ /bin/sh /var/spool/pbs/mom_priv/jobs/71507.s034-lcam.SC
f_scout+  22977  22971  0 Aug16 ?        SNl    0:01      \_ /prog/komodo/bleeding-py38-rhel7/root/bin/python /prog/komodo/bleeding-py38-rhel7/root/bin/job_dispatch.py /lustre1/users/f_scout_ci/hm-tutorial-runs/tmp.t9SiMc1ANY/sens_analysis/f_scout_ci/sens_analysis/realization-20/iter-0
f_scout+  29029  22977  0 Aug16 ?        ZN     0:00          \_ [python] <defunct>

where we can verify that the mentioned RUNPATH is indeed missing:

[f_scout_ci@s034-a0455cc04 sens_analysis]$ ls -ld /lustre1/users/f_scout_ci/hm-tutorial-runs/tmp.t9SiMc1ANY/sens_analysis/f_scout_ci/sens_analysis/realization-20/
ls: cannot access /lustre1/users/f_scout_ci/hm-tutorial-runs/tmp.t9SiMc1ANY/sens_analysis/f_scout_ci/sens_analysis/realization-20/: No such file or directory

berland commented 1 year ago

It is possible that the line

https://github.com/equinor/komodo-releases/blob/main/.github/workflows/run_reek_hm.yml#L243

occurs too early. In regular scenarios, this command is executed immediately after the ert main process finishes, and the job_dispatch subprocesses potentially need some extra time before their runpaths can be wiped.
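
If that timing is the culprit, one option (purely a sketch, not something the workflow does today) would be to wait for lingering job_dispatch processes before wiping the runpaths:

import subprocess
import time

def wait_for_job_dispatch(timeout=120, poll=5):
    """Return True once no job_dispatch.py processes remain, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # pgrep exits non-zero when nothing matches the pattern
        if subprocess.run(
            ["pgrep", "-f", "job_dispatch.py"], stdout=subprocess.DEVNULL
        ).returncode != 0:
            return True
        time.sleep(poll)
    return False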

berland commented 1 year ago

Some findings:

tldr: A more specific way to obtain the zombie process is to do chmod 000 <runpath>/status.json. This makes the job_dispatch.py process trigger an OSError and become unable to do its cleanup, leaving its child as a zombie. This is similar to removing the runpath, but this command isolates the problem to the job_dispatch.py code.
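
The OSError itself is easy to confirm in isolation (illustrative snippet, run as a non-root user):

import os
import tempfile

# A status.json with mode 000 makes open() raise PermissionError, which is a
# subclass of OSError, i.e. the same class job_dispatch trips on.
path = os.path.join(tempfile.mkdtemp(), "status.json")
open(path, "w").close()
os.chmod(path, 0o000)
try:
    open(path)
except OSError as err:
    print(f"OSError as seen by job_dispatch: {err!r}")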

Scenarios: