equinor / ert

ERT - Ensemble based Reservoir Tool - is designed for running ensembles of dynamical models such as reservoir models, in order to do sensitivity analysis and data assimilation. ERT supports data assimilation using the Ensemble Smoother (ES), Ensemble Smoother with Multiple Data Assimilation (ES-MDA) and Iterative Ensemble Smoother (IES).
https://ert.readthedocs.io/en/latest/
GNU General Public License v3.0

job_dispatch not exiting reliably on compute cluster nodes #5549

Closed berland closed 1 year ago

berland commented 1 year ago

Describe the bug: The job_dispatch process has been observed to remain on Azure compute nodes, leaving a zombie (defunct) child process behind. This means the compute node is kept open while costs keep running.

For details, see https://app.slack.com/client/T02JL00JU/C02GLHN886R/thread/C02GLHN886R-1686054776.013559 (internal link)

[hrbu@s034-lcam ~]$ qstat -anw11

s034-lcam:
                                                                                                   Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
45489.s034-lcam                f_scout_ci      permanent       DROGON-0           20098    1     1    --    --  R 07:56 s034-a074b225f/0
45490.s034-lcam                f_scout_ci      permanent       DROGON-1           20100    1     1    --    --  R 07:56 s034-a074b225f/1
[hrbu@s034-lcam ~]$ ssh s034-a074b225f ps f  -fu f_scout_ci
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
f_scout+ 20100 14171  0 12:32 ?        Ss     0:00 -csh
f_scout+ 20160 20100  0 12:32 ?        S      0:00  \_ /bin/sh /var/spool/pbs/mom_priv/jobs/45490.s034-lcam.SC
f_scout+ 20162 20160  0 12:32 ?        SNl    0:01      \_ /prog/komodo/2023.06.rc4-py38-rhel7/root/bin/python /prog/komodo/2023.06.rc4-py38-rhel7/root/bin/job_dispatch.py /lustre1/users/f_scout_ci/hm-tutorial-runs/tmp.mbeQpOT3cm/fmu-drogon/ert/model/scratch/f_scout_ci/01_drogon_design/realization-1/iter-0
f_scout+ 22200 20162  0 12:39 ?        ZN     0:00          \_ [python] <defunct>
f_scout+ 20098 14171  0 12:32 ?        Ss     0:00 -csh
f_scout+ 20161 20098  0 12:32 ?        S      0:00  \_ /bin/sh /var/spool/pbs/mom_priv/jobs/45489.s034-lcam.SC
f_scout+ 20163 20161  0 12:32 ?        SNl    0:01      \_ /prog/komodo/2023.06.rc4-py38-rhel7/root/bin/python /prog/komodo/2023.06.rc4-py38-rhel7/root/bin/job_dispatch.py /lustre1/users/f_scout_ci/hm-tutorial-runs/tmp.mbeQpOT3cm/fmu-drogon/ert/model/scratch/f_scout_ci/01_drogon_design/realization-0/iter-0
f_scout+ 22201 20163  0 12:39 ?        ZN     0:00          \_ [python] <defunct>

To reproduce: Not known. Look for "still running" jobs on the Azure compute cluster during quiet periods.

Expected behaviour: job_dispatch should always exit.


berland commented 1 year ago

Related reading: https://stackoverflow.com/questions/25172425/create-zombie-process

xjules commented 1 year ago

We should make dummy scripts/jobs that fail in many different ways, e.g. allocate too much disk space or memory, write to wrong locations, and so on. This relates to job_dispatch execution. Then we should check ps aux | grep job_dispatch, for instance.
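
A minimal sketch of such a dummy job, assuming a standalone Python script wired in as a forward model step (the script name and failure modes are made up for illustration):

#!/usr/bin/env python
"""Deliberately failing dummy job for exercising job_dispatch cleanup."""
import sys

mode = sys.argv[1] if len(sys.argv) > 1 else "exitcode"

if mode == "memory":
    # Keep allocating until the OOM killer (or a memory limit) stops us
    hog = []
    while True:
        hog.append(bytearray(100 * 1024 * 1024))
elif mode == "badwrite":
    # Write to a location we should not have permission to touch
    with open("/forbidden_location.txt", "w") as fh:
        fh.write("boom")
else:
    # Plain non-zero exit code
    sys.exit(1)

While (and after) such jobs run, ps aux | grep job_dispatch on the compute node should show whether job_dispatch exits cleanly or leaves defunct children behind.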

xjules commented 1 year ago

Adding hints from @kwinkunks about how to fail nicely in Python:


In case we need it, this produces a SIGSEGV:

import ctypes

# Write past the single allocated byte until the kernel sends SIGSEGV
i = ctypes.c_char(b'a')
j = ctypes.pointer(i)
c = 0
while True:
    j[c] = b'a'
    c += 1

☠️ This produces a SIGKILL:

import sys

# Recurse without bound once the recursion limit no longer stops it
sys.setrecursionlimit(1 << 30)
f = lambda f: f(f)
f(f)

xjules commented 1 year ago

One hypothesis is that killing children belonging to the same process group does not work due to this: https://stackoverflow.com/questions/4789837/how-to-terminate-a-python-subprocess-launched-with-shell-true/4791612#4791612
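
For reference, a minimal sketch of the process-group approach from that answer (the command string is illustrative, not taken from the ert code):

import os
import signal
import subprocess

# Run the command in its own session (and thus its own process group), so the
# whole group can be signalled even though shell=True means proc.pid is the
# pid of the shell, not of the actual command.
proc = subprocess.Popen("run_forward_model.sh", shell=True, start_new_session=True)

# Later, terminate the shell and everything it spawned:
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)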

According to the docs, it looks like the zombie process is the forward model script (which has completed), while job_dispatch froze in process.wait when polling for the exit code. We still need to emulate the zombie processes, though.
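
A zombie on its own is easy to emulate: it only takes a parent that never reaps its finished child. A minimal sketch (illustrative, not ert code):

import subprocess
import time

# The child exits immediately, but we never call wait()/poll()/communicate(),
# so it stays as a <defunct> entry in ps output until the parent reaps it or
# exits itself.
child = subprocess.Popen(["true"])
time.sleep(300)  # inspect with `ps f -fu $USER` in another terminal meanwhile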

berland commented 1 year ago

The following deliberate bug in job.py is at least able to reproduce the symptom:

diff --git a/src/_ert_job_runner/job.py b/src/_ert_job_runner/job.py
index 719832a6f..9bfc7a509 100644
--- a/src/_ert_job_runner/job.py
+++ b/src/_ert_job_runner/job.py
@@ -107,6 +107,9 @@ class Job:

             yield Running(self, max_memory_usage, memory)

+            while True:
+                time.sleep(1)
+
             try:
                 exit_code = process.wait(timeout=self.MEMORY_POLL_PERIOD)
             except TimeoutExpired:

When running the poly case with this, one quickly gets this:

$ ps f -f
berland   609066  433119  4 19:06 pts/4    Sl+    0:08  \_ /home/berland/venv/newer/bin/python3 /home/berland/venv/newer/bin/ert test_run poly.ert
berland   609110  609066  0 19:06 pts/4    SNl    0:00      \_ /home/berland/venv/newer/bin/python3 /home/berland/venv/newer/bin/job_dispatch.py /home/berland/projects/ert/test-data/poly_example/poly_out/realization-0/iter-0
berland   609113  609110  0 19:06 pts/4    ZN     0:00          \_ [python] <defunct>

berland commented 1 year ago

With the deliberate bug above, then

$ ert test_run poly.ert &
$ killall ert; killall ert

will leave a zombie process. When the mothership ert is killed (SIGTERM requires two shots, SIGKILL only one), the job_dispatch process is never killed and job_dispatch's own child remains a zombie.

Killing the job_dispatch.py process leaves no zombie (the orphaned defunct child is reparented to init and reaped), and ERT exits with a failure.

(using ensemble_experiment with local queue gives the same behaviour as with test_run)

berland commented 1 year ago

This problem seems reproducible given:

Probably independent of the queue system in use.

If RUNPATH is on /private or similar, you will not be able to do rm -rf my_runpath because .nfsxxxxx lock files prevent it, and then we do not end up in the zombie situation.

berland commented 1 year ago

This scenario might describe the current zombie processes:

[havb@s034-a0455cc04:~]$ ps f -fu f_scout_ci | tail -n 3
f_scout+  22971  20288  0 Aug16 ?        S      0:00  \_ /bin/sh /var/spool/pbs/mom_priv/jobs/71507.s034-lcam.SC
f_scout+  22977  22971  0 Aug16 ?        SNl    0:01      \_ /prog/komodo/bleeding-py38-rhel7/root/bin/python /prog/komodo/bleeding-py38-rhel7/root/bin/job_dispatch.py /lustre1/users/f_scout_ci/hm-tutorial-runs/tmp.t9SiMc1ANY/sens_analysis/f_scout_ci/sens_analysis/realization-20/iter-0
f_scout+  29029  22977  0 Aug16 ?        ZN     0:00          \_ [python] <defunct>

where we can verify that the mentioned RUNPATH is indeed missing:

[f_scout_ci@s034-a0455cc04 sens_analysis]$ ls -ld /lustre1/users/f_scout_ci/hm-tutorial-runs/tmp.t9SiMc1ANY/sens_analysis/f_scout_ci/sens_analysis/realization-20/
ls: cannot access /lustre1/users/f_scout_ci/hm-tutorial-runs/tmp.t9SiMc1ANY/sens_analysis/f_scout_ci/sens_analysis/realization-20/: No such file or directory

berland commented 1 year ago

It is possible that the line

https://github.com/equinor/komodo-releases/blob/main/.github/workflows/run_reek_hm.yml#L243

occurs too early. In regular scenarios, this command is executed immediately after the ert main process finishes, and the job_dispatch subprocesses potentially need some extra time before their runpaths can be wiped.
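
If that timing is the culprit, one option (purely a sketch, not something the workflow does today) would be to wait for lingering job_dispatch processes before wiping the runpaths:

import subprocess
import time

def wait_for_job_dispatch(timeout=120, poll=5):
    """Return True once no job_dispatch.py processes remain, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # pgrep exits non-zero when nothing matches the pattern
        if subprocess.run(
            ["pgrep", "-f", "job_dispatch.py"], stdout=subprocess.DEVNULL
        ).returncode != 0:
            return True
        time.sleep(poll)
    return False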

berland commented 1 year ago

Some findings:

tldr: A more specific way to obtain the zombie process is to do chmod 000 <runpath>/status.json. This makes the job_dispatch.py process trigger an OSError and become unable to do its cleanup, leaving its child as a zombie. This is similar to removing the runpath, but this command isolates the problem to the job_dispatch.py code.
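
The OSError itself is easy to confirm in isolation (illustrative snippet, run as a non-root user):

import os
import tempfile

# A status.json with mode 000 makes open() raise PermissionError, which is a
# subclass of OSError, i.e. the same class job_dispatch trips on.
path = os.path.join(tempfile.mkdtemp(), "status.json")
open(path, "w").close()
os.chmod(path, 0o000)
try:
    open(path)
except OSError as err:
    print(f"OSError as seen by job_dispatch: {err!r}")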

Scenarios: