maikenp opened 2 years ago
It seems there are a few ways in which scancel can be configured on slurm (and I suppose by extension there are corresponding DRMAA options). @natefoo, do you have any insights here? I suppose we'd want to make sure a SIGTERM is sent first?
@maikenp Actually, it would also be interesting if you could kill an IT job with scancel and with scancel --signal=TERM, and see if the docker container continues running.
Thanks for looking into this. So you want me to kill it manually on the command line, to see whether one or both of the scancel variants actually kills the docker container? I will do that.
Yes, exactly.
[root@hepp03 ~]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1400 main g2471_in galaxy R 0:03 1 hepp03.hpc.uio.no
[root@hepp03 ~]# docker ps | grep fys5555
8d192d231a02 maikenp/docker-jupyter-fys5555:v2022-01-17 "tini -g -- /bin/sh …" 21 seconds ago Up 20 seconds 0.0.0.0:49157->8888/tcp, :::49157->8888/tcp 748093730058460c940a937eea4201f0
[root@hepp03 ~]# scancel --signal=TERM 1400
[root@hepp03 ~]# docker ps | grep fys5555
8d192d231a02 maikenp/docker-jupyter-fys5555:v2022-01-17 "tini -g -- /bin/sh …" About a minute ago Up About a minute 0.0.0.0:49157->8888/tcp, :::49157->8888/tcp 748093730058460c940a937eea4201f0
[root@hepp03 ~]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1400 main g2471_in galaxy R 3:09 1 hepp03.hpc.uio.no
[root@hepp03 ~]# docker ps | grep fys5555
8d192d231a02 maikenp/docker-jupyter-fys5555:v2022-01-17 "tini -g -- /bin/sh …" 3 minutes ago Up 3 minutes 0.0.0.0:49157->8888/tcp, :::49157->8888/tcp 748093730058460c940a937eea4201f0
[root@hepp03 ~]# scancel 1400
[root@hepp03 ~]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1401 main g2472_in galaxy R 5:05 1 hepp03.hpc.uio.no
(1401 is next job that I started 1400 is gone as you see in squeue)
scancel --signal=TERM does not cancel the job in slurm. scancel without arguments (the default signal is SIGKILL) cancels the job in slurm and causes an error in the galaxy tool history, but the container is still running.
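One possible explanation for the --signal=TERM case (a guess, worth checking against the scancel man page for your Slurm version): by default scancel --signal delivers the signal only to the job's steps, not to the batch shell running the job script, so a batch job with no explicit steps may never see it. Something along these lines might behave differently:
# Also signal the batch script itself, not just job steps
scancel --batch --signal=TERM 1400
# Or signal the batch script and all of its child processes
scancel --full --signal=TERM 1400
# Plain scancel (no --signal) sends SIGKILL and removes the job from the queue
scancel 1400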
By the way, we are using cgroups in slurm:
[root@hepp03 ~]# grep cgroup /etc/slurm/slurm.conf
TaskPlugin=task/affinity,task/cgroup
ProctrackType=proctrack/cgroup
Maybe this is a slurm configuration problem? Although cgroups should kill related processes...
Peanut gallery here... If slurm is calling docker run, then the container would be executed by the docker daemon and would no longer belong to the slurm cgroup. The only process that would be under the slurm user would be the docker CLI process.
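A quick way to check this on the compute node (a rough sketch; the container name below is just the one from the earlier docker ps output):
# Find the host PID of the containerized process and see which cgroup it belongs to
cid=$(docker ps -q --filter "name=748093730058460c940a937eea4201f0")
cpid=$(docker inspect --format '{{.State.Pid}}' "$cid")
cat /proc/"$cpid"/cgroup   # expect docker/system cgroups here, not the slurm job's cgroup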
Hi. We have a similar issue. We are running interactive tools on Galaxy release 20.05 using pulsar_embedded as the runner, as described in the GTN interactive tools guide: https://training.galaxyproject.org/training-material/topics/admin/tutorials/interactive-tools/tutorial.html
This is the job_conf.xml file:
<plugins workers="4">
    <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner"/>
    <plugin id="pulsar_embedded" type="runner" load="galaxy.jobs.runners.pulsar:PulsarEmbeddedJobRunner">
        <param id="pulsar_config">/home/galaxy/galaxy/config/pulsar_app.yml</param>
    </plugin>
</plugins>
<destinations default="local">
    <destination id="local" runner="local"/>
    <destination id="interactive_local" runner="pulsar_embedded">
        <param id="docker_enabled">true</param>
        <param id="docker_volumes">$defaults</param>
        <param id="docker_sudo">false</param>
        <param id="docker_net">bridge</param>
        <param id="docker_auto_rm">true</param>
        <param id="docker_set_user"></param>
        <param id="require_container">true</param>
        <param id="container_monitor_result">callback</param>
    </destination>
</destinations>
This is the Pulsar config file (pulsar_app.yml, referenced above):
# The path where per-job directories will be created
staging_directory: "/export/database/job_working_dir/_interactive"
# Where Pulsar state information will be stored (e.g. currently active jobs)
persistence_directory: "/home/galaxy/galaxy/var/pulsar"
# Where to find Galaxy tool dependencies
tool_dependency_dir: "/export/tool_deps"
# How to run jobs (see https://pulsar.readthedocs.io/en/latest/job_managers.html)
managers:
  _default_:
    type: queued_python
    num_concurrent_jobs: 1
The interactive tools work fine, but when they are stopped from the GUI the container keeps running.
The docker run command retrieved from the Galaxy log:
docker run -e "GALAXY_SLOTS=$GALAXY_SLOTS" -e "HOME=$HOME" -e "_GALAXY_JOB_HOME_DIR=$_GALAXY_JOB_HOME_DIR" -e "_GALAXY_JOB_TMP_DIR=$_GALAXY_JOB_TMP_DIR" -e "TMPDIR=$TMPDIR" -e "TMP=$TMP" -e "TEMP=$TEMP" -p 8000 --name d9633218e0c6420aa4aeb9aa4736d0ed -v /export/database/job_working_dir/_interactive/79:/export/database/job_working_dir/_interactive/79:ro -v /export/database/job_working_dir/_interactive/79/tool_files:/export/database/job_working_dir/_interactive/79/tool_files:ro -v /export/database/job_working_dir/_interactive/79/outputs:/export/database/job_working_dir/_interactive/79/outputs:rw -v /export/database/job_working_dir/_interactive/79/working:/export/database/job_working_dir/_interactive/79/working:rw -v /home/galaxy/galaxy/server/tool-data:/home/galaxy/galaxy/server/tool-data:ro -v /home/galaxy/galaxy/server/tool-data:/home/galaxy/galaxy/server/tool-data:ro -w /export/database/job_working_dir/_interactive/79/working --net bridge --rm shiltemann/ethercalc-galaxy-ie:17.05 /bin/sh /export/database/job_working_dir/_interactive/79/tool_script.sh; return_code=$?; sh -c "exit $return_code"]
Galaxy log at the time of the GUI stop command:
Mar 16 14:52:44 express-it-finale.cloud.ba.infn.it uwsgi[22853]: galaxy.jobs.handler DEBUG 2022-03-16 14:52:44,741 [p:22868,w:0,m:1] [JobHandlerStopQueue.monitor_thread] Stopping job 79 in pulsar_embedded runner
Mar 16 14:52:44 express-it-finale.cloud.ba.infn.it uwsgi[22853]: galaxy.jobs.runners.pulsar DEBUG 2022-03-16 14:52:44,775 [p:22868,w:0,m:1] [JobHandlerStopQueue.monitor_thread] Attempt remote Pulsar kill of job with url pulsar_embedded and id 79
Mar 16 14:52:44 express-it-finale.cloud.ba.infn.it uwsgi[22853]: pulsar.managers.unqueued INFO 2022-03-16 14:52:44,775 [p:22868,w:0,m:1] [JobHandlerStopQueue.monitor_thread] Attempting to kill job with job_id 79
Mar 16 14:52:44 express-it-finale.cloud.ba.infn.it uwsgi[22853]: pulsar.managers.unqueued INFO 2022-03-16 14:52:44,776 [p:22868,w:0,m:1] [JobHandlerStopQueue.monitor_thread] Attempting to kill pid 23201
Docker ps after the interactive tool job stop:
[root@express-it-finale galaxy]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b0313e1d5025 shiltemann/ethercalc-galaxy-ie:17.05 "/bin/sh /export/dat…" About a minute ago Up About a minute 80/tcp, 0.0.0.0:49226->8000/tcp, :::49226->8000/tcp d9633218e0c6420aa4aeb9aa4736d0ed
The ID of the process which is killed is retrieved from the job working dir created by pulsar for the interactive tool, and it corresponds to the command.sh PID:
[root@express-it-finale galaxy]# cat /export/database/job_working_dir/_interactive/80/pid
"23622"
[root@express-it-finale galaxy]# ps -aux | grep 23622
galaxy 23622 0.0 0.0 113280 1480 ? S 14:54 0:00 /bin/bash /export/database/job_working_dir/_interactive/80/command.sh
Manually killing this process does not stop and remove the container:
[root@express-it-finale galaxy]# kill -9 23622; docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
157e9cfeb9bc shiltemann/ethercalc-galaxy-ie:17.05 "/bin/sh /export/dat…" 2 minutes ago Up 2 minutes 80/tcp, 0.0.0.0:49227->8000/tcp, :::49227->8000/tcp bcf605c3cecb4140b1ac67ca1d13b1ea
Instead, the container is successfully stopped and removed by killing the PID of the interactive tool's tool_script.sh:
[root@express-it-finale galaxy]# ps -aux | grep tool_script.sh
root 23753 0.0 0.0 4484 724 ? Ss 14:54 0:00 /bin/sh /export/database/job_working_dir/_interactive/80/tool_script.sh
[root@express-it-finale galaxy]# kill -9 23753
[root@express-it-finale galaxy]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
I've found a solution for stopping both the interactive tool job and the corresponding container.
I noticed that when using kill to stop the job from the command line, the docker run process keeps running:
[root@express-it-finale centos]# cat /export/database/job_working_dir/_interactive/111/pid
"3757"
[root@express-it-finale centos]# kill -9 3757
[root@express-it-finale centos]# ps -aux | grep docker
galaxy 3789 0.0 0.8 883288 33548 ? Sl 07:30 0:00 docker run -e GALAXY_SLOTS=1 -e HOME=/home/galaxy/galaxy -e _GALAXY_JOB_HOME_DIR=/export/database/job_working_dir/_interactive/111/working/home -e _GALAXY_JOB_TMP_DIR= -e TMPDIR= -e TMP= -e TEMP= -e HISTORY_ID= -e REMOTE_HOST= -e GALAXY_WEB_PORT= -e GALAXY_URL= -e API_KEY= -p 8888 --name 7b85bd0eeb594fb1a4976224fdd91ecb -v /export/database/job_working_dir/_interactive/111:/export/database/job_working_dir/_interactive/111:ro -v /export/database/job_working_dir/_interactive/111/tool_files:/export/database/job_working_dir/_interactive/111/tool_files:ro -v /export/database/job_working_dir/_interactive/111/outputs:/export/database/job_working_dir/_interactive/111/outputs:rw -v /export/database/job_working_dir/_interactive/111/working:/export/database/job_working_dir/_interactive/111/working:rw -v /home/galaxy/galaxy/server/tool-data:/home/galaxy/galaxy/server/tool-data:ro -v /home/galaxy/galaxy/server/tool-data:/home/galaxy/galaxy/server/tool-data:ro -w /export/database/job_working_dir/_interactive/111/working --net bridge --rm quay.io/bgruening/docker-jupyter-notebook:ie2 /bin/sh /export/database/job_working_dir/_interactive/111/tool_script.sh
[root@express-it-finale centos]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
2be9e1cb3b4e quay.io/bgruening/docker-jupyter-notebook:ie2 "tini -g -- /bin/sh …" 50 minutes ago Up 50 minutes 0.0.0.0:49254->8888/tcp, :::49254->8888/tcp 7b85bd0eeb594fb1a4976224fdd91ecb
Instead, when using pkill to stop the job from the command line, the job's child processes and the container are stopped:
[root@express-it-finale centos]# cat /export/database/job_working_dir/_interactive/112/pid
"4399"
[root@express-it-finale centos]# pkill -9 -P 4399
[root@express-it-finale centos]# ps -aux | grep docker
root 2647 0.2 1.7 1327964 69704 ? Ssl Mar11 24:12 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
[root@express-it-finale centos]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
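This works because the docker run process is a direct child of command.sh, so pkill -P reaches it. A quick way to confirm that on your own node (a sketch, reusing the command.sh PID from the first example above; it needs to be run before the job is killed):
# List the direct children of the command.sh process; the docker run process (3789 in the output above) should show up here
ps --ppid 3757 -o pid,cmd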
I then modified the kill_pid function in the pulsar.managers.util.kill module (which is the one used by embedded Pulsar to kill a job by its PID) to use pkill when stopping a job from the GUI.
This is the modified function:
import subprocess
def kill_pid(pid, use_psutil=True):
    # pkill -P kills the child processes of pid, including the foreground docker run
    subprocess.Popen(args=f"pkill -9 -P {pid}", shell=True)
With this modification, the container is stopped and removed when an interactive tool job is stopped from the GUI. This solution is specific to Pulsar, so it may be a little off-topic with respect to @maikenp's initial issue, and it doesn't consider the Windows-related functions in the same Pulsar module, but I hope it is still of some help.
These are two separate but related issues. As @innovate-invent points out, when you use docker run, your actual container is run by the persistent docker daemon running on the host, and so it "escapes" the process environment of the job. That said, when a foreground docker run command is signalled, it should cause dockerd to terminate the container. If we are running those containers detached (docker run -d), then there is nothing to signal.
In a one-off workshop instance I had set up, I used this Slurm epilog script to clean up containers after the job: https://github.com/natefoo/usegalaxy-clone/blob/master/files/slurm/epilog.sh
In the Pulsar queued_python manager, we should probably be killing the process group, not the pid, but if we're running detached containers that still won't fix it.
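For reference, the shell-level equivalent of killing the process group instead of just the recorded PID would be something like this (a sketch; it assumes the job script runs in its own process group, which is worth verifying):
# Look up the process group of the recorded PID and signal the whole group
pgid=$(ps -o pgid= -p "$pid" | tr -d ' ')
kill -TERM -- "-$pgid"   # a negative PID signals every process in the group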
From https://dhruveshp.com/blog/2021/signal-propagation-on-slurm/:
slurm will send SIGTERM to the job and wait for certain amount of time before it sends the final SIGKILL.
So I would assume something like
diff --git a/lib/galaxy/tool_util/deps/container_classes.py b/lib/galaxy/tool_util/deps/container_classes.py
index 0d92bc506f..82dd8143c2 100644
--- a/lib/galaxy/tool_util/deps/container_classes.py
+++ b/lib/galaxy/tool_util/deps/container_classes.py
@@ -349,7 +349,7 @@ class DockerContainer(Container, HasDockerLikeVolumes):
_on_exit() {{
{kill_command} &> /dev/null
}}
-trap _on_exit 0
+trap _on_exit 0 SIGTERM
{cache_command}
{run_command}"""
or
diff --git a/lib/galaxy/tool_util/deps/container_classes.py b/lib/galaxy/tool_util/deps/container_classes.py
index 0d92bc506f..82dd8143c2 100644
--- a/lib/galaxy/tool_util/deps/container_classes.py
+++ b/lib/galaxy/tool_util/deps/container_classes.py
@@ -349,7 +349,8 @@ class DockerContainer(Container, HasDockerLikeVolumes):
_on_exit() {{
{kill_command} &> /dev/null
}}
trap _on_exit 0
+trap _on_exit SIGTERM
{cache_command}
{run_command}"""
would work? Anyone have a clean way to try this out quickly?
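Not Galaxy-integrated, but one rough way to exercise the trap mechanics in isolation (a sketch; it emulates Slurm signalling every process in the job by signalling the whole process group, and should be run from an interactive shell so the background job gets its own process group):
cat > /tmp/trap_test.sh <<'EOF'
_on_exit() {
    echo "trap fired, kill_command would run here" >> /tmp/trap_test.log
}
trap _on_exit 0 TERM
sleep 600
EOF
bash /tmp/trap_test.sh &
sleep 2
kill -TERM -- "-$!"              # signal the whole process group, roughly what Slurm does before the final SIGKILL
sleep 2; cat /tmp/trap_test.log  # the "trap fired" line should appear if TERM is handled
With only "trap _on_exit 0", the shell is killed by the TERM before the EXIT trap ever runs, which is essentially the situation the diffs above try to fix.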
Thanks for the suggestion. We finally implemented the Slurm epilog script approach, and it works as expected.
The only change was that the path to the container_config.json is in the configs subdirectory, not directly under $workdir:
container_config="${workdir}/configs/container_config.json"
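For reference, the core of what such an epilog ends up doing is roughly the following (a simplified sketch, not the exact script linked above; it assumes the Slurm job's WorkDir is the Galaxy job working directory and that the container name sits under a container_name key in container_config.json, both of which should be checked against your setup):
#!/bin/sh
# Slurm epilog: kill any Galaxy-started container belonging to the finished job
workdir=$(scontrol show job "$SLURM_JOB_ID" | grep -oP 'WorkDir=\K\S+')
container_config="${workdir}/configs/container_config.json"
if [ -f "$container_config" ]; then
    container_name=$(jq -r '.container_name' "$container_config")   # hypothetical key; adjust to the real file (and make sure jq is installed)
    docker kill "$container_name" > /dev/null 2>&1
fi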
So it is fine that this can be handled in an epilog script, but is there a way it could be handled by Galaxy instead?
Thanks, we did try both of the trap suggestions above, but neither had any effect on our system. As you can see, though, the epilog solution worked, which is obvious of course, since it just extracts the container ID for the corresponding Slurm job and does a docker kill. Which is fine, I guess.
Describe the bug
Galaxy interactive tools: when a tool is stopped, either by the user or by the backend (Slurm) for exceeding its time limit, the docker container keeps running on the machine.
This is on a local Galaxy cluster, galaxy-hepp.hpc.uio.no, with a Slurm backend.
This is our config for interactive jobs - we do use docker_auto_rm:
And an example of the galaxy_.sh docker run command resulting from these settings:
Reproducing the error:
1) The job has started:
Job in slurm when the job is running:
The docker container on hepp03 just after the job started:
2) I stop the job in Galaxy. The job disappears from the squeue output, as expected.
However, the container is still running on hepp03 after the job is stopped in Galaxy:
Mention of the job in slurmd.log:
Result of journalctl | grep 2469 (2469 was the job number in Galaxy):
Galaxy Version and/or server at which you observed the bug
Slurm backend. Galaxy Version: 21.09
To Reproduce
Steps to reproduce the behavior:
Expected behavior
When the interactive tool is stopped, the docker container should be stopped and removed on the compute node.