galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Galaxy interactive tools with slurm and docker - docker container is not removed when tool is finished. #13511

Open maikenp opened 2 years ago

maikenp commented 2 years ago

Describe the bug: With Galaxy interactive tools, when a tool is stopped, either by the user or by the backend (Slurm) because it exceeded its time limit, the Docker container keeps running on the machine.

This is on a local Galaxy cluster, galaxy-hepp.hpc.uio.no, with a Slurm backend.

This is our destination config for interactive jobs; note that we do use docker_auto_rm:

<destination id="interactive_slurm" runner="slurm">
    <param id="nativeSpecification">--time=24:00:00 --mem=100</param>
    <param id="docker_enabled">true</param>
    <param id="docker_volumes">$defaults,....plus some more... </param>
    <param id="docker_sudo">false</param>
    <param id="docker_net">bridge</param>
    <param id="docker_auto_rm">true</param>
    <param id="docker_set_user"></param>
    <param id="require_container">true</param>
</destination>

And an example of the docker run command from the generated galaxy_<job_id>.sh script that results from these settings:

docker run -e "GALAXY_SLOTS=$GALAXY_SLOTS" -e "GALAXY_MEMORY_MB=$GALAXY_MEMORY_MB" -e "GALAXY_MEMORY_MB_PER_SLOT=$GALAXY_MEMORY_MB_PER_SLOT" -e "HOME=$HOME" -e "_GALAXY_JOB_HOME_DIR=$_GALAXY_JOB_HOME_DIR" -e "_GALAXY_JOB_TMP_DIR=$_GALAXY_JOB_TMP_DIR" -e "TMPDIR=$TMPDIR" -e "TMP=$TMP" -e "TEMP=$TEMP" -e "HISTORY_ID=$HISTORY_ID" -e "REMOTE_HOST=$REMOTE_HOST" -e "GALAXY_WEB_PORT=$GALAXY_WEB_PORT" -e "GALAXY_URL=$GALAXY_URL" -e "API_KEY=$API_KEY" -p 8888 --name 095bd57e7a9d4c5b8ec0e1f19f148e67 -v /srv/galaxy/server:/srv/galaxy/server:ro -....removed-all-the-volume-mounts.... --cpus ${GALAXY_SLOTS:-1} -w /storage/galaxy/jobs_directory/002/2467/working --net bridge --rm maikenp/docker-jupyter-fys5555:v2022-01-17 /bin/sh /storage/galaxy/jobs_directory/002/2467/tool_script.sh > ../outputs/tool_stdout 2> ../outputs/tool_stderr; return_code=$?; cd '/storage/galaxy/jobs_directory/002/2467'; 

Reproducing the error: 1) The job has started:

Screenshot 2022-03-09 at 15 51 26

Job in slurm when the job is running:

[maikenp@hepp03 ~]$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1398      main g2469_in   galaxy  R       1:13      1 hepp03.hpc.uio.no

The docker container on hepp03 just after the job started:

[maikenp@hepp03 ~]$ docker ps
CONTAINER ID   IMAGE                                                                   COMMAND                  CREATED          STATUS          PORTS                                                   NAMES
8ad0832e3412   maikenp/docker-jupyter-fys5555:v2022-01-17                              "tini -g -- /bin/sh …"   16 seconds ago   Up 15 seconds   0.0.0.0:49155->8888/tcp, :::49155->8888/tcp             e1e3317ee18846799e65413cd8cffa98

2) I stop the job in Galaxy. The job disappears from the squeue output as expected.

Screenshot 2022-03-09 at 15 51 45
[maikenp@hepp03 ~]$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

However, the container is still running on hepp03 after the job is stopped in Galaxy:

[maikenp@hepp03 ~]$ docker ps
CONTAINER ID   IMAGE                                                                   COMMAND                  CREATED          STATUS          PORTS                                                   NAMES
8ad0832e3412   maikenp/docker-jupyter-fys5555:v2022-01-17                              "tini -g -- /bin/sh …"   2 minutes ago    Up 2 minutes    0.0.0.0:49155->8888/tcp, :::49155->8888/tcp             e1e3317ee18846799e65413cd8cffa98

Mention of the job in slurmd.log:

[2022-03-09T15:49:36.147] ====================
[2022-03-09T15:49:36.147] batch_job:1398 job_mem:100MB
[2022-03-09T15:49:36.147] JobNode[0] CPU[0] Job alloc
[2022-03-09T15:49:36.147] ====================
[2022-03-09T15:49:36.147] Launching batch job 1398 for UID 182649
[2022-03-09T15:49:36.283] [1398.batch] task/cgroup: _memcg_initialize: /slurm/uid_182649/job_1398: alloc=100MB mem.limit=100MB memsw.limit=unlimited
[2022-03-09T15:49:36.284] [1398.batch] task/cgroup: _memcg_initialize: /slurm/uid_182649/job_1398/step_batch: alloc=100MB mem.limit=100MB memsw.limit=unlimited
[2022-03-09T15:51:30.881] [1398.batch] error: *** JOB 1398 ON hepp03.hpc.uio.no CANCELLED AT 2022-03-09T15:51:30 ***
[2022-03-09T15:51:31.902] [1398.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:15
[2022-03-09T15:51:31.906] [1398.batch] done with job

Result of journalctl | grep 2469, where 2469 is the Galaxy job number:

Mar 09 15:49:34 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.web_stack.transport DEBUG 2022-03-09 15:49:34,669 [pN:main.web.1,p:2277100,w:1,m:0,tN:uWSGIWorker1Core0] Sending message to farm job-handlers: {"target": "job_handler", "params": {"task": "setup", "job_id": 2469}, "__classname__": "JobHandlerMessage"}
Mar 09 15:49:34 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.web_stack.handlers INFO 2022-03-09 15:49:34,669 [pN:main.web.1,p:2277100,w:1,m:0,tN:uWSGIWorker1Core0] (Job[id=2469,tool_id=interactive_tool_jupyter_notebook_fys5555]) Handler 'job-handlers' assigned using 'HANDLER_ASSIGNMENT_METHODS.UWSGI_MULE_MESSAGE' assignment method
Mar 09 15:49:34 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.web_stack.transport DEBUG 2022-03-09 15:49:34,669 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:UWSGIFarmMessageTransport.dispatcher_thread] Received message: {"target": "job_handler", "params": {"task": "setup", "job_id": 2469}, "__classname__": "JobHandlerMessage"}
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.mapper DEBUG 2022-03-09 15:49:35,494 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:JobHandlerQueue.monitor_thread] (2469) Mapped job to destination id: None
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.handler DEBUG 2022-03-09 15:49:35,507 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:JobHandlerQueue.monitor_thread] (2469) Dispatching to slurm runner
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs DEBUG 2022-03-09 15:49:35,516 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:JobHandlerQueue.monitor_thread] (2469) Persisting job destination (destination id: None)
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs DEBUG 2022-03-09 15:49:35,525 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:JobHandlerQueue.monitor_thread] (2469) Working directory for job is: /storage/galaxy/jobs_directory/002/2469
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.runners DEBUG 2022-03-09 15:49:35,530 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:JobHandlerQueue.monitor_thread] Job [2469] queued (22.960 ms)
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.handler INFO 2022-03-09 15:49:35,534 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:JobHandlerQueue.monitor_thread] (2469) Job dispatched
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs DEBUG 2022-03-09 15:49:35,591 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:SlurmRunner.work_thread-0] Job wrapper for Job [2469] prepared (49.347 ms)
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.command_factory INFO 2022-03-09 15:49:35,614 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:SlurmRunner.work_thread-0] Built script [/storage/galaxy/jobs_directory/002/2469/tool_script.sh] for tool command [. /etc/profile.d/eb_env.sh && . /root/.profile &&   export GALAXY_WORKING_DIR=`pwd` && mkdir -p ./jupyter/ && mkdir -p ./jupyter/outputs/ && mkdir -p ./jupyter/data &&  ln -sf 'None' './jupyter/data/1_None.data' &&   cd ./jupyter/ && export HOME=/home/jovyan/ && export PATH=/home/jovyan/.local/bin:$PATH &&   cp '/srv/galaxy/local_tools/interactive/README.ipynb' ./README.ipynb && jupyter trust ./README.ipynb &&   cp -r /storage/shared/software/Input ./ && jupyter trust ./Input/OpenDataPandaFramework13TeV.ipynb  &&  cp -r /storage/shared/software/Notebooks ./ && jupyter trust ./Notebooks/ATLASOpenData/13TeV/*.ipynb && jupyter trust ./Notebooks/Polarquest2018/*.ipynb && jupyter trust ./Notebooks/Statistics/*.ipynb && jupyter trust ./Notebooks/SUSYPhenoPlot/*.ipynb &&  cp -r /storage/shared/software/Scripts ./ &&       jupyter lab --allow-root --no-browser  --ServerApp.token='']
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.runners DEBUG 2022-03-09 15:49:35,663 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:SlurmRunner.work_thread-0] (2469) command is: (python '/srv/galaxy/server'/lib/galaxy_ext/container_monitor/monitor.py &); sleep 1 ; mkdir -p working outputs configs
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: docker run -e "GALAXY_SLOTS=$GALAXY_SLOTS" -e "GALAXY_MEMORY_MB=$GALAXY_MEMORY_MB" -e "GALAXY_MEMORY_MB_PER_SLOT=$GALAXY_MEMORY_MB_PER_SLOT" -e "HOME=$HOME" -e "_GALAXY_JOB_HOME_DIR=$_GALAXY_JOB_HOME_DIR" -e "_GALAXY_JOB_TMP_DIR=$_GALAXY_JOB_TMP_DIR" -e "TMPDIR=$TMPDIR" -e "TMP=$TMP" -e "TEMP=$TEMP" -e "HISTORY_ID=$HISTORY_ID" -e "REMOTE_HOST=$REMOTE_HOST" -e "GALAXY_WEB_PORT=$GALAXY_WEB_PORT" -e "GALAXY_URL=$GALAXY_URL" -e "API_KEY=$API_KEY" -p 8888 --name e1e3317ee18846799e65413cd8cffa98 -v /srv/galaxy/server:/srv/galaxy/server:ro -v /srv/galaxy/local_tools/interactive:/srv/galaxy/local_tools/interactive:ro -v /storage/galaxy/jobs_directory/002/2469:/storage/galaxy/jobs_directory/002/2469:ro -v /storage/galaxy/jobs_directory/002/2469/outputs:/storage/galaxy/jobs_directory/002/2469/outputs:rw -v /storage/galaxy/jobs_directory/002/2469/configs:/storage/galaxy/jobs_directory/002/2469/configs:rw -v "$_GALAXY_JOB_TMP_DIR:$_GALAXY_JOB_TMP_DIR:rw" -v /storage/galaxy/jobs_directory/002/2469/working:/storage/galaxy/jobs_directory/002/2469/working:rw -v /storage/galaxy/files:/storage/galaxy/files:rw -v /srv/galaxy/var/tool-data:/srv/galaxy/var/tool-data:ro -v /srv/galaxy/var/tool-data:/srv/galaxy/var/tool-data:ro -v /storage/shared/data:/storage/shared/data:ro -v /storage/shared/software:/storage/shared/software:ro -v /storage/software:/storage/software:ro --cpus ${GALAXY_SLOTS:-1} -w /storage/galaxy/jobs_directory/002/2469/working --net bridge --rm maikenp/docker-jupyter-fys5555:v2022-01-17 /bin/sh /storage/galaxy/jobs_directory/002/2469/tool_script.sh > ../outputs/tool_stdout 2> ../outputs/tool_stderr; return_code=$?; cd '/storage/galaxy/jobs_directory/002/2469';
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.runners.drmaa DEBUG 2022-03-09 15:49:35,685 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:SlurmRunner.work_thread-0] (2469) submitting file /storage/galaxy/jobs_directory/002/2469/galaxy_2469.sh
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.runners.drmaa DEBUG 2022-03-09 15:49:35,685 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:SlurmRunner.work_thread-0] (2469) native specification is: --cpus-per-task=1 --mem=100 --time=24:00:00
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: t #22bf0c [713157.01] -> drmaa_set_attribute(jt=0x7f6028119a80, name='drmaa_remote_command', value='/storage/galaxy/jobs_directory/002/2469/galaxy_2469.sh')
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: t #22bf0c [713157.01] -> fsd_template_set_attr(drmaa_remote_command=/storage/galaxy/jobs_directory/002/2469/galaxy_2469.sh)
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: t #22bf0c [713157.01] -> drmaa_set_attribute(jt=0x7f6028119a80, name='drmaa_job_name', value='g2469_interactive_tool_jupyter_notebook_fys5555_hikingm_yahoo_no')
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: t #22bf0c [713157.01] -> fsd_template_set_attr(drmaa_job_name=g2469_interactive_tool_jupyter_notebook_fys5555_hikingm_yahoo_no)
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: t #22bf0c [713157.01] -> drmaa_set_attribute(jt=0x7f6028119a80, name='drmaa_wd', value='/storage/galaxy/jobs_directory/002/2469')
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: t #22bf0c [713157.01] -> fsd_template_set_attr(drmaa_wd=/storage/galaxy/jobs_directory/002/2469)
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: t #22bf0c [713157.01] -> drmaa_set_attribute(jt=0x7f6028119a80, name='drmaa_output_path', value=':/storage/galaxy/jobs_directory/002/2469/galaxy_2469.o')
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: t #22bf0c [713157.01] -> fsd_template_set_attr(drmaa_output_path=:/storage/galaxy/jobs_directory/002/2469/galaxy_2469.o)
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: t #22bf0c [713157.01] -> drmaa_set_attribute(jt=0x7f6028119a80, name='drmaa_error_path', value=':/storage/galaxy/jobs_directory/002/2469/galaxy_2469.e')
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: t #22bf0c [713157.01] -> fsd_template_set_attr(drmaa_error_path=:/storage/galaxy/jobs_directory/002/2469/galaxy_2469.e)
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: d #22bf0c [713157.01]  * # job_name = g2469_interactive_tool_jupyter_notebook_fys5555_hikingm_yahoo_no
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: d #22bf0c [713157.01]  | /storage/galaxy/jobs_directory/002/2469/galaxy_2469.sh
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: d #22bf0c [713157.01]  * # work_dir = /storage/galaxy/jobs_directory/002/2469
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: d #22bf0c [713157.01]  |   drmaa_output_path: :/storage/galaxy/jobs_directory/002/2469/galaxy_2469.o -> /storage/galaxy/jobs_directory/002/2469/galaxy_2469.o
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: d #22bf0c [713157.01]  |   drmaa_error_path: :/storage/galaxy/jobs_directory/002/2469/galaxy_2469.e -> /storage/galaxy/jobs_directory/002/2469/galaxy_2469.e
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: d #22bf0c [713157.01]  * # output = /storage/galaxy/jobs_directory/002/2469/galaxy_2469.o
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: d #22bf0c [713157.01]  * # error = /storage/galaxy/jobs_directory/002/2469/galaxy_2469.e
Mar 09 15:49:35 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.runners.drmaa INFO 2022-03-09 15:49:35,689 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:SlurmRunner.work_thread-0] (2469) queued as 1398
Mar 09 15:49:36 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.runners.drmaa DEBUG 2022-03-09 15:49:36,237 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:SlurmRunner.monitor_thread] (2469/1398) state change: job is running
Mar 09 15:51:29 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.managers.interactivetool DEBUG 2022-03-09 15:51:29,974 [pN:main.web.1,p:2277100,w:1,m:0,tN:uWSGIWorker1Core0] Stopping Job: <galaxy.model.Job(2469) at 0x7f6035b0e7b8> for InteractiveToolEntryPoint: <galaxy.model.InteractiveToolEntryPoint(1152) at 0x7f6035b0eb00>
Mar 09 15:51:30 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.handler DEBUG 2022-03-09 15:51:30,874 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:JobHandlerStopQueue.monitor_thread] Stopping job 2469 in slurm runner
Mar 09 15:51:30 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.runners.drmaa INFO 2022-03-09 15:51:30,877 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:JobHandlerStopQueue.monitor_thread] (2469/1398) Removed from DRM queue at user's request
Mar 09 15:51:31 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.runners.drmaa DEBUG 2022-03-09 15:51:31,152 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:SlurmRunner.monitor_thread] (2469/1398) state change: job finished, but failed
Mar 09 15:51:31 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.runners.slurm DEBUG 2022-03-09 15:51:31,206 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:SlurmRunner.monitor_thread] (2469/1398) Waiting 1 seconds for failed job to exit COMPLETING state for post-mortem
Mar 09 15:51:32 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.runners.slurm DEBUG 2022-03-09 15:51:32,241 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:SlurmRunner.monitor_thread] (2469/1398) Waiting 2 seconds for failed job to exit COMPLETING state for post-mortem
Mar 09 15:51:34 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs.runners.slurm DEBUG 2022-03-09 15:51:34,275 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:SlurmRunner.monitor_thread] Checking /storage/galaxy/jobs_directory/002/2469/galaxy_2469.e for exceeded memory message from SLURM
Mar 09 15:51:34 galaxy-hepp.hpc.uio.no uwsgi[2277003]: galaxy.jobs DEBUG 2022-03-09 15:51:34,453 [pN:main.job-handlers.1,p:2277103,w:0,m:1,tN:SlurmRunner.work_thread-1] job_wrapper.finish for job 2469 executed (113.178 ms)

Galaxy Version and/or server at which you observed the bug: Galaxy Version 21.09, Slurm backend.

To Reproduce: Steps to reproduce the behavior:

  1. Go to galaxy-hepp.hpc.uio.no
  2. Click on Interactive Tools -> Notebooks for FYS5555 or Notebooks for ML and start a job with default settings
  3. Once job has started, let it run a bit, then stop it via the Active Interactive Tools menu
  4. The job stops as expected in Galaxy, but the docker container keeps running on the compute node

Expected behavior: When the interactive tool is stopped, the docker container should be stopped and removed on the compute node.


mvdbeek commented 2 years ago

It seems there are a few ways in which scancel can be configured in Slurm (and I suppose by extension there are corresponding DRMAA options). @natefoo, do you have any insights here? I suppose we'd want to make sure a SIGTERM is sent first?
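
For context, how a cancelled job gets signalled is governed by the site's Slurm configuration; a quick, illustrative way to inspect the relevant settings (KillWait is the delay between the initial SIGTERM and the final SIGKILL):

# Show the Slurm settings that govern how cancelled jobs are signalled
scontrol show config | grep -E 'KillWait|ProctrackType|TaskPlugin'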

mvdbeek commented 2 years ago

@maikenp Actually, it would also be interesting if you could kill an IT job with scancel and with scancel --signal=TERM, and see whether the docker container continues running.

maikenp commented 2 years ago

Thanks for looking into this. So you want me to kill it manually on the command line, to see whether either of the scancel variants actually kills the docker container? I will do that.

mvdbeek commented 2 years ago

Yes, exactly.

maikenp commented 2 years ago

[root@hepp03 ~]# squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1400      main g2471_in   galaxy  R       0:03      1 hepp03.hpc.uio.no
[root@hepp03 ~]# docker ps | grep fys5555
8d192d231a02   maikenp/docker-jupyter-fys5555:v2022-01-17                              "tini -g -- /bin/sh …"   21 seconds ago      Up 20 seconds      0.0.0.0:49157->8888/tcp, :::49157->8888/tcp             748093730058460c940a937eea4201f0
[root@hepp03 ~]# scancel --signal=TERM 1400
[root@hepp03 ~]# docker ps | grep fys5555
8d192d231a02   maikenp/docker-jupyter-fys5555:v2022-01-17                              "tini -g -- /bin/sh …"   About a minute ago   Up About a minute   0.0.0.0:49157->8888/tcp, :::49157->8888/tcp             748093730058460c940a937eea4201f0
[root@hepp03 ~]# squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1400      main g2471_in   galaxy  R       3:09      1 hepp03.hpc.uio.no
[root@hepp03 ~]# docker ps | grep fys5555
8d192d231a02   maikenp/docker-jupyter-fys5555:v2022-01-17                              "tini -g -- /bin/sh …"   3 minutes ago       Up 3 minutes       0.0.0.0:49157->8888/tcp, :::49157->8888/tcp             748093730058460c940a937eea4201f0

[root@hepp03 ~]# scancel 1400
[root@hepp03 ~]# squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1401      main g2472_in   galaxy  R       5:05      1 hepp03.hpc.uio.no

(1401 is the next job that I started; 1400 is gone, as you can see in the squeue output)

scancel --signal=TERM does not cancel the job in Slurm. scancel without arguments (the default is SIGKILL) cancels the job in Slurm and causes an error in the Galaxy tool history, but the container is still running.
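
Note that, depending on the Slurm version, scancel by default delivers non-KILL signals only to the job steps and not to the batch shell that launched docker run, so reaching the container's parent may require targeting the batch step or the whole job tree explicitly, for example:

# Send SIGTERM to the batch step (the job's shell script) only
scancel --signal=TERM --batch 1400
# Send SIGTERM to the batch script and all of its child processes
scancel --signal=TERM --full 1400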

Screenshot 2022-03-09 at 21 39 06

maikenp commented 2 years ago

By the way, we are using cgroups in Slurm:

[root@hepp03 ~]# grep cgroup /etc/slurm/slurm.conf
TaskPlugin=task/affinity,task/cgroup
ProctrackType=proctrack/cgroup

Maybe this is a Slurm configuration problem? Although cgroups should kill related processes...

innovate-invent commented 2 years ago

Peanut gallery here... If Slurm is calling docker run, then the container itself is executed by the Docker daemon (under the daemon's user) and no longer belongs to the Slurm cgroup. The only process that would remain under the Slurm job is the docker CLI process.
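
A rough way to confirm this on the compute node (a sketch; <container_name> stands for the Galaxy-assigned container name):

# The container's init process is parented by containerd-shim/dockerd, not by
# the Slurm job, so it sits outside the job's cgroup.
cid=$(docker ps -q --filter "name=<container_name>")
cpid=$(docker inspect --format '{{.State.Pid}}' "$cid")
ps -o pid,ppid,cgroup,cmd -p "$cpid"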

dacolombo commented 2 years ago

Hi. We have a similar issue. We are running interactive tools on Galaxy release 20.05 using pulsar_embedded as the runner, as described in the GTN interactive tools guide at https://training.galaxyproject.org/training-material/topics/admin/tutorials/interactive-tools/tutorial.html.

This is the job_conf.xml file:

<plugins workers="4">
    <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner"/>
    <plugin id="pulsar_embedded" type="runner" load="galaxy.jobs.runners.pulsar:PulsarEmbeddedJobRunner">
        <param id="pulsar_config">/home/galaxy/galaxy/config/pulsar_app.yml</param>
    </plugin>
</plugins>
<destinations default="local">
    <destination id="local" runner="local"/>
    <destination id="interactive_local" runner="pulsar_embedded">
        <param id="docker_enabled">true</param>
        <param id="docker_volumes">$defaults</param>
        <param id="docker_sudo">false</param>
        <param id="docker_net">bridge</param>
        <param id="docker_auto_rm">true</param>
        <param id="docker_set_user"></param>
        <param id="require_container">true</param>
        <param id="container_monitor_result">callback</param>
    </destination>
</destinations>

This is the pulsar_config.yml file:

# The path where per-job directories will be created
staging_directory: "/export/database/job_working_dir/_interactive"

# Where Pulsar state information will be stored (e.g. currently active jobs)
persistence_directory: "/home/galaxy/galaxy/var/pulsar"

# Where to find Galaxy tool dependencies
tool_dependency_dir: "/export/tool_deps"

# How to run jobs (see https://pulsar.readthedocs.io/en/latest/job_managers.html)
managers:
  _default_:
    type: queued_python
    num_concurrent_jobs: 1

The interactive tools work fine, but when they are stopped from the GUI the container keeps running.

Docker run command retrieved from the Galaxy log:

docker run -e "GALAXY_SLOTS=$GALAXY_SLOTS" -e "HOME=$HOME" -e "_GALAXY_JOB_HOME_DIR=$_GALAXY_JOB_HOME_DIR" -e "_GALAXY_JOB_TMP_DIR=$_GALAXY_JOB_TMP_DIR" -e "TMPDIR=$TMPDIR" -e "TMP=$TMP" -e "TEMP=$TEMP" -p 8000 --name d9633218e0c6420aa4aeb9aa4736d0ed -v /export/database/job_working_dir/_interactive/79:/export/database/job_working_dir/_interactive/79:ro -v /export/database/job_working_dir/_interactive/79/tool_files:/export/database/job_working_dir/_interactive/79/tool_files:ro -v /export/database/job_working_dir/_interactive/79/outputs:/export/database/job_working_dir/_interactive/79/outputs:rw -v /export/database/job_working_dir/_interactive/79/working:/export/database/job_working_dir/_interactive/79/working:rw -v /home/galaxy/galaxy/server/tool-data:/home/galaxy/galaxy/server/tool-data:ro -v /home/galaxy/galaxy/server/tool-data:/home/galaxy/galaxy/server/tool-data:ro -w /export/database/job_working_dir/_interactive/79/working --net bridge --rm shiltemann/ethercalc-galaxy-ie:17.05 /bin/sh /export/database/job_working_dir/_interactive/79/tool_script.sh; return_code=$?; sh -c "exit $return_code"]

Galaxy log at the time of the GUI stop command:

Mar 16 14:52:44 express-it-finale.cloud.ba.infn.it uwsgi[22853]: galaxy.jobs.handler DEBUG 2022-03-16 14:52:44,741 [p:22868,w:0,m:1] [JobHandlerStopQueue.monitor_thread] Stopping job 79 in pulsar_embedded runner
Mar 16 14:52:44 express-it-finale.cloud.ba.infn.it uwsgi[22853]: galaxy.jobs.runners.pulsar DEBUG 2022-03-16 14:52:44,775 [p:22868,w:0,m:1] [JobHandlerStopQueue.monitor_thread] Attempt remote Pulsar kill of job with url pulsar_embedded and id 79
Mar 16 14:52:44 express-it-finale.cloud.ba.infn.it uwsgi[22853]: pulsar.managers.unqueued INFO 2022-03-16 14:52:44,775 [p:22868,w:0,m:1] [JobHandlerStopQueue.monitor_thread] Attempting to kill job with job_id 79
Mar 16 14:52:44 express-it-finale.cloud.ba.infn.it uwsgi[22853]: pulsar.managers.unqueued INFO 2022-03-16 14:52:44,776 [p:22868,w:0,m:1] [JobHandlerStopQueue.monitor_thread] Attempting to kill pid 23201

docker ps output after the interactive tool job is stopped:

[root@express-it-finale galaxy]# docker ps
CONTAINER ID   IMAGE                                  COMMAND                  CREATED              STATUS              PORTS                                                 NAMES
b0313e1d5025   shiltemann/ethercalc-galaxy-ie:17.05   "/bin/sh /export/dat…"   About a minute ago   Up About a minute   80/tcp, 0.0.0.0:49226->8000/tcp, :::49226->8000/tcp   d9633218e0c6420aa4aeb9aa4736d0ed

The PID of the process that gets killed is read from the pid file in the job working dir created by Pulsar for the interactive tool, and it corresponds to the command.sh PID:

[root@express-it-finale galaxy]# cat /export/database/job_working_dir/_interactive/80/pid
"23622"
[root@express-it-finale galaxy]# ps -aux | grep 23622
galaxy   23622  0.0  0.0 113280  1480 ?        S    14:54   0:00 /bin/bash /export/database/job_working_dir/_interactive/80/command.sh

Manually killing this process does not stop and remove the container:

[root@express-it-finale galaxy]# kill -9 23622; docker ps
CONTAINER ID   IMAGE                                  COMMAND                  CREATED         STATUS         PORTS                                                 NAMES
157e9cfeb9bc   shiltemann/ethercalc-galaxy-ie:17.05   "/bin/sh /export/dat…"   2 minutes ago   Up 2 minutes   80/tcp, 0.0.0.0:49227->8000/tcp, :::49227->8000/tcp   bcf605c3cecb4140b1ac67ca1d13b1ea

Instead, the container is successfully stopped and removed by killing the PID of the interactive tool's tool_script.sh:

[root@express-it-finale galaxy]# ps -aux | grep tool_script.sh
root     23753  0.0  0.0   4484   724 ?        Ss   14:54   0:00 /bin/sh /export/database/job_working_dir/_interactive/80/tool_script.sh
[root@express-it-finale galaxy]# kill -9 23753
[root@express-it-finale galaxy]# docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

dacolombo commented 2 years ago

I've found a solution for stopping both the interactive tool job and its container. I noticed that when using kill to stop the job from the command line, the docker run process keeps running:

[root@express-it-finale centos]# cat /export/database/job_working_dir/_interactive/111/pid
"3757"
[root@express-it-finale centos]# kill -9 3757
[root@express-it-finale centos]# ps -aux | grep docker
galaxy    3789  0.0  0.8 883288 33548 ?        Sl   07:30   0:00 docker run -e GALAXY_SLOTS=1 -e HOME=/home/galaxy/galaxy -e _GALAXY_JOB_HOME_DIR=/export/database/job_working_dir/_interactive/111/working/home -e _GALAXY_JOB_TMP_DIR= -e TMPDIR= -e TMP= -e TEMP= -e HISTORY_ID= -e REMOTE_HOST= -e GALAXY_WEB_PORT= -e GALAXY_URL= -e API_KEY= -p 8888 --name 7b85bd0eeb594fb1a4976224fdd91ecb -v /export/database/job_working_dir/_interactive/111:/export/database/job_working_dir/_interactive/111:ro -v /export/database/job_working_dir/_interactive/111/tool_files:/export/database/job_working_dir/_interactive/111/tool_files:ro -v /export/database/job_working_dir/_interactive/111/outputs:/export/database/job_working_dir/_interactive/111/outputs:rw -v /export/database/job_working_dir/_interactive/111/working:/export/database/job_working_dir/_interactive/111/working:rw -v /home/galaxy/galaxy/server/tool-data:/home/galaxy/galaxy/server/tool-data:ro -v /home/galaxy/galaxy/server/tool-data:/home/galaxy/galaxy/server/tool-data:ro -w /export/database/job_working_dir/_interactive/111/working --net bridge --rm quay.io/bgruening/docker-jupyter-notebook:ie2 /bin/sh /export/database/job_working_dir/_interactive/111/tool_script.sh
[root@express-it-finale centos]# docker ps
CONTAINER ID   IMAGE                                           COMMAND                  CREATED          STATUS          PORTS                                         NAMES
2be9e1cb3b4e   quay.io/bgruening/docker-jupyter-notebook:ie2   "tini -g -- /bin/sh …"   50 minutes ago   Up 50 minutes   0.0.0.0:49254->8888/tcp, :::49254->8888/tcp   7b85bd0eeb594fb1a4976224fdd91ecb

Instead, when using pkill to stop the job from the command line, the job's child processes and the container are stopped:

[root@express-it-finale centos]# cat /export/database/job_working_dir/_interactive/112/pid
"4399"
[root@express-it-finale centos]# pkill -9 -P 4399
[root@express-it-finale centos]# ps -aux | grep docker
root      2647  0.2  1.7 1327964 69704 ?       Ssl  Mar11  24:12 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
[root@express-it-finale centos]# docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

I then modified the kill_pid function in the pulsar.managers.util.kill module (which is the one used by embedded Pulsar to kill a job by its PID) to use pkill when stopping a job from the GUI. This is the modified function:

def kill_pid(pid, use_psutil=True):
    # pkill -9 -P kills the direct children of pid (including the foreground
    # docker run), so the container is torn down along with the job.
    subprocess.Popen(args=f"pkill -9 -P {pid}", shell=True)

With this modification, the container is stopped and removed when an interactive tool job is stopped in the GUI. This solution is specific to Pulsar, so it may be a little off-topic with respect to @maikenp's initial issue. Also, it doesn't cover the Windows-related functions in the same Pulsar module, but I hope it is of some help for this issue anyway.

natefoo commented 2 years ago

These are two separate but related issues. As @innovate-invent points out, when you use docker run, your actual container is run by the persistent docker daemon running on the host, and so it "escapes" the process environment of the job. That said, when a foreground docker run command is signalled, it should cause dockerd to terminate the container. If we are running those containers detached (docker run -d), then there is nothing to signal.

In a one-off workshop instance I had set up, I used this Slurm epilog script to clean up containers after the job: https://github.com/natefoo/usegalaxy-clone/blob/master/files/slurm/epilog.sh

In the Pulsar queued_python manager, we should probably be killing the process group, not the pid, but if we're running detached containers that still won't fix it.
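
As a shell-level illustration of the process-group idea (just a sketch, not the actual Pulsar change; it assumes the docker run process shares the process group of the recorded job PID):

# Look up the process group of the PID recorded by Pulsar and signal the whole
# group, so the foreground docker run is reached as well.
pid=$(tr -d '"' < /export/database/job_working_dir/_interactive/112/pid)
pgid=$(ps -o pgid= -p "$pid" | tr -d ' ')
kill -TERM -- "-$pgid"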

jmchilton commented 2 years ago

From https://dhruveshp.com/blog/2021/signal-propagation-on-slurm/:

slurm will send SIGTERM to the job and wait for certain amount of time before it sends the final SIGKILL.

So I would assume something like

diff --git a/lib/galaxy/tool_util/deps/container_classes.py b/lib/galaxy/tool_util/deps/container_classes.py
index 0d92bc506f..82dd8143c2 100644
--- a/lib/galaxy/tool_util/deps/container_classes.py
+++ b/lib/galaxy/tool_util/deps/container_classes.py
@@ -349,7 +349,7 @@ class DockerContainer(Container, HasDockerLikeVolumes):
 _on_exit() {{
   {kill_command} &> /dev/null
 }}
-trap _on_exit 0
+trap _on_exit 0 SIGTERM
 {cache_command}
 {run_command}"""

or

diff --git a/lib/galaxy/tool_util/deps/container_classes.py b/lib/galaxy/tool_util/deps/container_classes.py
index 0d92bc506f..82dd8143c2 100644
--- a/lib/galaxy/tool_util/deps/container_classes.py
+++ b/lib/galaxy/tool_util/deps/container_classes.py
@@ -349,7 +349,8 @@ class DockerContainer(Container, HasDockerLikeVolumes):
 _on_exit() {{
   {kill_command} &> /dev/null
 }}
 trap _on_exit 0
+trap _on_exit SIGTERM
 {cache_command}
 {run_command}"""

would work? Anyone have a clean way to try this out quickly?
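
One rough way to exercise the trap outside of Galaxy (a sketch with an arbitrary image and names, with docker rm -f standing in for the generated kill_command):

# Minimal wrapper mimicking the generated script: cleanup trap plus a foreground docker run.
cat > /tmp/trap_test.sh <<'EOF'
_on_exit() {
  docker rm -f trap-test &> /dev/null
}
trap _on_exit 0 SIGTERM
docker run --rm --name trap-test alpine:3 sleep 600
EOF
bash /tmp/trap_test.sh &
wrapper=$!
sleep 5
kill -TERM "$wrapper"                         # what Slurm sends first on scancel
sleep 2; docker ps --filter name=trap-test    # check whether the container survived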

maikenp commented 2 years ago

These are two separate but related issues. As @innovate-invent points out, when you use docker run, your actual container is run by the persistent docker daemon running on the host, and so it "escapes" the process environment of the job. That said, when a foreground docker run command is signalled, it should cause dockerd to terminate the container. If we are running those containers detached (docker run -d), then there is nothing to signal.

In a one-off workshop instance I had set up, I used this Slurm epilog script to clean up containers after the job: https://github.com/natefoo/usegalaxy-clone/blob/master/files/slurm/epilog.sh

In the Pulsar queued_python manager, we should probably be killing the process group, not the pid, but if we're running detached containers that still won't fix it.

Thanks for the suggestion. We finally implemented this, and it works as expected.

The only change is that the path to container_config.json is in the configs subdir, not directly under $workdir: container_config="${workdir}/configs/container_config.json"
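
For reference, a minimal sketch of what such an epilog can look like with that path adjustment (based on the idea in natefoo's script rather than a copy of it; the container_name key in container_config.json is an assumption):

#!/bin/bash
# Slurm epilog sketch: find the job's working directory, read the container name
# that Galaxy recorded, and force-remove the container if it is still running.
workdir=$(scontrol show job "$SLURM_JOB_ID" | sed -n 's/^ *WorkDir=//p')
container_config="${workdir}/configs/container_config.json"
if [ -f "$container_config" ]; then
    name=$(python3 -c 'import json,sys; print(json.load(open(sys.argv[1])).get("container_name", ""))' "$container_config")
    [ -n "$name" ] && docker rm -f "$name" &> /dev/null
fi
exit 0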

So it is fine that this can be handled in an epilog script, but is there a way it could be handled by Galaxy instead?

maikenp commented 2 years ago

From https://dhruveshp.com/blog/2021/signal-propagation-on-slurm/:

slurm will send SIGTERM to the job and wait for certain amount of time before it sends the final SIGKILL.

So I would assume something like

diff --git a/lib/galaxy/tool_util/deps/container_classes.py b/lib/galaxy/tool_util/deps/container_classes.py
index 0d92bc506f..82dd8143c2 100644
--- a/lib/galaxy/tool_util/deps/container_classes.py
+++ b/lib/galaxy/tool_util/deps/container_classes.py
@@ -349,7 +349,7 @@ class DockerContainer(Container, HasDockerLikeVolumes):
 _on_exit() {{
   {kill_command} &> /dev/null
 }}
-trap _on_exit 0
+trap _on_exit 0 SIGTERM
 {cache_command}
 {run_command}"""

or

diff --git a/lib/galaxy/tool_util/deps/container_classes.py b/lib/galaxy/tool_util/deps/container_classes.py
index 0d92bc506f..82dd8143c2 100644
--- a/lib/galaxy/tool_util/deps/container_classes.py
+++ b/lib/galaxy/tool_util/deps/container_classes.py
@@ -349,7 +349,8 @@ class DockerContainer(Container, HasDockerLikeVolumes):
 _on_exit() {{
   {kill_command} &> /dev/null
 }}
 trap _on_exit 0
+trap _on_exit SIGTERM
 {cache_command}
 {run_command}"""

would work? Anyone have a clean way to try this out quickly?

Thanks, we did try both of these suggestions. Neither had any effect on our system. But as you see, the epilog solution worked, which is of course unsurprising, since it simply extracts the container ID from the corresponding Slurm job and runs docker kill. Which is fine, I guess.