galaxyproject / galaxy-helm

Minimal setup required to run Galaxy under Kubernetes
MIT License

Job directories failed to be deleted for failed jobs #398

Closed: pcm32 closed this 1 year ago

pcm32 commented 1 year ago

I see this error in the Galaxy logs every time there is a failed job:

galaxy.jobs.runners.kubernetes DEBUG 2022-12-13 12:57:25,734 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-0] (513/gxy-galaxy-dev-skl7v) Terminated at user's request
galaxy.objectstore CRITICAL 2022-12-13 12:57:25,823 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-0] None delete error [Errno 13] Permission denied: 'fontlist-v330.json'

I suspect this is stopping the setup from deleting the job directory.

nuwang commented 1 year ago

What is fontlist-v330.json? Do you know where on the filesystem it's located? Maybe this is an object store configuration issue? I don't recall having seen this before.

pcm32 commented 1 year ago

Not sure really... I do see this on the job directories:

galaxy@galaxy-dev-job-0-5cdc9d777f-v6q4q:/galaxy/server/database/jobs_directory/_cleared_contents/000/627/20221214-164444$ ls -ltr
total 72
-rw-r--r-- 1 galaxy root    0 Dec 14 16:44 galaxy_627.o
drwxr-sr-x 2 galaxy root 4096 Dec 14 16:44 inputs
-rwxr-xr-x 1 galaxy root 1109 Dec 14 16:44 tool_script.sh
-rwxr-xr-x 1 galaxy root 5900 Dec 14 16:44 galaxy_627.sh
-rw-r--r-- 1 root   root    0 Dec 14 16:44 memory_statement.log
-rw-r--r-- 1 root   root    2 Dec 14 16:44 __instrument_core_galaxy_slots
-rw-r--r-- 1 root   root    5 Dec 14 16:44 __instrument_core_galaxy_memory_mb
-rw-r--r-- 1 root   root   11 Dec 14 16:44 __instrument_core_epoch_start
drwxr-sr-x 2 root   root 4096 Dec 14 16:44 configs
drwxr-sr-x 2 root   root 4096 Dec 14 16:44 _working
drwxr-sr-x 2 root   root 4096 Dec 14 16:44 _outputs
drwxr-sr-x 2 root   root 4096 Dec 14 16:44 _configs
drwxr-sr-x 2 galaxy root 4096 Dec 14 16:44 outputs
drwxr-sr-x 2 galaxy root 4096 Dec 14 16:44 working
drwxr-sr-x 4 root   root 4096 Dec 14 16:44 home
-rw-r--r-- 1 root   root    4 Dec 14 16:44 galaxy_627.ec
-rw-r--r-- 1 root   root   11 Dec 14 16:44 __instrument_core_epoch_end
drwxr-sr-x 2 root   root 4096 Dec 14 16:44 tmp
-rw-r--r-- 1 galaxy root   62 Dec 14 16:44 galaxy_627.e

I'm guessing the files owned by root might cause issues for whatever process tries to delete them? Or is this handled?
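For what it's worth, on POSIX filesystems deleting a file requires write permission on the parent directory, not on the file itself, so a root-owned directory such as home/.cache/matplotlib would block the cleanup for the galaxy user regardless of who owns the file inside it. A minimal Python sketch of that failure mode (the paths are illustrative, not Galaxy's actual ones):

```python
import os
import shutil
import stat
import tempfile

# Recreate the problematic layout in a scratch directory: a cache dir
# containing a file standing in for fontlist-v330.json.
base = tempfile.mkdtemp()
cache = os.path.join(base, "home", ".cache", "matplotlib")
os.makedirs(cache)
target = os.path.join(cache, "fontlist-v330.json")
open(target, "w").close()

# Dropping the write bit on the parent dir stands in for it being owned
# by root (drwxr-sr-x root root) as seen from the galaxy user.
os.chmod(cache, stat.S_IRUSR | stat.S_IXUSR)

try:
    os.unlink(target)
    blocked = False  # only happens as root: root bypasses permission checks
except PermissionError as err:
    blocked = True   # the same 'Permission denied' the Galaxy log shows
    print("delete error:", err)

# Restore the write bit so the scratch tree can be cleaned up normally.
os.chmod(cache, stat.S_IRWXU)
shutil.rmtree(base)
```

Run as a regular user this raises PermissionError on the unlink even though the file itself is readable, which matches the `delete error [Errno 13] Permission denied` above.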

pcm32 commented 1 year ago

I can't find that fontlist-v330.json file in there, at least.

pcm32 commented 1 year ago

ahhh, found them:

rm: cannot remove '000/623/20221214-163640/home/.cache/matplotlib/fontlist-v330.json': Permission denied
rm: cannot remove '000/623/20221214-163640/home/.config/matplotlib': Permission denied

they are:

galaxy@galaxy-dev-job-0-5cdc9d777f-v6q4q:/galaxy/server/database/jobs_directory/_cleared_contents$ ls -l 000/623/20221214-163640/home/.cache/matplotlib/fontlist-v330.json
-rw-r--r-- 1 root root 24399 Dec 14 16:36 000/623/20221214-163640/home/.cache/matplotlib/fontlist-v330.json

so how does the tool get to write those files as root? Are we failing to pass user 101 to the jobs for some reason?

I'm guessing these might be specific to some of the scanpy tools that we use, but they shouldn't be written as root anyway, right?

pcm32 commented 1 year ago

Final runner setup looks like this in the job_conf.yml:

runners:
  k8s:
    k8s_cleanup_job: never
    k8s_extra_job_envs:
      HDF5_USE_FILE_LOCKING: "FALSE"
    k8s_fs_group_id: "101"
    k8s_galaxy_instance_id: 'galaxy-dev'
    k8s_interactivetools_ingress_annotations: |

      nginx.ingress.kubernetes.io/proxy-body-size: 10G
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    k8s_interactivetools_use_ssl: true
    k8s_job_ttl_secs_after_finished: 600
    k8s_namespace: 'default'
    k8s_persistent_volume_claims: |-
      vol-nfs:/galaxy/server/database,vol-nfs/cvmfsclone:/cvmfs/cloud.galaxyproject.org
    k8s_pod_priority_class: 'galaxy-dev-job-priority'
    k8s_pull_policy: IfNotPresent
    k8s_supplemental_group_id: "101"
    k8s_use_service_account: true
    load: galaxy.jobs.runners.kubernetes:KubernetesJobRunner

I think that k8s_cleanup_job only applies to the k8s API Job object, not to the job's working directory on the filesystem.

pcm32 commented 1 year ago

NFS mount inside the Galaxy job handler looks like this:

10.43.9.166:/export/pvc-4372166d-a268-4763-832b-2a5c1ac66330 on /galaxy/server/database type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.43.9.166,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=10.43.9.166)

do we need some root squash for the job container, perhaps setting uid 101 on the exports (I'm guessing this would be done in the StorageClass settings)?
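Root squashing is configured on the NFS server's export rather than on the client mount or the StorageClass. A hypothetical /etc/exports entry that maps client root to uid/gid 101 (the path and subnet here are placeholders, not taken from this cluster):

```
# squash client root to the galaxy uid/gid; path and subnet are illustrative
/export  10.43.0.0/16(rw,root_squash,anonuid=101,anongid=101)
```

With this in place, files a tool writes as root would land on the share owned by uid 101, so the handler could delete them; whether the provisioner used here exposes these options is another question.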

nuwang commented 1 year ago

Yes, this sure looks like an issue with what the tool is doing. While most Biocontainers tools do run as root, this particular tool appears to be creating files without adding the fs_group_id as an owner. As a result, it's a root-only file which Galaxy can't clean up.

What if you reconfigure this particular tool in TPV, and force the user id to 101? Something like:

tools:
  .*scanpy.*:
    params:
      k8s_run_as_user_id: 101

pcm32 commented 1 year ago

Using k8s_run_as_user_id on the runner sorts the problem. I cannot do it in TPV yet because of the resubmissions issue; hopefully once that works and I can live only on TPV destinations, that will work too. Thanks!
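For anyone landing here later, the runner-level workaround amounts to adding the parameter to the k8s runner block in job_conf.yml. A sketch based on the config posted earlier in this thread (abridged; 101 is the galaxy user/group id in these images):

```yaml
runners:
  k8s:
    load: galaxy.jobs.runners.kubernetes:KubernetesJobRunner
    # Run all job pods as the galaxy user so tools cannot create
    # root-owned files that block job directory cleanup.
    k8s_run_as_user_id: "101"
    k8s_fs_group_id: "101"
    k8s_supplemental_group_id: "101"
```

Applying it at the runner level forces the uid for every job, whereas the TPV rule above would scope it to the matching tools only.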