Closed pcm32 closed 1 year ago
What is fontlist-v330.json? Do you know where on the filesystem it's located? This looks like an object-store configuration issue maybe? I don't recall having seen this before.
Not sure really... I do see this in the job directories:
galaxy@galaxy-dev-job-0-5cdc9d777f-v6q4q:/galaxy/server/database/jobs_directory/_cleared_contents/000/627/20221214-164444$ ls -ltr
total 72
-rw-r--r-- 1 galaxy root 0 Dec 14 16:44 galaxy_627.o
drwxr-sr-x 2 galaxy root 4096 Dec 14 16:44 inputs
-rwxr-xr-x 1 galaxy root 1109 Dec 14 16:44 tool_script.sh
-rwxr-xr-x 1 galaxy root 5900 Dec 14 16:44 galaxy_627.sh
-rw-r--r-- 1 root root 0 Dec 14 16:44 memory_statement.log
-rw-r--r-- 1 root root 2 Dec 14 16:44 __instrument_core_galaxy_slots
-rw-r--r-- 1 root root 5 Dec 14 16:44 __instrument_core_galaxy_memory_mb
-rw-r--r-- 1 root root 11 Dec 14 16:44 __instrument_core_epoch_start
drwxr-sr-x 2 root root 4096 Dec 14 16:44 configs
drwxr-sr-x 2 root root 4096 Dec 14 16:44 _working
drwxr-sr-x 2 root root 4096 Dec 14 16:44 _outputs
drwxr-sr-x 2 root root 4096 Dec 14 16:44 _configs
drwxr-sr-x 2 galaxy root 4096 Dec 14 16:44 outputs
drwxr-sr-x 2 galaxy root 4096 Dec 14 16:44 working
drwxr-sr-x 4 root root 4096 Dec 14 16:44 home
-rw-r--r-- 1 root root 4 Dec 14 16:44 galaxy_627.ec
-rw-r--r-- 1 root root 11 Dec 14 16:44 __instrument_core_epoch_end
drwxr-sr-x 2 root root 4096 Dec 14 16:44 tmp
-rw-r--r-- 1 galaxy root 62 Dec 14 16:44 galaxy_627.e
I'm guessing that files owned by root might cause problems for some process when it tries to delete them? Or is this well controlled?
I can't seem to find that fontlist-v330.json file in there, at least.
ahhh, found them:
rm: cannot remove '000/623/20221214-163640/home/.cache/matplotlib/fontlist-v330.json': Permission denied
rm: cannot remove '000/623/20221214-163640/home/.config/matplotlib': Permission denied
they are:
galaxy@galaxy-dev-job-0-5cdc9d777f-v6q4q:/galaxy/server/database/jobs_directory/_cleared_contents$ ls -l 000/623/20221214-163640/home/.cache/matplotlib/fontlist-v330.json
-rw-r--r-- 1 root root 24399 Dec 14 16:36 000/623/20221214-163640/home/.cache/matplotlib/fontlist-v330.json
So how does the tool get to write those files as root? Are we failing to pass the 101 user to the jobs for some reason?
So I'm guessing these might be specific to some of the scanpy tools that we use, but they shouldn't be written as root anyway, right?
Final runner setup looks like this in the job_conf.yml:
runners:
  k8s:
    k8s_cleanup_job: never
    k8s_extra_job_envs:
      HDF5_USE_FILE_LOCKING: "FALSE"
    k8s_fs_group_id: "101"
    k8s_galaxy_instance_id: 'galaxy-dev'
    k8s_interactivetools_ingress_annotations: |
      nginx.ingress.kubernetes.io/proxy-body-size: 10G
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    k8s_interactivetools_use_ssl: true
    k8s_job_ttl_secs_after_finished: 600
    k8s_namespace: 'default'
    k8s_persistent_volume_claims: |-
      vol-nfs:/galaxy/server/database,vol-nfs/cvmfsclone:/cvmfs/cloud.galaxyproject.org
    k8s_pod_priority_class: 'galaxy-dev-job-priority'
    k8s_pull_policy: IfNotPresent
    k8s_supplemental_group_id: "101"
    k8s_use_service_account: true
    load: galaxy.jobs.runners.kubernetes:KubernetesJobRunner
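For context, a sketch of the pod-level securityContext that group settings like these are intended to produce (field names are from the standard Kubernetes pod spec; the exact mapping performed by the runner is an assumption here):

```yaml
# Hypothetical securityContext corresponding to k8s_fs_group_id /
# k8s_supplemental_group_id above. fsGroup and supplementalGroups
# affect group ownership on mounted volumes, but they do NOT change
# the UID the container process runs as -- a tool image running as
# root can still create root-owned files like fontlist-v330.json.
securityContext:
  fsGroup: 101                # group applied to volume mounts
  supplementalGroups: [101]   # extra groups for the container process
  # runAsUser: 101            # only present if a run-as user is passed
```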
I think that k8s_cleanup_job only applies to the k8s API Job object, not to the working directory on the file system.
NFS mount inside the Galaxy job handler looks like this:
10.43.9.166:/export/pvc-4372166d-a268-4763-832b-2a5c1ac66330 on /galaxy/server/database type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.43.9.166,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=10.43.9.166)
Do we perhaps need some root squash for the job container, mapping root to 101 on the exports (I'm guessing this would be done in the StorageClass settings)?
Yes, this sure looks like an issue with what the tool is doing. While most biocontainers tools do run as root, this particular tool appears to be creating files without adding the fs_group_id as an owner. As a result, it's a root-only file which Galaxy can't clean up.
What if you reconfigure this particular tool in TPV, and force the user id to 101? Something like:
tools:
  .*scanpy.*:
    params:
      k8s_run_as_user_id: 101
Using k8s_run_as_user_id on the runner sorts the problem. I cannot do it in TPV yet because of the resubmissions issue; hopefully once that works and I can live on TPV destinations only, that approach will work. Thanks!
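For anyone landing here later, the runner-level workaround might look roughly like this in job_conf.yml (the environment name is made up for illustration; k8s_run_as_user_id is the option discussed above):

```yaml
# Sketch: forcing the job container UID at the destination level
# instead of per-tool in TPV. "k8s_default" is a hypothetical name.
execution:
  default: k8s_default
  environments:
    k8s_default:
      runner: k8s
      k8s_run_as_user_id: 101   # job processes run as UID 101, not root
```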
I see this error in the Galaxy logs every time there is a failed job:
I suspect this is stopping the setup from deleting the job directory.