galaxyproject / galaxy-helm

Minimal setup required to run Galaxy under Kubernetes
MIT License

Cannot delete files/folders in the galaxy database owned by root users #478

Closed DuttaAnik closed 5 days ago

DuttaAnik commented 1 month ago

Hello, I have a problem similar to the one mentioned in a previous post, #398. I am currently inside the galaxy-job pod and have gone to the database directory, which looks like this:

galaxy/server/database$ ls
cache             config      deps        objects         shed_tools  tool-data  tool_search_index
celery-beat-schedule  cvmfsclone  jobs_directory  object_store_cache  tmp     tools

I am trying to build a Kraken database and do some other analyses after launching Galaxy locally, and all the output files are being deposited in the objects folder. But when I look at the permissions of these folders, they are owned by root:

/galaxy/server/database$ ls -la objects/
total 42
drwxr-xr-x 15 galaxy root 15 May 29 12:09 .
drwxrwxrwx 14 root   root 15 May 24 13:28 ..
drwxr-xr-x  3 galaxy root  3 May 29 12:09 0
drwxrwxrwt  5 galaxy root  5 May 29 12:09 1
drwxr-xr-x  4 galaxy root  4 May 29 12:09 2
drwxrwxrwt  6 galaxy root  6 May 29 12:09 4
drwxr-xr-x  3 galaxy root  3 May 29 12:09 5
drwxr-xr-x  3 galaxy root  3 May 29 12:09 6
drwxrwxrwt  3 galaxy root  3 May 29 12:00 7
drwxrwxrwt  6 galaxy root  6 May 29 12:09 8
drwxr-xr-x  4 galaxy root  4 May 29 12:09 b
drwxrwxrwt  4 galaxy root  4 May 29 12:09 c
drwxrwxrwt  3 galaxy root  3 May 29 11:59 e
drwxrwxrwt  4 galaxy root  4 May 29 12:09 f
drwxr-xr-x  3 galaxy root  3 May 29 12:08 _metadata_files

I could delete some of these folders, but not the ones containing files that were created by someone else and are owned by root; attempting to delete those results in permission-denied errors.

Can you please suggest how a regular Galaxy user like me can delete files/folders that are owned by root? Or is there a way to change the permissions on these folders?

ksuderman commented 1 month ago

See if the galaxy-maintenance Docker image can do what you need. It includes sudo, unlike the base Galaxy images, so you should be able to sudo rm anything you want (or sudo chown).
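
For example, once you have a shell in a container running the galaxy-maintenance image with the Galaxy volume mounted, something along these lines should work (the mount path, ownership, and directory name below are illustrative, not exact instructions for your deployment):

# take ownership of everything under the object store
sudo chown -R galaxy:galaxy /galaxy/server/database/objects

# or remove a specific directory you know is safe to delete
sudo rm -rf /galaxy/server/database/objects/7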

However, simply deleting files may lead to problems, as there will still be entries for those files in the database. You may want to tweak the existing maintenance cronjob, but that may have the side effect of deleting objects you want to keep. If you know which jobs/tools created these files, you could try deleting/purging them in the Galaxy UI and then running the maintenance.sh script.

The galaxy-maintenance image also includes gxadmin, which may be useful. gxadmin might be present in the galaxy-min image, but I don't think the galaxy user has sufficient permissions to use it fully.
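
For instance, from a shell in that image something like this would list the jobs currently in the queue (the exact set of queries depends on your gxadmin version, and it needs the usual PG* environment variables set so it can reach the Galaxy database):

# show queued/running jobs with their users and destinations
gxadmin query queue-detail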

DuttaAnik commented 1 month ago

Hi @ksuderman, thank you very much for the suggestion. It is a bit high-level for me to follow, though. At the moment, I do not have any of the files you mentioned; I only have a values.yaml file that I created and then deployed. Can I make any changes in that values file, or can I create a pod from the galaxy-maintenance image?

ksuderman commented 1 month ago

I think maybe we should back up a little bit. What is the exact problem you are encountering? Why do you want to delete those files/directories? I am not familiar with Kraken, but any output it produces should be available in a Galaxy history. You almost never need to manually delete files from the objects directory.

DuttaAnik commented 1 month ago

The problem is that I am trying to build a database for Kraken. The data manager runs for some time, and I can see some of the output files related to the database build being deposited in the objects folder, but it does not finish successfully; it throws a permission-denied error. So I am confused about why this is happening, since I can write to the PVC. I am thinking of mounting the PVC on a VM and then chmod-ing everything in the objects folder, although I do not know whether the problem is related to the files being downloaded to build the database.
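
Something like this is what I had in mind, assuming the volume ends up mounted at /mnt/galaxy on the VM (the mount point and the galaxy user/group are just my guesses):

# reclaim ownership and make everything writable for the galaxy user
sudo chown -R galaxy:galaxy /mnt/galaxy/objects
sudo chmod -R u+rwX /mnt/galaxy/objects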

ksuderman commented 1 month ago

Thanks for the follow-up. You are encountering issue #476, which was introduced in a recent update. I am working on a fix, but it is not quite ready yet. Is your data shareable? I have only encountered the problem with SnpEff, so having another test case to validate my fix would be helpful. If not, I understand. I should be able to get a patch ready for you by the end of the day.

DuttaAnik commented 1 month ago

Hi @ksuderman, sorry for the delay. To be honest, I do not have any data. I was trying to create a database using the data manager tool, and it was fetching data, most probably from this site: June 2022. Does this help you?

ksuderman commented 1 month ago

Hi @DuttaAnik, I found your post on the Galaxy Help forum. That should give me enough to go on.

Sorry for not having a patch ready for you by now, but I've run into problems remounting some of the volumes after the patch is applied. It works if the patch is applied during installation and only fails when patching an existing install. I'll create a PR with the fix, but you may have to re-install Galaxy to get Kraken working.

DuttaAnik commented 1 month ago

Hi @ksuderman, thank you very much for the reply. Would it work to just update the Galaxy version number in my values.yaml file, since I am running Galaxy through the Helm chart? I previously posted the values file in this post: #461

ksuderman commented 1 month ago

@DuttaAnik No, this time the problem isn't with the Galaxy version but with the values.yml file in the Helm chart. The chart will be fixed when PR #479 is merged (should be soon). If you want to try reinstalling before that PR is merged, I've created a gist with instructions for how to work around the problem with the current chart.
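
Roughly, a reinstall from the command line would look something like this (the release name, namespace, repo alias, and the name of the extra values file are placeholders; the actual snippet is in the gist):

helm repo update
# layer the workaround snippet from the gist on top of your own values
helm upgrade --install galaxy galaxyproject/galaxy -n galaxy \
    -f values.yaml -f gist-workaround.yaml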

DuttaAnik commented 1 month ago

Hi @ksuderman, I can wait until the merge has taken place. The last time I tried something with the NFS volume, it crashed my whole Galaxy system; I deploy through ArgoCD and do not do manual Helm deployments from the command line.

ksuderman commented 1 month ago

We are still testing a proper fix. While I am not familiar with ArgoCD, I took a quick look at their docs, and you should be able to take the YAML snippet from the gist above, put it in your repo alongside your other values.yml file, and deploy a new Galaxy instance.

DuttaAnik commented 3 weeks ago

Hi @ksuderman, I have updated the values.yaml file with the YAML snippet that you suggested. Then I reinstalled Galaxy and tried some usual tasks like uploading data, but I received an error message: Kubernetes failed to create job. I looked at the log files of the Galaxy job pod, which show something like this:

galaxy.util.task DEBUG 2024-06-13 07:58:48,435 [pN:job_handler_0,p:8,tN:HistoryAuditTablePruneTask] Executed periodic task HistoryAuditTablePruneTask (3.958 ms)
galaxy.jobs.handler DEBUG 2024-06-13 08:03:15,091 [pN:job_handler_0,p:8,tN:JobHandlerQueue.monitor_thread] Grabbed Job(s): 95
tpv.core.entities DEBUG 2024-06-13 08:03:15,112 [pN:job_handler_0,p:8,tN:JobHandlerQueue.monitor_thread] Ranking destinations: [runner=k8s, dest_name=k8s, min_accepted_cores=None, min_accepted_mem=None, min_accepted_gpus=None, max_accepted_cores=None, max_accepted_mem=None, max_accepted_gpus=None, tpv_dest_tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=docker, type=TagType.ACCEPT>], handler_tags=None<class 'tpv.core.entities.Destination'> id=k8s, abstract=False, cores=None, mem=None, gpus=None, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=None, params={'docker_enabled': 'true', 'limits_cpu': '{cores}', 'limits_memory': '{mem}Gi', 'requests_cpu': '{cores}', 'requests_memory': '{mem}Gi'}, resubmit=None, tags=<class 'tpv.core.entities.TagSetManager'> tags=[], rank=, inherits=None, context=None, rules={}] for entity: <class 'tpv.core.entities.Tool'> id=force_default_container_for_built_in_tools, Rule: force_default_container_for_built_in_tools, abstract=False, cores=1, mem=cores * 3.8, gpus=None, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[], params={'container_monitor': False, 'docker_default_container_id': 'quay.io/galaxyproject/galaxy-min:23.1', 'tmp_dir': 'true', 'docker_container_id_override': 'quay.io/galaxyproject/galaxy-min:23.1'}, resubmit={}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=local, type=TagType.REJECT>, <Tag: name=scheduling, value=offline, type=TagType.REJECT>], rank=helpers.we, inherits=None, context={}, rules={} using custom function
galaxy.jobs.mapper DEBUG 2024-06-13 08:03:15,113 [pN:job_handler_0,p:8,tN:JobHandlerQueue.monitor_thread] (95) Mapped job to destination id: k8s
galaxy.jobs.handler DEBUG 2024-06-13 08:03:15,116 [pN:job_handler_0,p:8,tN:JobHandlerQueue.monitor_thread] (95) Dispatching to k8s runner
galaxy.jobs DEBUG 2024-06-13 08:03:15,125 [pN:job_handler_0,p:8,tN:JobHandlerQueue.monitor_thread] (95) Persisting job destination (destination id: k8s)
galaxy.jobs DEBUG 2024-06-13 08:03:15,136 [pN:job_handler_0,p:8,tN:JobHandlerQueue.monitor_thread] (95) Working directory for job is: /galaxy/server/database/jobs_directory/000/95
galaxy.jobs.runners DEBUG 2024-06-13 08:03:15,141 [pN:job_handler_0,p:8,tN:JobHandlerQueue.monitor_thread] Job [95] queued (25.671 ms)
galaxy.jobs.handler INFO 2024-06-13 08:03:15,143 [pN:job_handler_0,p:8,tN:JobHandlerQueue.monitor_thread] (95) Job dispatched
galaxy.jobs.runners.kubernetes DEBUG 2024-06-13 08:03:15,146 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-0] Starting queue_job for job 95
galaxy.jobs DEBUG 2024-06-13 08:03:15,237 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-0] Job wrapper for Job [95] prepared (77.327 ms)
galaxy.jobs.command_factory INFO 2024-06-13 08:03:15,250 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-0] Built script [/galaxy/server/database/jobs_directory/000/95/tool_script.sh] for tool command [python '/galaxy/server/lib/galaxy/tools/data_fetch.py' --galaxy-root '/galaxy/server' --datatypes-registry '/galaxy/server/database/jobs_directory/000/95/registry.xml' --request-version '1' --request '/galaxy/server/database/jobs_directory/000/95/configs/tmpv7ag2tc7']
galaxy.jobs.runners DEBUG 2024-06-13 08:03:15,257 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-0] (95) command is: mkdir -p working outputs configs
if [ -d _working ]; then
    rm -rf working/ outputs/ configs/; cp -R _working working; cp -R _outputs outputs; cp -R _configs configs
else
    cp -R working _working; cp -R outputs _outputs; cp -R configs _configs
fi
cd working; __out="${TMPDIR:-.}/out.$$" __err="${TMPDIR:-.}/err.$$"
mkfifo "$__out" "$__err"
trap 'rm -f "$__out" "$__err"' EXIT
tee -a '../outputs/tool_stdout' < "$__out" &
tee -a '../outputs/tool_stderr' < "$__err" >&2 & /bin/bash /galaxy/server/database/jobs_directory/000/95/tool_script.sh > "$__out" 2> "$__err"; return_code=$?; echo $return_code > /galaxy/server/database/jobs_directory/000/95/galaxy_95.ec; sh -c "exit $return_code"
galaxy.jobs.runners.kubernetes ERROR 2024-06-13 08:03:15,278 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-0] Kubernetes failed to create job, HTTP exception encountered
Traceback (most recent call last):
  File "/galaxy/server/.venv/lib/python3.10/site-packages/pykube/http.py", line 99, in raise_for_status
    resp.raise_for_status()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 422 Client Error: Unprocessable Entity for url: https://10.43.0.1:443/apis/batch/v1/namespaces/galaxy/jobs

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 194, in queue_job
    job.create()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/pykube/objects.py", line 97, in create
    self.api.raise_for_status(r)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/pykube/http.py", line 106, in raise_for_status
    raise HTTPError(resp.status_code, payload["message"])
pykube.exceptions.HTTPError: Job.batch "gxy-galaxy-k3s-rdloc-rc5f7" is invalid: [spec.template.spec.volumes[1].name: Invalid value: "None": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), spec.template.spec.containers[0].volumeMounts[0].name: Not found: "None", spec.template.spec.containers[0].volumeMounts[1].name: Not found: "None"]
galaxy.jobs.runners.kubernetes ERROR 2024-06-13 08:03:15,281 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-0] (95/None) User killed running job, but error encountered during termination: job name must not be empty
Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 901, in stop_job
    job_to_delete = find_job_object_by_name(
  File "/galaxy/server/lib/galaxy/jobs/runners/util/pykube_util.py", line 88, in find_job_object_by_name
    raise ValueError("job name must not be empty")
ValueError: job name must not be empty

Did I do something wrong while adding the YAML snippet from the gist? Do you have any suggestions on how to solve this?

ksuderman commented 3 weeks ago

#477 was just merged, which should fix the problem. Can you revert to your original values and try again with the latest chart (v5.14.2)?
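
If you were using the Helm CLI directly, that would be roughly (release name, namespace, and repo alias are placeholders):

helm repo update
helm upgrade --install galaxy galaxyproject/galaxy -n galaxy \
    --version 5.14.2 -f values.yaml

With ArgoCD, pointing your application at chart version 5.14.2 and syncing should have the same effect.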