galaxyproject / galaxy-helm

Minimal setup required to run Galaxy under Kubernetes
MIT License

Cannot Download output file from Galaxy #461

Closed: DuttaAnik closed this issue 2 months ago

DuttaAnik commented 3 months ago

Hello,

I have successfully installed Galaxy and deployed it to a Kubernetes cluster. I ran an analysis with Prokka; it completed and produced some output files, but I cannot download them. When I click the download button, I receive an error message. I have attached the error message. Could anybody please suggest how to solve this issue?

This is the values.yaml file:

galaxy:
  fullnameOverride: galaxy
  nameOverride: galaxy
  revisionHistoryLimit: 3
  images:
    galaxy:
      repository: quay.io/galaxyproject/galaxy-min
      tag: "23.1"  # Value must be quoted
      pullPolicy: IfNotPresent
  refdata:
    enabled: false
    type: cvmfs
    pvc:
      size: 10Gi
  cvmfs:
    deploy: false
    storageClassName: "{{ $.Release.Name }}-cvmfs"
  persistence:
    enabled: true
    name: galaxy-pvc
    annotations: {}
    storageClassName: XXX
    existingClaim: galaxy-kubernetes-pvc
    accessMode: ReadWriteMany
    size: 200Gi
    mountPath: /galaxy/server/database
  rabbitmq:
    enabled: true
    deploy: true
    persistence:
      storageClassName: XXX
  celery:
    concurrency: 1
  postgresql:
    enabled: true
    deploy: true
    galaxyDatabaseUser: postgres
    galaxyDatabasePassword: XXXXXX
  configs:
    galaxy.yml:
      galaxy:
        admin_users: XX@xx.com

This is the ingress.yaml file:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: galaxy.XX.XX.cloud
  namespace: galaxy
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-production
    kubernetes.io/tls-acme: "true"
    nginx.ingress.kubernetes.io/ssl-passthrough: "false"
    nginx.ingress.kubernetes.io/backend-protocol: HTTP
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
spec:
  ingressClassName: nginx
  rules:
    - host: galaxy.XX.XX.cloud
      http:
        paths:
          - backend:
              service:
                name: galaxy-nginx
                port:
                  number: 8000
            path: /
            pathType: Prefix
  tls:
    - hosts:
        - galaxy.XX.XX.cloud
      secretName: galaxy.XX.XX.cloud

It would be great if you could provide me with any guidance. TIA

nuwang commented 3 months ago

Can you show your configuration for the ingress section in values.yaml? By default, it's configured to run under the /galaxy prefix, but looking at your ingress, it looks like it's running under /. If so, you could try configuring the ingress values to match as follows:

ingress:
  path: /
  hosts:
    - host: ~
      paths:
        - path: "/"
        - path: "/training-material"

If that's not configured, the galaxy internal nginx server could be picking up the wrong prefix when serving datasets, hence the issue with datasets only.
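One way to see which prefix is actually in effect is to inspect the rendered ingress objects directly. This is a generic sketch; the galaxy namespace is taken from this thread, and the exact resource names depend on your release:

```shell
# List the Ingress resources in the galaxy namespace and show which
# path prefix each rule routes, to compare against the chart's default /galaxy.
kubectl get ingress -n galaxy
kubectl describe ingress -n galaxy
```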

DuttaAnik commented 3 months ago

I do not have an ingress section in the values.yaml file; I have pasted the whole values.yaml file above. Should I copy the ingress snippet you just shared into the values.yaml file, or should it go somewhere else?

nuwang commented 3 months ago

Aah, that probably explains it - I take it you created the ingress shown above by hand? You can get the Helm chart to create the ingress for you by setting the values appropriately. And yes, you can use the values shown above, but you will probably also need to include a tls section and set the host value and annotations to match your environment. Take a look at values.yaml for sample values.
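As a sketch of that workflow (the release name, namespace, repo alias, and values file name below are assumptions, not taken from this thread): once the ingress section lives in your values file, the chart renders the Ingress object itself on install or upgrade.

```shell
# Apply the chart with your values; the chart creates the Ingress
# when ingress.enabled is true in the values file.
helm upgrade --install galaxy cloudve/galaxy -n galaxy -f values.yaml

# Or preview the rendered manifests first without touching the cluster:
helm template galaxy cloudve/galaxy -n galaxy -f values.yaml | less
```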

DuttaAnik commented 3 months ago

Hi @nuwang, thanks for the helpful tips. Yes, I created the ingress file by hand. How can I get the Helm chart to create the ingress for me? Could you please explain? Sorry, I am new to this field, so maybe I am not understanding the simple things. I got the following section from the original values-org.yaml file from the Galaxy GitHub repo.

ingress:
    enabled: true
    ingressClassName: nginx
    annotations:
      nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
      nginx.ingress.kubernetes.io/proxy-buffering: "off"
      nginx.ingress.kubernetes.io/proxy-http-version: "1.1"
      nginx.ingress.kubernetes.io/connection-proxy-header: "Upgrade"
      nginx.ingress.kubernetes.io/proxy-body-size: "0"
    hosts:
      - host: ~
        paths:
          - path: "/galaxy/api/upload/resumable_upload"
    tls: []

nuwang commented 3 months ago

I think you've extracted the tusd section. The main section you need to modify is here: https://github.com/galaxyproject/galaxy-helm/blob/0afe341dcae427832ac232c8d87c842436daf971/galaxy/values.yaml#L270

And afterwards, you can update the tusd section to match.

Something like:

ingress:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-production
    kubernetes.io/tls-acme: "true"
    nginx.ingress.kubernetes.io/ssl-passthrough: "false"
    nginx.ingress.kubernetes.io/backend-protocol: HTTP
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
  path: /
  hosts:
    - host: galaxy.XX.XX.cloud
      paths:
        - path: "/"
        - path: "/training-material"
  tls:
    - hosts:
        - galaxy.XX.XX.cloud
      secretName: galaxy.XX.XX.cloud

DuttaAnik commented 3 months ago

Hello @nuwang Thanks for the suggestions. I have modified the values.yaml file like below:

galaxy:
  fullnameOverride: galaxy
  nameOverride: galaxy
  revisionHistoryLimit: 3
  images:
    galaxy:
      repository: quay.io/galaxyproject/galaxy-min
      tag: "23.1"  # Value must be quoted
      pullPolicy: IfNotPresent
  refdata:
    enabled: false
    type: cvmfs
    pvc:
      size: 10Gi
  cvmfs:
    deploy: false
    storageClassName: "{{ $.Release.Name }}-cvmfs"
  persistence:
    enabled: true
    name: galaxy-pvc
    annotations: {}
    storageClassName: freenas-nfs-csi
    existingClaim: galaxy-k3s-rdloc-galaxy-pvc
    accessMode: ReadWriteMany
    size: 200Gi
    mountPath: /galaxy/server/database
  rabbitmq:
    enabled: true
    deploy: true
    persistence:
      storageClassName: freenas-iscsi-csi
  celery:
    concurrency: 1
  postgresql:
    enabled: true
    deploy: true
    galaxyDatabaseUser: postgres
    galaxyDatabasePassword: password
  configs:
    galaxy.yml:
      galaxy:
        admin_users: f@ss.com

# The suggested ingress configuration to be added to the values.yaml:
ingress:
  enabled: true
  ingressClassName: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-production
    kubernetes.io/tls-acme: "true"
    nginx.ingress.kubernetes.io/ssl-passthrough: "false"
    nginx.ingress.kubernetes.io/backend-protocol: HTTP
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
  hosts:
    - host: galaxy.XX.XX.cloud
      paths:
        - path: /
          pathType: Prefix
        - path: /
          pathType: Prefix
  tls:
    - hosts:
        - galaxy.XX.XX.cloud
      secretName: galaxy.XX.XX.cloud

tusd:
  enabled: true  

  ingress:
    enabled: true
    annotations:

    hosts:
      - host: galaxy.XX.XX.cloud
        paths:
          - path: /
            pathType: Prefix

But it did not solve the Download output files issue. I am still getting the same error message that I showed.

Besides, there is another issue. After making these changes, I tried to upload a datafile in Galaxy and there is also an error message and the data file cannot be uploaded.

The error message is: /bin/bash: line 1: /galaxy/server/database/jobs_directory/000/42/galaxy_42.sh: No such file or directory

This is the description file of the galaxy job from Kubernetes:


Name:                 gxy-galaxy-k3s-rdloc-ttmkc-4xdkq
Namespace:            galaxy
Priority:             -1000
Priority Class Name:  galaxy-job-priority
Service Account:      default
Node:                 xxx/xxxx
Start Time:           Mon, 08 Apr 2024 15:43:48 +0200
Labels:               app.galaxyproject.org/destination=k8s
                      app.galaxyproject.org/handler=job_handler_0
                      app.galaxyproject.org/job_id=41
                      app.kubernetes.io/component=tool
                      app.kubernetes.io/instance=gxy-galaxy-k3s-rdloc
                      app.kubernetes.io/managed-by=galaxy
                      app.kubernetes.io/name=x__DATA_FETCH__x
                      app.kubernetes.io/part-of=galaxy
                      app.kubernetes.io/version=0.1.0
                      batch.kubernetes.io/controller-uid=xxxx
                      batch.kubernetes.io/job-name=gxy-galaxy-k3s-rdloc-ttmkc
                      controller-uid=44343ed2-1e80-45fc-9b43-d474daea53d2
                      job-name=gxy-galaxy-k3s-rdloc-ttmkc
Annotations:          app.galaxyproject.org/tool_id: __DATA_FETCH__
Status:               Failed
IP:                   XX
IPs:
  IP:           XX
Controlled By:  Job/gxy-galaxy-k3s-rdloc-ttmkc
Containers:
  k8s:
    Container ID:  containerd://XX
    Image:         quay.io/galaxyproject/galaxy-min:23.1
    Image ID:      quay.io/galaxyproject/galaxy-min@sha256:XXX
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
    Args:
      -c
      /galaxy/server/database/jobs_directory/000/41/galaxy_41.sh
    State:          Terminated
      Reason:       Error
      Exit Code:    127
      Started:      Mon, 08 Apr 2024 15:43:50 +0200
      Finished:     Mon, 08 Apr 2024 15:43:50 +0200
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  4080218931200m
    Requests:
      cpu:     1
      memory:  4080218931200m
    Environment:
      GALAXY_SLOTS:               1
      GALAXY_MEMORY_MB:           4080
      GALAXY_MEMORY_MB_PER_SLOT:  4080
    Mounts:
      /cvmfs/cloud.galaxyproject.org from galaxy-k3s-rdloc-galaxy-pvc (rw,path="cvmfsclone")
      /galaxy/server/database from galaxy-k3s-rdloc-galaxy-pvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rfcvn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  galaxy-k3s-rdloc-galaxy-pvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  galaxy-k3s-rdloc-galaxy-pvc
    ReadOnly:   false
  kube-api-access-rfcvn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 20s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 20s
Events:                      <none>

Can you please suggest how to overcome these issues? I am a bit lost as to where this error is originating from. Thank you very much.

DuttaAnik commented 3 months ago

Any update @nuwang?

almahmoud commented 3 months ago

Hey @DuttaAnik. In your latest values, you seem to have changed the paths under ingress.hosts[0].paths, but not ingress.path as shown in Nuwan's example. Also, you have two identical paths for /. There might be other issues that are hard to diagnose without access to your cluster, but something to try that might help is replacing:

ingress:
  enabled: true
  ingressClassName: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-production
    kubernetes.io/tls-acme: "true"
    nginx.ingress.kubernetes.io/ssl-passthrough: "false"
    nginx.ingress.kubernetes.io/backend-protocol: HTTP
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
  hosts:
    - host: galaxy.XX.XX.cloud
      paths:
        - path: /
          pathType: Prefix
        - path: /
          pathType: Prefix

with

ingress:
  path: /
  enabled: true
  ingressClassName: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-production
    kubernetes.io/tls-acme: "true"
    nginx.ingress.kubernetes.io/ssl-passthrough: "false"
    nginx.ingress.kubernetes.io/backend-protocol: HTTP
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
  hosts:
    - host: galaxy.XX.XX.cloud
      paths:
        - path: /
          pathType: Prefix

and let us know if that changes anything.

DuttaAnik commented 3 months ago

Hi @almahmoud and @nuwang, let me clear up some confusion here. I have two values files: the original values-org.yaml file from Galaxy, and the values.rdloc.k3s.yaml file that I am using for the specific Kubernetes cluster where Galaxy is deployed. Both are in a GitLab repo, which also contains a galaxy_ingress.yaml, pvc_galaxy.yaml, and secret_postgress.yaml file in its templates folder. It is all deployed through ArgoCD. I have updated the values.rdloc.k3s.yaml file like this:

galaxy:
  fullnameOverride: galaxy
  nameOverride: galaxy
  revisionHistoryLimit: 3
  images:
    galaxy:
      repository: quay.io/galaxyproject/galaxy-min
      tag: "23.1"  # Value must be quoted
      pullPolicy: IfNotPresent
  refdata:
    enabled: false
    type: cvmfs
    pvc:
      size: 10Gi
  cvmfs:
    deploy: false
    storageClassName: "{{ $.Release.Name }}-cvmfs"
  persistence:
    enabled: true
    name: galaxy-pvc
    annotations: {}
    storageClassName: freenas-nfs-csi
    existingClaim: galaxy-k3s-rdloc-galaxy-pvc
    accessMode: ReadWriteMany
    size: 200Gi
    mountPath: /galaxy/server/database
  rabbitmq:
    enabled: true
    deploy: true
    persistence:
      storageClassName: freenas-iscsi-csi
  celery:
    concurrency: 1
  postgresql:
    enabled: true
    deploy: true
    galaxyDatabaseUser: postgres
    galaxyDatabasePassword: xxxxx
  configs:
    galaxy.yml:
      galaxy:
        admin_users: aa@xxx.com

ingress:
  path: /
  enabled: true
  ingressClassName: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-production
    kubernetes.io/tls-acme: "true"
    nginx.ingress.kubernetes.io/ssl-passthrough: "false"
    nginx.ingress.kubernetes.io/backend-protocol: HTTP
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
  hosts:
    - host: galaxy.XXX.cloud
      paths:
        - path: /
          pathType: Prefix

  tls:
    - hosts:
        - galaxy.XXX.cloud
      secretName: galaxy.XXX.cloud

tusd:
  enabled: true  

  ingress:
    enabled: true
    annotations:

    hosts:
      - host: galaxy.XXX.cloud
        paths:
          - path: /
            pathType: Prefix

Then I synced it through ArgoCD. But I could not download any existing output file and I get the following error now: {"err_msg":"Could not get display data for dataset: [Errno 2] No such file or directory: ''","err_code":500001}

As I mentioned, I also have a separate galaxy_ingress.yaml file, which I pasted in the first message. So, instead of making changes in the values.rdloc.k3s.yaml file, I copied the ingress and tusd sections into that ingress.yaml file. But there was still no change in the error message, and I still cannot download any output files.

Could you please provide any solution to this issue?

Also, I cannot upload any data files, which I could do before. Now, I get the error /bin/bash: line 1: /galaxy/server/database/jobs_directory/000/42/galaxy_42.sh: No such file or directory

or the following error:

Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/jobs/runners/__init__.py", line 197, in put
    queue_job = job_wrapper.enqueue()
  File "/galaxy/server/lib/galaxy/jobs/__init__.py", line 1594, in enqueue
    self._set_object_store_ids(job)
  File "/galaxy/server/lib/galaxy/jobs/__init__.py", line 1612, in _set_object_store_ids
    self._set_object_store_ids_full(job)
  File "/galaxy/server/lib/galaxy/jobs/__init__.py", line 1702, in _set_object_store_ids_full
    self._setup_working_directory(job=job)
  File "/galaxy/server/lib/galaxy/jobs/__init__.py", line 1281, in _setup_working_directory
    working_directory = self._create_working_directory(job)
  File "/galaxy/server/lib/galaxy/jobs/__init__.py", line 1328, in _create_working_directory
    return create_working_directory_for_job(self.object_store, job)
  File "/galaxy/server/lib/galaxy/job_execution/setup.py", line 278, in create_working_directory_for_job
    object_store.create(job, base_dir="job_work", dir_only=True, obj_dir=True)
  File "/galaxy/server/lib/galaxy/objectstore/__init__.py", line 422, in create
    return self._invoke("create", obj, **kwargs)
  File "/galaxy/server/lib/galaxy/objectstore/__init__.py", line 413, in _invoke
    return self.__getattribute__(f"_{delegate}")(obj=obj, **kwargs)
  File "/galaxy/server/lib/galaxy/objectstore/__init__.py", line 783, in _create
    safe_makedirs(dir)
  File "/galaxy/server/lib/galaxy/util/path/__init__.py", line 138, in safe_makedirs
    makedirs(path)
  File "/usr/local/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/galaxy/server/database/jobs_directory/000/54'

ksuderman commented 3 months ago

What chart are you using to install Galaxy? The values you provided do not match the values for the galaxy-helm chart. For example, there is a root-level galaxy: element, and the galaxy-helm chart expects persistence.storageClass but you have persistence.storageClassName. So you may want to check the PVC and make sure it is using the storage class it is supposed to be using. You also mention having a galaxy_ingress.yaml file in the templates folder; if you set ingress.enabled = true in the values file, that extra template may cause conflicts when the Helm chart tries to set up an ingress. You should use one or the other, not both.
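For illustration, the persistence block with the chart's expected key name would look something like this. This is a sketch built from the values already posted in this thread; the only substantive change is storageClass in place of storageClassName:

```yaml
galaxy:
  persistence:
    enabled: true
    # galaxy-helm expects storageClass, not storageClassName
    storageClass: freenas-nfs-csi
    existingClaim: galaxy-k3s-rdloc-galaxy-pvc
    accessMode: ReadWriteMany
    size: 200Gi
    mountPath: /galaxy/server/database
```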

Also, what version of Kubernetes are you using? I saw similar errors with Galaxy 23.x on later versions of Kubernetes due to an incompatibility with the version of py-kube used by Galaxy. That was fixed in 24.0 so you may want to try that Galaxy version.

Finally, I don't think the path in your tusd section is correct. It should likely be something like:

tusd:
  hosts:
    - host: galaxy.XXX.cloud
      paths:
        - path: /api/upload/resumable_upload
          pathType: Prefix

If that doesn't solve your problem, can you kubectl exec into the job pod and check the owner and permissions of the files in /galaxy/server/database/jobs_directory? Those should be owned by the galaxy user, and Galaxy should be running as the galaxy user.
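A concrete form of that check (the pod name is a placeholder to fill in from kubectl get pods, and the namespace is assumed from this thread):

```shell
# Find the job pod, then check who owns the jobs directory inside it.
kubectl get pods -n galaxy
kubectl exec -n galaxy <job-pod-name> -- ls -ld /galaxy/server/database/jobs_directory
# The process should also be running as the galaxy user, not root:
kubectl exec -n galaxy <job-pod-name> -- id
```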

DuttaAnik commented 2 months ago

Hi @ksuderman thank you for the reply and sorry for the delayed response. The Chart.yaml I am using to install Galaxy looks like this:

apiVersion: v2
name: galaxy
type: application
version: 1.0.0
dependencies:
  - name: galaxy
    repository: https://github.com/CloudVE/helm-charts/raw/master
    version: 5.9.0

The Kubernetes version is: v1.27.7+k3s2

The YAML for the PVC that I created (pvc-galaxy-k3s-rdloc-galaxy-pvc.yaml ) looks like this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app.kubernetes.io/instance: galaxy-k3s-rdloc
    app.kubernetes.io/name: galaxy
  name: galaxy-k3s-rdloc-galaxy-pvc
  namespace: galaxy
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 200Gi
  storageClassName: freenas-nfs-csi

With the values.yaml file that I pasted in the first message, everything regarding file upload was working smoothly, except downloading output files. Then I increased the PVC size from 100Gi to 400Gi, but later wanted to reduce it to 200Gi. Once I had increased the PVC to 400Gi, I could not shrink it, so I deleted that PVC and created a new one with 200Gi. I checked with kubectl that the new PVC is there under the galaxy namespace and the old one was removed. Then I deployed again through ArgoCD, and since then the upload does not work anymore. I have also removed ingress.enabled=true from the values file, and it is still not working.

I executed into the job pod through kubectl and listed the directories. It looks like the database/ directory on the PVC is owned by the root user.

-rwxr-xr-x  1 galaxy galaxy   158 Oct 18 15:23 check_model.sh
-rw-r--r--  1 galaxy galaxy   871 Oct 18 15:23 CITATION
drwxr-xr-x  7 galaxy galaxy  4096 Oct 18 15:23 client
-rw-r--r--  1 galaxy galaxy   261 Oct 18 15:23 CODE_OF_CONDUCT.md
drwxr-x---  1 galaxy galaxy  4096 Apr 15 12:17 config
drwxr-xr-x  2 galaxy galaxy  4096 Oct 18 15:23 contrib
-rw-r--r--  1 galaxy galaxy  8997 Oct 18 15:23 CONTRIBUTING.md
-rw-r--r--  1 galaxy galaxy  8341 Oct 18 15:23 CONTRIBUTORS.md
drwxr-xr-x  2 galaxy galaxy  4096 Oct 18 15:23 cron
drwxrwxrwx 13 root   root      14 Apr  9 09:19 database

Is this the reason for the upload or download not working then?

ksuderman commented 2 months ago

I see you are still using 23.1; could you try with 24.0?

galaxy:
  image:
    tag: "24.0"

You should be able to helm upgrade an existing installation:


helm upgrade galaxy -n galaxy <your chart> --set galaxy.image.tag="24.0"

DuttaAnik commented 2 months ago

Thank you very much for your replies. The problem has been resolved.

ksuderman commented 2 months ago

Out of curiosity and to help us better assist users in the future, what exactly resolved your problem?

DuttaAnik commented 2 months ago

So, I deleted mountPath: /galaxy/server/database from the values.yaml file. Then I created a new PVC, kept only storageClass: "freenas-iscsi-csi", and deleted storageClassName.

ksuderman commented 2 months ago

Thanks for the follow up.