galaxyproject / galaxy-helm

Minimal setup required to run Galaxy under Kubernetes
MIT License
attacher.MountDevice failed to create newCsiDriverClient: driver name not found in the list of registered CSI drivers #496

Open Truongphikt opened 2 months ago

Truongphikt commented 2 months ago

Hi galaxy-helm team,

My goal is to deploy Galaxy on GKE. After working through #493, I ran into a problem that may be related to CSI drivers.


All storage looks fine, but some workloads are stuck in Pending and others show the error "Does not have minimum availability" indefinitely.

| workloads | storage |
| ------------- | ------------ |
| image | image |

In every failing workload, the pod is stuck in the init stage.

image

Looking into a failing pod

After that, I described one of the failing pods:

```
$ kubectl describe pods my-galaxy-release-web-8df9fc56b-wfh8q
Name:             my-galaxy-release-web-8df9fc56b-wfh8q
Namespace:        default
Priority:         0
Service Account:  my-galaxy-release
Node:             gke-galaxy-cluster-default-pool-aada09c7-ktvt/
Start Time:       Sun, 11 Aug 2024 03:25:36 +0000
Labels:           pod-template-hash=8df9fc56b
Annotations:      checksum/galaxy_conf: 28bf33924622f4c62fc23e4cb0579231c491df17f7bc086b532a6a9fc0648859
                  checksum/galaxy_extras: 1cb6d207de441e5ed402124756fa900166791afb5986fd986c6056baa44e26ca
                  checksum/galaxy_rules: 4e8361a62fb4b616e92fadf2fb8be8147d402a66bbfdc760913519f96e2cbe5c
                  2024-08-11T03:25:08+0000 Inf
Status:           Pending
IP:
IPs:
Controlled By:    ReplicaSet/my-galaxy-release-web-8df9fc56b
Init Containers:
  galaxy-wait-db:
    Container ID:
    Image:
    Image ID:
    Port:
    Host Port:
    Args:
      sh
      -c
      until [ -f /galaxy/server/config/mutable/db_init_done_1 ]; do echo "waiting for DB initialization"; sleep 1; done; until timeout 1 bash -c "echo > /dev/tcp/my-galaxy-release-rabbitmq-server/5672"; do echo "waiting for rabbitmq service"; sleep 1; done; until [ -f /galaxy/server/config/mutable/init_mounts_done_1 ]; do echo "waiting for copying onto NFS"; sleep 1; done; until [ -f /galaxy/server/config/mutable/init_clone_done_1 ]; do echo "waiting for refdata copying"; sleep 1; done; echo "Initialization waits complete"; sleep 0;
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
    Mounts:
      /galaxy/server/config/mutable/ from galaxy-data (rw,path="config")
      /var/run/secrets/ from kube-api-access-ptpfx (ro)
Containers:
  galaxy-web:
    Container ID:
    Image:
    Image ID:
    Port:          8080/TCP
    Host Port:     0/TCP
    Args:
      sh
      -c
      /galaxy/server/.venv/bin/gunicorn "galaxy.webapps.galaxy.fast_factory:factory()" --timeout 300 --pythonpath /galaxy/server/lib -k galaxy.webapps.galaxy.workers.Worker -b --workers=1 --config python:galaxy.web_stack.gunicorn_config --preload
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                3
      ephemeral-storage:  10Gi
      memory:             7G
    Requests:
      cpu:                100m
      ephemeral-storage:  1Gi
      memory:             1G
    Liveness:   http-get http://:8080/galaxy/api/version delay=0s timeout=30s period=10s #success=1 #failure=30
    Readiness:  http-get http://:8080/galaxy/api/version delay=0s timeout=12s period=10s #success=1 #failure=12
    Startup:    http-get http://:8080/galaxy/api/version delay=30s timeout=80s period=5s #success=1 #failure=80
    Environment:
      GALAXY_DB_USER_PASSWORD: Optional: false
      GALAXY_CONFIG_OVERRIDE_DATABASE_CONNECTION: postgresql://galaxydbuser:$(GALAXY_DB_USER_PASSWORD)@galaxy-my-galaxy-release-postgres/galaxy?sslmode=require
      GALAXY_CONFIG_OVERRIDE_ID_SECRET: Optional: false
      PYTHONPATH: /galaxy/server/lib
      GALAXY_CONFIG_FILE: /galaxy/server/config/galaxy.yml
      GALAXY_RABBITMQ_USERNAME: Optional: false
      GALAXY_RABBITMQ_PASSWORD: Optional: false
      GALAXY_CONFIG_OVERRIDE_AMQP_INTERNAL_CONNECTION: amqp://$(GALAXY_RABBITMQ_USERNAME):$(GALAXY_RABBITMQ_PASSWORD)@my-galaxy-release-rabbitmq-server:5672
    Mounts:
      /cvmfs/ from galaxy-data (rw,path="cvmfsclone")
      /cvmfs/ from refdata-gxy (rw,path="")
      /galaxy/server/config/build_sites.yml from galaxy-conf-files (rw,path="build_sites.yml")
      /galaxy/server/config/container_resolvers_conf.xml from galaxy-conf-files (rw,path="container_resolvers_conf.xml")
      /galaxy/server/config/galaxy.yml from galaxy-conf-files (rw,path="galaxy.yml")
      /galaxy/server/config/integrated_tool_panel.xml from galaxy-conf-files (rw,path="integrated_tool_panel.xml")
      /galaxy/server/config/job_conf.yml from galaxy-conf-files (rw,path="job_conf.yml")
      /galaxy/server/config/mutable/ from galaxy-data (rw,path="config")
      /galaxy/server/config/sanitize_allowlist.txt from galaxy-conf-files (rw,path="sanitize_allowlist.txt")
      /galaxy/server/config/tool_conf.xml from galaxy-conf-files (rw,path="tool_conf.xml")
      /galaxy/server/config/workflow_schedulers_conf.xml from galaxy-conf-files (rw,path="workflow_schedulers_conf.xml")
      /galaxy/server/database from galaxy-data (rw)
      /galaxy/server/lib/galaxy/jobs/rules/tpv_rules_local.yml from galaxy-job-rules (rw,path="tpv_rules_local.yml")
      /galaxy/server/static/welcome.html from extra-welcomehtml-ee3410714399628f55d8b0fbdbcc0b1ab19c965ad38e8 (rw,path="welcome.html")
      /var/run/secrets/ from kube-api-access-ptpfx (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False
  Initialized                 False
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  galaxy-conf-files:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      my-galaxy-release-configs
    Optional:  false
  extra-welcomehtml-ee3410714399628f55d8b0fbdbcc0b1ab19c965ad38e8:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      my-galaxy-release-extra-welcomehtml-ee3410714399628f55d8b0fbdbcc0b1ab19c965ad38e8
    Optional:  false
  galaxy-job-rules:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      my-galaxy-release-job-rules
    Optional:  false
  galaxy-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  my-galaxy-release-galaxy-pvc
    ReadOnly:   false
  refdata-gxy:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  my-galaxy-release-refdata-gxy-pvc
    ReadOnly:   false
  kube-api-access-ptpfx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:
Tolerations:                 op=Exists for 300s
                             op=Exists for 300s
Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Normal   NotTriggerScaleUp  55m                cluster-autoscaler  pod didn't trigger scale-up:
  Warning  FailedScheduling   55m (x5 over 55m)  default-scheduler   0/3 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  Normal   Scheduled          55m                default-scheduler   Successfully assigned default/my-galaxy-release-web-8df9fc56b-wfh8q to gke-galaxy-cluster-default-pool-aada09c7-ktvt
  Warning  FailedMount        21s (x3 over 55m)  kubelet             MountVolume.MountDevice failed for volume "pvc-8179672f-3188-4fde-a9e7-d22c9e2719d7" : attacher.MountDevice failed to create newCsiDriverClient: driver name not found in the list of registered CSI drivers
```

As far as I can see, the actual error is `attacher.MountDevice failed to create newCsiDriverClient: driver name not found in the list of registered CSI drivers`. I also checked, and no CSI drivers are installed in my cluster.
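One way to confirm this kind of `FailedMount` is to compare the CSI driver recorded on the volume with the drivers the kubelet has registered on the node. The sketch below is only an illustration: the function names are made up for this example, and it assumes the volume is CSI-backed and that `kubectl` points at the affected cluster.

```shell
# Print the CSI driver that a PersistentVolume was provisioned with.
pv_csi_driver() {
  kubectl get pv "$1" -o jsonpath='{.spec.csi.driver}'
}

# Print the CSI drivers registered on a node (space-separated).
# CSINode objects carry the same name as the node they describe.
node_csi_drivers() {
  kubectl get csinode "$1" -o jsonpath='{.spec.drivers[*].name}'
}

# Hypothetical helper: succeed if the PV's driver is registered on the node.
check_pv_driver_on_node() {
  local pv="$1" node="$2" driver registered name
  driver="$(pv_csi_driver "$pv")"
  registered="$(node_csi_drivers "$node")"
  for name in $registered; do
    if [ "$name" = "$driver" ]; then
      echo "driver '$driver' is registered on $node"
      return 0
    fi
  done
  echo "driver '$driver' is NOT registered on $node (registered: ${registered:-none})"
  return 1
}
```

With the names from the output above, this would be called as `check_pv_driver_on_node pvc-8179672f-3188-4fde-a9e7-d22c9e2719d7 gke-galaxy-cluster-default-pool-aada09c7-ktvt`. If the driver is reported as not registered, the node plugin for that driver is probably not running on that node.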

My question

So my question is: based on the information above, is the error caused by missing CSI drivers? If so, how do I install the CSI drivers "properly"? Thank you so much for the amazing platform and enthusiastic support.

ksuderman commented 2 months ago

The only thing you should have to do to install the CVMFS CSI driver is set the following in your values.yaml file:

  enabled: true
  deploy: true

I assume you have that as it looks like CVMFS has been deployed.

However, I see that none of the cvmfscsi-nodeplugin pods are in the Ready state, and I suspect it is a problem with the name of the alien cache. Can you look in the logs for the cvmfscsi-nodeplugin and see what it is complaining about? If the logs mention that they can't find the alien cache, you can add the following to your values.yaml file:
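For reference, the node plugin logs can also be pulled from the command line rather than the GKE console. This is a rough sketch: the helper names are invented for this example, and the namespace and name fragment passed in are assumptions about where the chart deployed the plugin.

```shell
# List pods in a namespace whose name contains a fragment (output like "pod/<name>").
pods_matching() {
  kubectl get pods -n "$1" -o name | grep -- "$2"
}

# Print the last 50 log lines of every container in each matching pod.
logs_for_matching_pods() {
  local ns="$1" fragment="$2" pod
  for pod in $(pods_matching "$ns" "$fragment"); do
    echo "=== $pod ==="
    kubectl logs -n "$ns" "$pod" --all-containers --tail=50
  done
}
```

For example, `logs_for_matching_pods default nodeplugin` would cover a release installed in the `default` namespace. If nothing is printed, no pod name matched the fragment.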

  enabled: true
  deploy: true
          name: cvmfs-alien-cache

See: #437

Truongphikt commented 2 months ago

@ksuderman Thanks for the information. I checked the cvmfscsi-nodeplugin workload, but there are no pods running there, so I can't provide its logs.

image image

ksuderman commented 2 months ago

Is there anything when you click the container logs link? Since the status shows as Ok, I assume the startupProbe and livenessProbe are passing and only the readinessProbe fails, resulting in 0/3 pods being ready.
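If the console shows nothing, the per-container ready state and the pod's events can be read with `kubectl`. Probe failures are normally recorded as events even when the container logs are empty. The function names below are illustrative only, and the namespace and pod name are whatever your deployment uses.

```shell
# Print name, ready flag and restart count for each container in a pod.
container_readiness() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\t"}{.restartCount}{"\n"}{end}'
}

# Print the events recorded for a pod, sorted by last timestamp.
pod_events() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.name=$2" \
    --sort-by=.lastTimestamp
}
```

Both take the namespace first and the pod name second, for example `pod_events default <pod-name>`.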

Did you try setting the alien cache name?

Truongphikt commented 2 months ago

After updating values.yaml as you recommended and re-deploying Galaxy, the error unfortunately remains. I checked the container logs and they are empty.

| Container logs | Audit logs |
| ------------ | ---------- |
| image | image |

Status after setting the `alien cache name` (I scaled the node count down from 3 to 2):

| Workload | Storage |
| ------------ | ---------- |
| image | image |