galaxyproject / galaxy-helm

Minimal setup required to run Galaxy under Kubernetes
MIT License

GKE: storageclass.storage.k8s.io "nfs" not found #493

Closed Truongphikt closed 1 month ago

Truongphikt commented 1 month ago

Hi galaxy-helm team,

I want to run Galaxy on GKE (Google Kubernetes Engine) to host free bioinformatics courses. I created an Autopilot cluster and followed the README to install the chart with Helm. However, I hit an error that seems to be related to storage (I'm not sure).

$ helm install my-galaxy-release cloudve/galaxy
W0806 14:39:02.078486     962 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated DaemonSet default/my-galaxy-release-cvmfscsi-nodeplugin: defaulted unspecified 'cpu' resource for containers [registrar, nodeplugin, automount, automount-reconciler, singlemount] (see http://g.co/gke/autopilot-defaults).
W0806 14:39:02.361216     962 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment default/my-galaxy-release-cvmfscsi-controllerplugin: defaulted unspecified 'cpu' resource for containers [provisioner, controllerplugin] (see http://g.co/gke/autopilot-defaults).
W0806 14:39:02.485846     962 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment default/my-galaxy-release-rabbitmq: defaulted unspecified 'cpu' resource for containers [rabbitmq-cluster-operator] (see http://g.co/gke/autopilot-defaults).
W0806 14:39:02.492726     962 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment default/my-galaxy-release-postgres: adjusted 'cpu' resource to meet requirements for containers [postgresql] (see http://g.co/gke/autopilot-defaults).
W0806 14:39:02.526161     962 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment default/my-galaxy-release-web: adjusted 'cpu' resource to meet requirements for containers [galaxy-wait-db, galaxy-web] (see http://g.co/gke/autopilot-defaults).
W0806 14:39:02.590688     962 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment default/my-galaxy-release-rabbitmq-messaging-topology-operator: defaulted unspecified 'cpu' resource for containers [rabbitmq-cluster-operator] (see http://g.co/gke/autopilot-defaults).
W0806 14:39:02.611198     962 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment default/my-galaxy-release-nginx: adjusted 'cpu' resource to meet requirements for containers [galaxy-init-static, galaxy-nginx] (see http://g.co/gke/autopilot-defaults).
W0806 14:39:02.732096     962 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment default/my-galaxy-release-celery-beat: adjusted 'cpu' resource to meet requirements for containers [galaxy-wait-db, galaxy-celery-beat] (see http://g.co/gke/autopilot-defaults).
W0806 14:39:02.824096     962 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment default/my-galaxy-release-celery: adjusted 'cpu' resource to meet requirements for containers [galaxy-wait-db, galaxy-celery] (see http://g.co/gke/autopilot-defaults).
W0806 14:39:02.843424     962 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment default/my-galaxy-release-job-0: adjusted 'cpu' resource to meet requirements for containers [galaxy-wait-db, galaxy-job-0] (see http://g.co/gke/autopilot-defaults).
W0806 14:39:03.411692     962 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment default/my-galaxy-release-tusd: adjusted 'cpu' resource to meet requirements for containers [galaxy-tusd] (see http://g.co/gke/autopilot-defaults).
W0806 14:39:03.607657     962 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Deployment default/my-galaxy-release-workflow: adjusted 'cpu' resource to meet requirements for containers [galaxy-wait-db, galaxy-workflow] (see http://g.co/gke/autopilot-defaults).
Error: INSTALLATION FAILED: admission webhook "warden-validating.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-default-linux-capabilities]":["linux capability 'SYS_ADMIN' on container 'nodeplugin' not allowed; Autopilot only allows the capabilities: 'AUDIT_WRITE,CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,MKNOD,NET_BIND_SERVICE,NET_RAW,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT,SYS_PTRACE'.","linux capability 'SYS_ADMIN' on container 'automount' not allowed; Autopilot only allows the capabilities: 'AUDIT_WRITE,CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,MKNOD,NET_BIND_SERVICE,NET_RAW,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT,SYS_PTRACE'.","linux capability 'SYS_ADMIN' on container 'automount-reconciler' not allowed; Autopilot only allows the capabilities: 'AUDIT_WRITE,CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,MKNOD,NET_BIND_SERVICE,NET_RAW,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT,SYS_PTRACE'.","linux capability 'SYS_ADMIN' on container 'singlemount' not allowed; Autopilot only allows the capabilities: 'AUDIT_WRITE,CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,MKNOD,NET_BIND_SERVICE,NET_RAW,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT,SYS_PTRACE'."],"[denied by autogke-disallow-hostnamespaces]":["enabling hostPID is not allowed in Autopilot."],"[denied by autogke-disallow-privilege]":["container nodeplugin is privileged; not allowed in Autopilot","container automount is privileged; not allowed in Autopilot","container automount-reconciler is privileged; not allowed in Autopilot","container singlemount is privileged; not allowed in Autopilot"],"[denied by autogke-no-write-mode-hostpath]":["hostPath volume socket-dir in container registrar is accessed in write mode; disallowed in Autopilot.","hostPath volume registration-dir in container registrar is accessed in write mode; disallowed in Autopilot.","hostPath volume plugins-dir in container nodeplugin is accessed in write mode; disallowed in Autopilot.","hostPath volume pods-mount-dir in container nodeplugin is accessed in write mode; disallowed in Autopilot.","hostPath volume host-sys in container nodeplugin is 
accessed in write mode; disallowed in Autopilot.","hostPath volume host-dev in container nodeplugin is accessed in write mode; disallowed in Autopilot.","hostPath volume autofs-root in container nodeplugin is accessed in write mode; disallowed in Autopilot.","hostPath volume lib-modules used in container nodeplugin uses path /lib/modules which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume host-sys in container automount is accessed in write mode; disallowed in Autopilot.","hostPath volume host-dev in container automount is accessed in write mode; disallowed in Autopilot.","hostPath volume autofs-root in container automount is accessed in write mode; disallowed in Autopilot.","hostPath volume cvmfs-localcache in container automount is accessed in write mode; disallowed in Autopilot.","hostPath volume lib-modules used in container automount uses path /lib/modules which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume autofs-root in container automount-reconciler is accessed in write mode; disallowed in Autopilot.","hostPath volume cvmfs-localcache in container automount-reconciler is accessed in write mode; disallowed in Autopilot.","hostPath volume plugins-dir in container singlemount is accessed in write mode; disallowed in Autopilot.","hostPath volume pods-mount-dir in container singlemount is accessed in write mode; disallowed in Autopilot.","hostPath volume host-sys in container singlemount is accessed in write mode; disallowed in Autopilot.","hostPath volume host-dev in container singlemount is accessed in write mode; disallowed in Autopilot.","hostPath volume lib-modules used in container singlemount uses path /lib/modules which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/]."]}
Requested by user: 'phinguyen@ktest.vn', groups: 'system:authenticated'.
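The violation list above is long, but it comes from only a few Autopilot constraints. Below is a minimal sketch for condensing it, assuming the Helm error output has been saved to a file; the function name and the file name `helm-error.log` are hypothetical. For the log above it should list the four constraint names that appear in it (`autogke-default-linux-capabilities`, `autogke-disallow-hostnamespaces`, `autogke-disallow-privilege`, `autogke-no-write-mode-hostpath`).

```shell
# List the distinct GKE Warden constraints named in a saved Helm error log.
# $1: path to a file containing the "Violations details: {...}" output.
summarize_warden_violations() {
  grep -o 'denied by [a-z-]*' "$1" | sed 's/^denied by //' | sort -u
}

# Usage (file name is only an example):
#   helm install my-galaxy-release cloudve/galaxy 2> helm-error.log
#   summarize_warden_violations helm-error.log
```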


Has anyone run into this issue when deploying on GKE before? Please give me some recommendations on how to solve it. Thanks.

almahmoud commented 1 month ago

Hey @Truongphikt. Could you please provide more information on the setup? Did you deploy the NFS chart? From what you've provided, it seems the issue is that you don't have the nfs storage class. You either need to create a ReadWriteMany storage class named nfs, or change your values to point to a different storage class. If you could provide a snapshot of your values, with secret values (e.g., passwords) redacted, that would help us provide further support.
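As a rough sketch of the two options described above: the helper below checks whether a storage class exists, and the usage comments show how a different class might be passed to the chart. The helper name is hypothetical, and the `persistence.storageClass` key is an assumption about the Galaxy chart's values; `helm show values` can confirm the actual key.

```shell
# Returns 0 if the named StorageClass exists in the current cluster, non-zero otherwise.
storage_class_exists() {
  kubectl get storageclass "$1" >/dev/null 2>&1
}

# Usage:
#   kubectl get storageclass        # list the classes the cluster offers
#   storage_class_exists nfs || echo "no 'nfs' storage class found"
#
# To point the chart at another ReadWriteMany-capable class instead
# (key name is an assumption; check it first):
#   helm show values cloudve/galaxy | grep -n -i storageclass
#   helm install my-galaxy-release cloudve/galaxy --set persistence.storageClass=<your-rwx-class>
```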

ksuderman commented 1 month ago

You may also want to look at our integration tests for a working example of deploying Galaxy to a GKE cluster, although we don't use Autopilot. Besides the missing nfs storage class, you also seem to have a permission problem: "linux capability 'SYS_ADMIN' on container 'nodeplugin' not allowed", which seems to be Autopilot related, although it may be caused by the missing nfs storage class.
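Since the capability errors look tied to Autopilot, it can help to confirm which mode a cluster runs in. This is a sketch only: the function name is hypothetical, the cluster name and location are placeholders, and it assumes a `gcloud` version that accepts `--location` and prints the `autopilot.enabled` field as `True` for Autopilot clusters.

```shell
# Prints "autopilot" or "standard" for a GKE cluster.
# $1: cluster name, $2: cluster location (region or zone). Both are placeholders.
gke_cluster_mode() {
  enabled="$(gcloud container clusters describe "$1" --location "$2" \
    --format='value(autopilot.enabled)')"
  case "$enabled" in
    [Tt]rue) echo autopilot ;;
    *) echo standard ;;
  esac
}

# Usage:
#   gke_cluster_mode my-cluster us-central1
#
# Creating a Standard cluster instead (names and sizes are examples only):
#   gcloud container clusters create my-standard-cluster --zone us-central1-a --num-nodes 3
#   gcloud container clusters get-credentials my-standard-cluster --zone us-central1-a
```

Note that a failed `describe` call also yields empty output, so the function reports "standard" in that case; check for a `gcloud` error message before trusting the result.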

Truongphikt commented 1 month ago

@almahmoud Thanks for the quick support. I haven't deployed the NFS chart, so that makes sense! Can it be deployed with Helm, or does it need some other method? Here is more information on the setup, storage, and values.

Setup ![image](https://github.com/user-attachments/assets/0739cfb3-ce5c-4007-aea0-a5edc7d577f8) ![image](https://github.com/user-attachments/assets/6822d50e-15cd-4045-a610-308bf24a14a0) ![image](https://github.com/user-attachments/assets/7f365dc5-698f-4492-bc16-4a471f92efda) ![image](https://github.com/user-attachments/assets/fa099f75-f229-40f6-95be-f53674692749) ![image](https://github.com/user-attachments/assets/acd2ba8f-5a9f-4cd1-b691-8d6f37dcec41)
Storage ![image](https://github.com/user-attachments/assets/e0a19c54-c621-4af5-8621-e27a7c0b1362) ![image](https://github.com/user-attachments/assets/527ac728-5a18-47c1-9696-048c92066712)
values.yml

This is the `values.yml` I used (the original from the repo):

```
# Default values for Galaxy.
# Declare variables to be passed into your templates.

#- Partial override of the `galaxy.fullname`. The `.Release.Name` will be prepended to generate the fullname.
nameOverride: ""
#- Fully override the `galaxy.fullname`
fullnameOverride: ""

image:
  #- Repository containing the Galaxy image.
  repository: quay.io/galaxyproject/galaxy-min
  #- Galaxy Docker image tag (generally corresponds to the desired Galaxy version)
  tag: "24.1.1" # Galaxy versions prior to 24.1.1 contain a bug mapping the extra_files directory
  #- Galaxy image [pull policy](https://kubernetes.io/docs/concepts/configuration/overview/#container-images)
  pullPolicy: IfNotPresent

#- Secrets used to [access a Galaxy image from a private repository](https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/)
imagePullSecrets: []

trainingHook:
  #- Enable the GTN webhook to link references to tools in tutorials to the corresponding tool panel in Galaxy.
  enabled: false
  #- The training material server used to service the training-material webhook.
  url: https://training.galaxyproject.org/training-material/

service:
  #- The Galaxy service type
  type: ClusterIP
  #- The port Galaxy is listening to
  port: 8000
  #- The external port exposed on each node
  nodePort: 30700

workflowHandlers:
  replicaCount: 1
  startupDelay: 10 # used to avoid race conditions
  annotations: {}
  podAnnotations: {}
  podSpecExtra: {}
```

ksuderman commented 1 month ago

We use NFS Ganesha for NFS on Kubernetes.

helm repo add nfs-ganesha https://kubernetes-sigs.github.io/nfs-ganesha-server-and-external-provisioner/

Create a values file (say nfs-values.yml). You may need or want to change persistence.storageClass, persistence.size, and storageClass.defaultClass to suit your needs:

persistence:
  enabled: true
  storageClass: "standard"
  size: "250Gi"
storageClass:
  create: true
  defaultClass: true
  allowVolumeExpansion: true
  reclaimPolicy: "Retain"
  mountOptions:
    - vers=4.2
    - noatime

I'm not sure if the mountOptions are really needed, but this is what is used in our Galaxy Kubeman Helm chart.

You can then install with:

helm install nfs-provisioner -n nfs-provisioner nfs-ganesha/nfs-server-provisioner --create-namespace --values nfs-values.yml
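
After the install, one way to confirm the provisioner is up is sketched below. The function name is hypothetical. The StatefulSet name assumes the release name `nfs-provisioner` used in the command above, and the storage class name assumes the chart's `nfs` default.

```shell
# Wait for the NFS provisioner to roll out, then show the 'nfs' StorageClass.
# Assumes release "nfs-provisioner" in namespace "nfs-provisioner".
nfs_provisioner_ready() {
  kubectl rollout status statefulset/nfs-provisioner-nfs-server-provisioner \
    -n nfs-provisioner --timeout=180s &&
  kubectl get storageclass nfs
}

# Usage:
#   helm repo update          # refresh the chart index after "helm repo add"
#   nfs_provisioner_ready
```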

Truongphikt commented 1 month ago

I successfully created the nfs storage class on a Standard cluster by following @ksuderman's instructions:

LAST DEPLOYED: Sat Aug 10 03:54:02 2024
NAMESPACE: nfs-provisioner
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The NFS Provisioner service has now been installed.

A storage class named 'nfs' has now been created
and is available to provision dynamic volumes.

You can use this storageclass by creating a `PersistentVolumeClaim` with the
correct storageClassName attribute. For example:

    ---
    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: test-dynamic-volume-claim
    spec:
      storageClassName: "nfs"
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 100Mi
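
The sample claim in the notes above uses ReadWriteOnce, while the earlier comment in this thread asks for a ReadWriteMany storage class. A minimal sketch for checking that the new class can serve a ReadWriteMany claim follows; the function name and the claim name `test-rwx-claim` are hypothetical.

```shell
# Create a small ReadWriteMany claim against the 'nfs' storage class.
create_rwx_test_claim() {
  kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-rwx-claim
spec:
  storageClassName: "nfs"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Mi
EOF
}

# Usage:
#   create_rwx_test_claim
#   kubectl get pvc test-rwx-claim      # STATUS should become Bound if provisioning works
#   kubectl delete pvc test-rwx-claim   # clean up afterwards
```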


Note: creating the nfs storage on the Autopilot cluster fails with this error:

$ helm install nfs-provisioner -n nfs-provisioner nfs-ganesha/nfs-server-provisioner --create-namespace --values nfs-values.yml
W0810 09:57:03.349099     884 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated StatefulSet nfs-provisioner/nfs-provisioner-nfs-server-provisioner: defaulted unspecified 'cpu' resource for containers [nfs-server-provisioner] (see http://g.co/gke/autopilot-defaults).
Error: INSTALLATION FAILED: admission webhook "warden-validating.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-default-linux-capabilities]":["linux capability 'DAC_READ_SEARCH,SYS_RESOURCE' on container 'nfs-server-provisioner' not allowed; Autopilot only allows the capabilities: 'AUDIT_WRITE,CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,MKNOD,NET_BIND_SERVICE,NET_RAW,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT,SYS_PTRACE'."]}
Requested by user: 'phinguyen@ktest.vn', groups: 'system:authenticated'.