aws-samples / comfyui-on-eks

ComfyUI on AWS

FailedMount on comfyui #3

Closed · ande28em closed this 8 months ago

ande28em commented 8 months ago

After successfully following all of the prior steps and deploying the comfyui service, the pod is stuck in ContainerCreating due to a failed volume mount. What am I missing?

Admin:~/environment $ kubectl get pods --all-namespaces
NAMESPACE      NAME                                                              READY   STATUS              RESTARTS   AGE
default        comfyui-67fd98b7-4zk8f                                            0/1     ContainerCreating   0          23m
gpu-operator   gpu-feature-discovery-nqm8z                                       1/1     Running             0          19m
gpu-operator   gpu-operator-657b8ffcc-clnf7                                      1/1     Running             0          84m
gpu-operator   nvidia-container-toolkit-daemonset-kgcdn                          1/1     Running             0          19m
gpu-operator   nvidia-cuda-validator-gjkxk                                       0/1     Completed           0          19m
gpu-operator   nvidia-dcgm-exporter-s65k7                                        1/1     Running             0          19m
gpu-operator   nvidia-device-plugin-daemonset-sbdhm                              1/1     Running             0          19m
gpu-operator   nvidia-gpu-operator-node-feature-discovery-gc-64bc8485cd-plql6    1/1     Running             0          84m
gpu-operator   nvidia-gpu-operator-node-feature-discovery-master-7fb4d549dtkff   1/1     Running             0          84m
gpu-operator   nvidia-gpu-operator-node-feature-discovery-worker-9qd95           1/1     Running             0          84m
gpu-operator   nvidia-gpu-operator-node-feature-discovery-worker-krbbz           1/1     Running             0          84m
gpu-operator   nvidia-gpu-operator-node-feature-discovery-worker-vhvrl           1/1     Running             0          19m
gpu-operator   nvidia-operator-validator-gxjt9                                   1/1     Running             0          19m
karpenter      karpenter-846f4df548-gzmkq                                        1/1     Running             0          84m
kube-system    aws-load-balancer-controller-6b9fd85d4c-dwqdp                     1/1     Running             0          84m
kube-system    aws-load-balancer-controller-6b9fd85d4c-srqk4                     1/1     Running             0          84m
kube-system    aws-node-d299g                                                    2/2     Running             0          20m
kube-system    aws-node-k2jb4                                                    2/2     Running             0          88m
kube-system    aws-node-wjmb4                                                    2/2     Running             0          88m
kube-system    coredns-d9b6d6c7d-cr7kv                                           1/1     Running             0          91m
kube-system    coredns-d9b6d6c7d-xxd6j                                           1/1     Running             0          91m
kube-system    kube-proxy-mtdl4                                                  1/1     Running             0          88m
kube-system    kube-proxy-sthhx                                                  1/1     Running             0          88m
kube-system    kube-proxy-vtgq6                                                  1/1     Running             0          20m
kube-system    s3-csi-node-bqt29                                                 3/3     Running             0          24m
kube-system    s3-csi-node-pv9h9                                                 3/3     Running             0          19m
kube-system    s3-csi-node-qn4cb                                                 3/3     Running             0          24m
kube-system    ssm-installer-6zfvl                                               1/1     Running             0          84m
kube-system    ssm-installer-7tvct                                               1/1     Running             0          19m
kube-system    ssm-installer-jkt7x                                               1/1     Running             0          84m
Admin:~/environment $ kubectl describe pod comfyui-67fd98b7-4zk8f
...
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  30m                   default-scheduler  0/2 nodes are available: 2 Insufficient nvidia.com/gpu. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod..
  Normal   Nominated         28m (x2 over 30m)     karpenter          Pod should schedule on: machine/karpenter-provisioner-tmk8t
  Normal   Nominated         26m                   karpenter          Pod should schedule on: machine/karpenter-provisioner-tmk8t, node/ip-10-2-145-4.ec2.internal
  Warning  FailedScheduling  26m                   default-scheduler  0/3 nodes are available: 3 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
  Warning  FailedScheduling  25m                   default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 1 Preemption is not helpful for scheduling, 2 No preemption victims found for incoming pod..
  Normal   Scheduled         25m                   default-scheduler  Successfully assigned default/comfyui-67fd98b7-4zk8f to ip-10-2-145-4.ec2.internal
  Warning  FailedMount       3m25s (x10 over 23m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[comfyui-outputs], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
  Warning  FailedMount       3m17s (x19 over 25m)  kubelet            MountVolume.SetUp failed for volume "comfyui-outputs-pv" : rpc error: code = Internal desc = Could not mount "comfyui-outputs-872646166659-us-east-1" at "/var/lib/kubelet/pods/b840e007-4461-4a81-b79c-cc1f13de6beb/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount": Could not check if "/var/lib/kubelet/pods/b840e007-4461-4a81-b79c-cc1f13de6beb/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount" is a mount point: stat /var/lib/kubelet/pods/b840e007-4461-4a81-b79c-cc1f13de6beb/volumes/kubernetes.io~csi/comfyui-outputs-pv/mount: no such file or directory, Failed to read /host/proc/mounts: open /host/proc/mounts: invalid argument
Admin:~/environment $ kubectl get pv
NAME                 CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                         STORAGECLASS   REASON   AGE
comfyui-outputs-pv   1200Gi     RWX            Retain           Bound    default/comfyui-outputs-pvc                           25m
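
A useful next step for this kind of FailedMount is to pull the S3 CSI driver's own logs, which usually contain the underlying mount error, and to check which host OS the nodes are running, since driver compatibility depends on it. A sketch, assuming the s3-csi-node DaemonSet shown in the pod listing above:

# Dump logs from all containers of the S3 CSI driver DaemonSet
# (kubectl picks one of the s3-csi-node pods; target a specific pod if needed)
kubectl logs -n kube-system daemonset/s3-csi-node --all-containers

# Show the node OS images (Amazon Linux vs. Bottlerocket) in the OS-IMAGE column
kubectl get nodes -o wide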
Shellmode commented 8 months ago

It's because the newer version (v1.4.0) of mountpoint-s3-csi-driver is only compatible with Bottlerocket, while we're using Amazon Linux. You need to downgrade the mountpoint-s3-csi-driver add-on to v1.0.0; refer to the Distros Support Matrix.

I'll update the code later to fix that. You can also try it yourself.
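
To confirm which version of the add-on is currently installed before downgrading, one option is to query it directly; a sketch, assuming the cluster name Comfyui-Cluster used in this repo:

# Print the installed version of the S3 CSI driver add-on
aws eks describe-addon \
  --cluster-name Comfyui-Cluster \
  --addon-name aws-mountpoint-s3-csi-driver \
  --query 'addon.addonVersion' \
  --output text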

Shellmode commented 8 months ago

Run the following commands to install the fixed version of the aws-mountpoint-s3-csi-driver add-on:

region="us-west-2" # Modify the region to your current region.
account=$(aws sts get-caller-identity --query Account --output text)
eksctl create addon --name aws-mountpoint-s3-csi-driver \
  --version v1.0.0-eksbuild.1 \
  --cluster Comfyui-Cluster \
  --service-account-role-arn "arn:aws:iam::${account}:role/EKS-S3-CSI-DriverRole-${account}-${region}" \
  --force

Doc has been revised.
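
After the downgrade, the stuck pod may still need to be recreated so the kubelet retries the mount with the new driver. A sketch, assuming the Deployment is named comfyui (inferred from the pod name above; adjust if yours differs):

# Verify the downgraded add-on version took effect
eksctl get addon --name aws-mountpoint-s3-csi-driver --cluster Comfyui-Cluster

# Recreate the comfyui pod so the volume mount is attempted again
kubectl rollout restart deployment/comfyui

# Watch until the pod reaches Running and the FailedMount warnings stop
kubectl get pods -w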