aws-samples / comfyui-on-eks

ComfyUI on AWS
MIT No Attribution

"nvidia-gpu-operator" does not exist #10

Open kabelo-twala opened 1 month ago

kabelo-twala commented 1 month ago

Hi, this worked a few days ago, but now, after running cdk deploy Comfyui-Cluster, the stack rolls back with the following error:

Received response status [FAILED] from custom resource. Message returned: Error: b'Release "nvidia-gpu-operator" does not exist. Installing it now.\nError: failed to fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-v24.3.0.tgz : 401 Unauthorized\n' Logs: /aws/lambda/Comfyui-Cluster-awscdkawseksKubect-Handler886CB40B-Pn5FFiLl2QgE at invokeUserFunction

As you can imagine, CloudFormation events take a while to complete, and I've spent a couple of hours on this issue. Please assist.

kabelo-twala commented 1 month ago

I looked up the issue and found a workaround that worked.

I've updated the repository value in node_modules/@aws-quickstart/eks-blueprints/dist/addons/gpu-operator/index.js

from

const defaultProps = {
    name: "gpu-operator-addon",
    namespace: "gpu-operator",
    chart: "gpu-operator",
    version: "v24.3.0",
    release: "nvidia-gpu-operator",
    repository: "https://helm.ngc.nvidia.com/nvidia",
    createNamespace: true,
    values: {}
};

to

const defaultProps = {
    name: "gpu-operator-addon",
    namespace: "gpu-operator",
    chart: "gpu-operator",
    version: "v24.3.0",
    release: "nvidia-gpu-operator",
    repository: "https://nvidia.github.io/gpu-operator",
    createNamespace: true,
    values: {}
};

After that change, cdk deploy Comfyui-Cluster ran through and all the CloudFormation events completed.

Again, this is a temporary fix. It would be great if this were addressed. I'll also create an issue on the package so the @aws-quickstart owners can resolve it.
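
Since an edit under node_modules is lost on the next npm install, the same change could in principle be expressed as an override merged over the defaults. The sketch below is self-contained and only illustrates that merge: HelmProps and resolveProps are hypothetical names, and whether the add-on's constructor accepts a repository override like this is an assumption that has not been verified against @aws-quickstart/eks-blueprints.

```typescript
// Hypothetical stand-in type; the field names mirror the defaultProps quoted above.
interface HelmProps {
  name: string;
  namespace: string;
  chart: string;
  version: string;
  release: string;
  repository: string;
  createNamespace: boolean;
  values: Record<string, unknown>;
}

// Defaults as they appear in the add-on before the workaround.
const defaultProps: HelmProps = {
  name: "gpu-operator-addon",
  namespace: "gpu-operator",
  chart: "gpu-operator",
  version: "v24.3.0",
  release: "nvidia-gpu-operator",
  repository: "https://helm.ngc.nvidia.com/nvidia",
  createNamespace: true,
  values: {},
};

// Hypothetical helper: caller-supplied fields win over the defaults.
function resolveProps(overrides: Partial<HelmProps> = {}): HelmProps {
  return { ...defaultProps, ...overrides };
}

// Only the repository changes; chart, version and release stay as they were.
const patched = resolveProps({
  repository: "https://nvidia.github.io/gpu-operator",
});
console.log(patched.repository, patched.version);
```

If the add-on does accept such props, passing the repository where the add-on is constructed would keep the workaround in the project's own source instead of in an installed dependency.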

Shellmode commented 1 month ago

Thanks for your deep dive, @kabelo-twala. I'll try to reproduce the issue in a new environment.