NVIDIA / cloud-native-stack

Run cloud native workloads on NVIDIA GPUs
Apache License 2.0

GPU Driver Container Won't Start #54

Closed by BHSDuncan 1 month ago

BHSDuncan commented 6 months ago

I'm essentially seeing the problem described in this ticket: https://github.com/NVIDIA/gpu-operator/issues/564. It happens when I start up my machine, which runs a cluster with an older version of CNS installed (something like 9.x).

Because I'm using one of the playbooks from this repo, I'm not sure how to resolve this issue.

I'm also unsure why the issue is happening now. I've been running this on a machine since last fall, but the issue linked above pre-dates that.

Will updating to the latest CNS version solve this issue? Or will it still be a problem, given that the install.sh and Dockerfile(s) look pretty much the same? (I'll probably try this on a test box anyway, but I wanted to ask here as well.)

Thank you.

angudadevops commented 6 months ago

@BHSDuncan I would recommend trying CNS 10.4 or CNS 11.1 with the cns_nvidia_driver: yes flag set in the cns_values_10.4.yaml or cns_values_11.1.yaml file, and then triggering the installation. This installs the native TRD driver on the host, which works with the latest kernel.
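
As a minimal sketch, the flag mentioned above would look like this in the values file for the chosen release (cns_values_10.4.yaml or cns_values_11.1.yaml). Only this one key comes from this thread; the rest of the file is assumed to stay as shipped.

```yaml
# In cns_values_11.1.yaml (or cns_values_10.4.yaml).
# Install the NVIDIA driver natively on the host instead of
# relying on the GPU Operator's driver container.
cns_nvidia_driver: yes
```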

If you want the driver as part of the GPU Operator, then I would recommend waiting to hear from the GPU Operator team.

BHSDuncan commented 6 months ago

But that will install a driver on the host itself, right? I'd prefer to avoid installing anything on the machine and keep the driver in the cluster. For that, you're saying I'll need to wait for the GPU Operator team? If so, they've made it known they're working on a fix. Once the fix is in place, will the CNS playbooks need updating?

angudadevops commented 6 months ago

Yeah, if you look at this comment: https://github.com/NVIDIA/gpu-operator/issues/564#issuecomment-2020811882

So the current Operator is fixed for the latest kernel. We will validate it with CNS, and if any changes are required we will make them in CNS as well and let you know.

angudadevops commented 1 month ago

@BHSDuncan CNS has been updated with the new Operator version. Please check CNS version 11.3 and let us know.