NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Fix for "Failed to initialize NVML: Unknown Error" #564

Open bryantbiggs opened 4 days ago

bryantbiggs commented 4 days ago

Issue https://github.com/NVIDIA/nvidia-container-toolkit/issues/48, which is locked, states:

A fix will be present in the next patch release of all supported NVIDIA GPU drivers

Given that the issue was opened in Feb. 2023, has that fix landed in the NVIDIA data center drivers, and if so, starting with which version?

elezar commented 3 days ago

@cdesiniotis do you know which Data Center drivers include this fix?

SQUIDwarrior commented 3 days ago

I am also experiencing this issue (#538), despite applying all of the workarounds mentioned. Clearly this issue has not been fully addressed.

bryantbiggs commented 3 days ago

@SQUIDwarrior are you setting compatWithCPUManager = true in your NVIDIA device plugin Helm chart values by chance? https://github.com/NVIDIA/k8s-device-plugin/blob/6cf4f2bd35ff1d055b808eb7abef106e1e5d3f08/deployments/helm/nvidia-device-plugin/values.yaml#L30
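
For reference, a minimal sketch of the override in question, assuming the key sits at the top level of the chart's values.yaml as the linked line suggests (check the values.yaml for your chart version before relying on it):

```yaml
# Hypothetical values override file for the nvidia-device-plugin Helm chart.
# Assumption: compatWithCPUManager is a top-level key, per the linked values.yaml line.
compatWithCPUManager: true
```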

SQUIDwarrior commented 3 days ago

@bryantbiggs yes