NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Suggestion: Self-diagnostic command #179

Open rgov opened 1 year ago

rgov commented 1 year ago

My experience with the NVIDIA Docker integration across two PCs and a few Jetson devices has been bumpy, and the error messages are often fairly inscrutable. I've had a working system break a few days later due to an unattended upgrade.

It would probably help reduce the number of support requests if there were a self-diagnostic script that could look for common misconfiguration issues and/or format a bug report with all the relevant info for the user. The Homebrew project does this (brew doctor) and it turns out to be convenient for both users and project admins.

Over 1,500 issues have been filed in this repo, and 1 in 5 mention some "Error response from daemon" message, like this one, which doesn't give me enough information to remedy the situation:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: insufficient permissions: unknown.

This problem appears to be related to the permissions of /dev/nvidia* when VirtualGL is set up on the host (see the linked solution comment). This is an example of something that would be really easy for a script to detect but takes a bit of digging for the user to solve. (Also, it hopefully wouldn't be that hard to say something more useful than "unknown" in this error message.)
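
To make the idea concrete, here is a rough sketch of what one such check could look like. It is not part of nvidia-container-toolkit; the function names (is_world_rw, check_device_nodes) are made up for illustration. It assumes the device nodes are normally world-readable and world-writable (mode 0666), so anything more restrictive is only flagged as a possible cause, not a confirmed one.

```python
import glob
import os
import stat


def is_world_rw(mode):
    """True if the permission bits grant read and write to 'other'."""
    return (mode & 0o006) == 0o006


def check_device_nodes(pattern="/dev/nvidia*"):
    """Return a list of (path, message) findings for nodes matching pattern.

    Hypothetical helper: the 0666 expectation is an assumption, not
    something the toolkit documents as a requirement.
    """
    paths = sorted(glob.glob(pattern))
    if not paths:
        return [(pattern, "no device nodes found; is the driver loaded?")]

    findings = []
    for path in paths:
        try:
            st = os.stat(path)
        except OSError as exc:
            findings.append((path, f"cannot stat: {exc}"))
            continue
        if not stat.S_ISCHR(st.st_mode):
            continue  # skip anything that is not a character device
        mode = stat.S_IMODE(st.st_mode)
        if not is_world_rw(mode):
            findings.append(
                (path, f"mode {mode:04o} is not world read/write; "
                       "containers may fail with 'insufficient permissions'")
            )
    return findings


if __name__ == "__main__":
    for path, message in check_device_nodes():
        print(f"Warning: {path}: {message}")
```

A real "doctor" command would presumably run many checks like this and print them together, the way brew doctor does.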

leobenkel commented 6 months ago

Could you give more details on how you solved it? I followed your link but still can't make it work.