Closed: jocado closed this 4 months ago
@lucaskanashiro PTAL
I started work on this after the initial PR you reviewed relating to NVIDIA support. For that reason, and because I was waiting for a change in snapd, I decided to keep it separate.
If you're able to help review it I would be grateful. Thank you!
@lucaskanashiro The interface change this depends on is now in the stable version of snapd [ 2.63 ].
Can you please review this PR?
Hello @jocado, I've tested your PR on a jammy machine trying to run an [nvidia sample workload]
Hi @locnnil
Thanks very much for taking a look :+1:
Without installing the nvidia-core22 snap I had this error:
$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown.
I believe that is a simple PATH issue: the nvidia CDI config doesn't take care of executable paths at all, only the LD library paths.
If you run it like this, it should work for you: [ but please see notes below about the nvidia-core22 snap to fix your installation first ]
docker run --rm --runtime=nvidia --gpus all -it ubuntu /var/lib/snapd/hostfs/usr/bin/nvidia-smi
Fri Jun 7 18:43:00 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:18:00.0 Off | 0 |
| N/A 23C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:19:00.0 Off | 0 |
| N/A 24C P8 8W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:35:00.0 Off | 0 |
| N/A 24C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:36:00.0 Off | 0 |
| N/A 23C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 Tesla T4 Off | 00000000:E7:00.0 Off | 0 |
| N/A 25C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 Tesla T4 Off | 00000000:E8:00.0 Off | 0 |
| N/A 25C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 Tesla T4 Off | 00000000:F4:00.0 Off | 0 |
| N/A 24C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 Tesla T4 Off | 00000000:F5:00.0 Off | 0 |
| N/A 25C P8 10W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
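As an aside, instead of spelling out the full hostfs path each time, one possible workaround (just a sketch, assuming the host binaries stay visible under /var/lib/snapd/hostfs inside the container, as in the command above) is to extend PATH for the container:
docker run --rm --runtime=nvidia --gpus all -e PATH=/var/lib/snapd/hostfs/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin ubuntu nvidia-smi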
None of our use and test cases really care about the binary path there; they are just software written against the CUDA platform, and the nvidia/cuda libs are discovered properly. So I must admit I hadn't really considered binary path discovery.
I don't know if we should fix that, or just document the behaviour. What do you think?
After installing the nvidia-core22 snap I got this error:
$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
docker: Error response from daemon: unknown or invalid runtime name: nvidia. See 'docker run --help'.
That is very unlikely to work. If the user space library versions available to the snap and the host's nvidia kernel module version don't match, then the nvidia toolkit and all of the CUDA functionality will fail with a version mismatch error. So, the kernel module version on your host would have to match the user space library versions in the current nvidia-core22 snap. It's possible, but not that likely by chance.
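For reference, a quick way to check the kernel side of such a mismatch (a sketch using standard tools; it doesn't show the library layout inside the nvidia-core22 snap):
# Driver version of the nvidia kernel module loaded on the host
cat /proc/driver/nvidia/version
# Or equivalently, from the module metadata
modinfo nvidia | grep ^version
This version needs to line up with the user space driver libraries provided by the nvidia-core22 snap.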
You get the invalid runtime error in this case because the snap service responsible for setting up the runtime and CDI config [ docker.nvidia-container-toolkit ] will remove the runtime from the dockerd config if the setup process fails.
You can check the logs of that service for more details in case of any error.
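For example (a sketch; the service name is the one mentioned above, and the docker info filter simply shows which runtimes dockerd currently knows about):
# Logs of the runtime/CDI setup service
sudo snap logs docker.nvidia-container-toolkit -n 50
# Runtimes currently registered with dockerd
sudo docker info --format '{{json .Runtimes}}'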
FYI there is a snap hook which runs when the graphics-core22 snap content interface is connected. To recover from the situation where you don't want the nvidia-core22 snap, you have to remove that snap and then restart the docker snap [ probably snap restart docker, to restart everything, is best ].
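Concretely, the recovery would look something like this (a sketch of the steps just described):
sudo snap remove nvidia-core22
sudo snap restart docker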
I hope that helps explain things a bit further. But please let me know if you would like any more info on that particular point.
Thanks!
@jocado thank you very much for the information!
I don't know if we should fix that, or just document the behaviour. What do you think?
Regarding that, it's important to document it for the time being, with at least a quick note in the README.md mentioning this behaviour. I'll file a bug report to have it fixed later. Thank you very much for your contributions; we appreciate them!
All done from my point of view. Do you want me to rebase and squash any commits, or are you happy as is?
Thanks @jocado for the PR and @farshidtz and @locnnil for the reviews!
LGTM, +1.
There is revision 2926 in the latest/edge channel with those changes, please test it!
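For anyone following along, switching an installed docker snap to that channel and confirming the revision would look roughly like this (a sketch using standard snap commands):
sudo snap refresh docker --channel=latest/edge
snap list docker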
Hi @lucaskanashiro Thanks for merging :+1:
However, I just went to test, and found that the content of the snap in revision 2926 in latest/edge doesn't seem correct. It seems to be an older version of the code, and is therefore not working as expected.
Can you please check the status?
For instance, the nvidia lib is missing stuff, and of course the hashes don't match:
It should be:
$ md5sum nvidia/lib
19d3b6e62ea9154c0ce93d40eafc5dc7 nvidia/lib
But it is:
# md5sum /snap/docker/current/usr/share/nvidia-container-toolkit/lib
6d08e2549e99315b7faf2e45c7c050a3 /snap/docker/current/usr/share/nvidia-container-toolkit/lib
FYI @locnnil and @farshidtz
@jocado thanks for the heads-up! Let me check this out. In any case, I've been working to automate the publishing process, which should avoid this kind of issue in the future.
@jocado could you please test revision 2927 in the latest/edge channel? It should be fixed now.
Thanks @lucaskanashiro - tested and looks good to me :+1:
This enables the use of the nvidia runtime configuration and support on Ubuntu Classic systems as well.
Summary of changes:
CDI config generation requires a small opengl snap interface change, which has been merged and is in snapd-2.63, currently in the beta channel.
fixes https://github.com/docker-snap/docker-snap/issues/127
fixes https://github.com/docker-snap/docker-snap/issues/148
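As a quick sanity check on a target system, the prerequisites above can be verified with standard snap commands (a sketch; the interface and version are the ones referenced in this PR):
# snapd must provide the updated opengl interface (2.63 or later)
snap version
# confirm the opengl interface is connected for the docker snap
snap connections docker | grep opengl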