Closed: jocado closed this 4 months ago
@lucaskanashiro PTAL
I started work on this after the initial PR you reviewed relating to NVIDIA support. For that reason, and because I was waiting for a change in snapd, I decided to keep it separate.
If you're able to help review it I would be grateful. Thank you!
@lucaskanashiro The interface change this depends on is now in the stable version of snapd [ 2.63 ].
Can you please review this PR?
Hello @jocado, I've tested your PR on a jammy machine trying to run an [nvidia sample workload]
Hi @locnnil
Thanks very much for taking a look :+1:
Without installing the nvidia-core22 snap I had this error:
$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown.
I believe that is a simple PATH issue: the nvidia CDI config doesn't take care of executable paths at all, only the LD library paths.
If you run it like this, it should work for you: [ but please see notes below about the nvidia-core22 snap to fix your installation first ]
docker run --rm --runtime=nvidia --gpus all -it ubuntu /var/lib/snapd/hostfs/usr/bin/nvidia-smi
Fri Jun 7 18:43:00 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:18:00.0 Off | 0 |
| N/A 23C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:19:00.0 Off | 0 |
| N/A 24C P8 8W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:35:00.0 Off | 0 |
| N/A 24C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:36:00.0 Off | 0 |
| N/A 23C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 Tesla T4 Off | 00000000:E7:00.0 Off | 0 |
| N/A 25C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 Tesla T4 Off | 00000000:E8:00.0 Off | 0 |
| N/A 25C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 Tesla T4 Off | 00000000:F4:00.0 Off | 0 |
| N/A 24C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 Tesla T4 Off | 00000000:F5:00.0 Off | 0 |
| N/A 25C P8 10W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
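As an aside, instead of spelling out the full hostfs path each time, one possible workaround (just a sketch, assuming the host binaries stay visible under /var/lib/snapd/hostfs inside the container, as in the command above) is to extend PATH for the container:
docker run --rm --runtime=nvidia --gpus all -e PATH=/var/lib/snapd/hostfs/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin ubuntu nvidia-smi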
None of our use and test cases really care about the binary path there; they are just software written against the CUDA platform, and the nvidia/cuda libs are discovered properly. So I must admit I hadn't really considered binary path discovery.
I don't know if we should fix that, or just document the behaviour. What do you think?
After installing the nvidia-core22 snap I got this error:
$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
docker: Error response from daemon: unknown or invalid runtime name: nvidia. See 'docker run --help'.
That is very unlikely to work. If the user space library versions available to the snap and the host's nvidia kernel module version don't match, then the nvidia toolkit and all of the CUDA functionality will fail with a version mismatch error. So, the kernel module version on your host would have to match the user space library versions in the current nvidia-core22 snap. It's possible, but not that likely by chance.
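For reference, a quick way to check the kernel side of such a mismatch (a sketch using standard tools; it doesn't show the library layout inside the nvidia-core22 snap):
# Driver version of the nvidia kernel module loaded on the host
cat /proc/driver/nvidia/version
# Or equivalently, from the module metadata
modinfo nvidia | grep ^version
This version needs to line up with the user space driver libraries provided by the nvidia-core22 snap.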
You get the invalid runtime error in this case because the snap service responsible for setting up the runtime and CDI config [ docker.nvidia-container-toolkit ] will remove the runtime from the dockerd config if the setup process fails.
You can check the logs of that service for more details in case of any error.
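For example (a sketch; the service name is the one mentioned above, and the docker info filter simply shows which runtimes dockerd currently knows about):
# Logs of the runtime/CDI setup service
sudo snap logs docker.nvidia-container-toolkit -n 50
# Runtimes currently registered with dockerd
sudo docker info --format '{{json .Runtimes}}'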
FYI there is a snap hook which runs when the graphics-core22 snap content interface is connected. To recover from the situation where you don't want the nvidia-core22 snap, you have to remove that snap and then restart the docker snap [ probably snap restart docker, to restart everything, is best ].
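Concretely, the recovery would look something like this (a sketch of the steps just described):
sudo snap remove nvidia-core22
sudo snap restart docker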
I hope that helps explain things a bit further. But please let me know if you would like any more info on that particular point.
Thanks!
@jocado thank you very much for the information!
I don't know if we should fix that, or just document the behaviour. What do you think?
Regarding that, it's important to document it for the time being, with at least a quick note in the README.md mentioning this behaviour. I'll file a bug report to have it fixed later. Thank you very much for your contributions; we appreciate them!
All done from my point of view. Do you want me to rebase and squash any commits, or are you happy as is?
Thanks @jocado for the PR and @farshidtz and @locnnil for the reviews!
LGTM, +1.
There is revision 2926 in the latest/edge channel with those changes, please test it!
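For anyone following along, switching an installed docker snap to that channel and confirming the revision would look roughly like this (a sketch using standard snap commands):
sudo snap refresh docker --channel=latest/edge
snap list docker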
Hi @lucaskanashiro Thanks for merging :+1:
However, I just went to test, and found that the content of the snap in revision 2926 in latest/edge doesn't seem correct. It seems to be an older version of the code, and is therefore not working as expected.
Can you please check the status?
For instance, the nvidia lib is missing stuff, and of course the hashes don't match:
It should be:
$ md5sum nvidia/lib
19d3b6e62ea9154c0ce93d40eafc5dc7 nvidia/lib
But it is:
# md5sum /snap/docker/current/usr/share/nvidia-container-toolkit/lib
6d08e2549e99315b7faf2e45c7c050a3 /snap/docker/current/usr/share/nvidia-container-toolkit/lib
FYI @locnnil and @farshidtz
@jocado thanks for the heads-up! Let me check this out. In any case, I've been working to automate the publishing process, which should avoid this kind of issue in the future.
@jocado could you please test revision 2927 in the latest/edge channel? It should be fixed now.
Thanks @lucaskanashiro - tested and looks good to me :+1:
This enables the use of the nvidia runtime configuration and support on Ubuntu Classic systems as well.
Summary of changes:
CDI config generation requires a small opengl snap interface change, which has been merged and is in snapd-2.63, currently in the beta channel.
fixes https://github.com/docker-snap/docker-snap/issues/127
fixes https://github.com/docker-snap/docker-snap/issues/148
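As a quick sanity check on a target system, the prerequisites above can be verified with standard snap commands (a sketch; the interface and version are the ones referenced in this PR):
# snapd must provide the updated opengl interface (2.63 or later)
snap version
# confirm the opengl interface is connected for the docker snap
snap connections docker | grep opengl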