canonical / nvidia-core22

GNU General Public License v3.0

userspace NVIDIA libs out of sync with NVIDIA driver in pc-kernel #6

Closed jocado closed 1 year ago

jocado commented 1 year ago

Hi @xnox

It looks like the current status of components in pc-kernel and nvidia-core22 stable channels is:

pc-kernel

root@00620b04acaf:~# modinfo nvidia
filename:       /lib/modules/5.15.0-70-generic/kernel/nvidia-515srv/nvidia.ko
firmware:       nvidia/515.86.01/gsp.bin
alias:          char-major-195-*
version:        515.86.01

nvidia-core22

# ls -la /snap/nvidia-core22/current/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.105.01
-rw-r--r-- 1 root root 1683960 Feb 27 12:41 /snap/nvidia-core22/current/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.105.01

This unfortunately breaks some NVIDIA functionality. So far I have only checked the NVIDIA support in the docker snap, but I suspect other things rely on version matching too.
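The skew shown in the two listings above can be checked mechanically. The following is a minimal sketch, not something shipped by either snap: the helper names `lib_version`, `module_version`, and `versions_match` are hypothetical, and it assumes the library file name ends in the driver version and that `modinfo` prints a `version:` field, as in the output above.

```shell
#!/bin/sh
# Hypothetical helpers for comparing the kernel-side NVIDIA driver
# version with the userspace libnvidia-ml version.

# Print the version suffix of a libnvidia-ml.so.<version> path.
lib_version() {
    # Strip everything up to and including "libnvidia-ml.so."
    printf '%s\n' "${1##*libnvidia-ml.so.}"
}

# Read modinfo output on stdin and print the value of its "version:" field.
# The exact match on "version:" skips "srcversion:" and "vermagic:".
module_version() {
    awk '$1 == "version:" { print $2; exit }'
}

# Succeed (exit status 0) only when both version strings are identical.
versions_match() {
    [ "$1" = "$2" ]
}

# Example use on a device (commented out so the helpers can be sourced safely):
#   kernel=$(modinfo nvidia | module_version)
#   lib=$(lib_version /snap/nvidia-core22/current/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.105.01)
#   versions_match "$kernel" "$lib" || echo "mismatch: kernel $kernel, userspace $lib"
```

With the values from this report, the comparison would be between 515.86.01 (kernel module) and 515.105.01 (userspace library), which do not match.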

Is this accidental, or is it expected to happen from time to time? What's the best course of action here?

Cheers, Just

xnox commented 1 year ago

> This unfortunately breaks some NVIDIA functionality.

Can you demonstrate what exactly breaks? We have had some assurances before that, despite warnings, most things should work even with a mismatch between driver and userspace.

Currently there is no synchronization between promoting kernel & nvidia-core22 snaps on the store side. And separately no synchronization w.r.t. refreshing at runtime on devices. And I cannot add global validation sets to enforce that. I can ensure at least that things are promoted at roughly the same time.

If you have a brand store, you can use validation sets to at least request simultaneous refreshes.

Separately, the snapd team is working on a spec for a future ability to ship driver, firmware, and userspace libs in a snap separate from the kernel snap. But there are no timelines yet as to when it can be implemented.

Another option might be for you or us to ship a fat kernel snap that contains graphics-core22 libraries and driver.... But then it will not be the default kernel track.

Ideally small mismatches should work between newer runtime libs on a slightly older kernel API.

jocado commented 1 year ago

Sure. I've found so far that it breaks the container toolkit setup:

# $SNAP/usr/bin/nvidia-ctk cdi generate --nvidia-ctk-path "$SNAP/usr/bin/nvidia-ctk" --output="$SNAP_DATA/etc/cdi/nvidia.yaml"
ERRO[0000] failed to generate CDI spec: ERROR_LIB_RM_VERSION_MISMATCH 

Also, as a quick test I added nvidia-smi to the docker snap, and it failed with something similar:

# nvidia-smi 
Failed to initialize NVML: Driver/library version mismatch

The future snapd work sounds interesting.

We have a brand store, so I will look into validation sets, although it sounds like that will be something extra to manage on our side, and it may slow down some security update delivery if we have to wait for the kernel to catch up with the content provider snap. I will check it out.

We are trying to avoid shipping our own kernel, as it negates some of the value we get from the overall solution.

> Ideally small mismatches should work between newer runtime libs on a slightly older kernel API.

It seems that some parts of the NVIDIA framework are fussy!

Cheers, Just

xnox commented 1 year ago

W.R.T. update latency: there was a new NVIDIA release, and nvidia.ko was rebuilt and published together with the new userspace. The userspace snap was respun for it, but the kernel snap was not. Hence the delay.

We just had a kernel respin & release, including snap, and it went out just now. So things should be back to compatible. But yes, it is bad for userspace to be ahead of the driver.

I wonder if userspace being behind the driver is ok, or not? Are you able to do snap refresh nvidia-core22 --revision=14 (nvidia drivers 515.86.01+mesa22.2.5) whilst getting the latest pc-kernel 22/stable: 5.15.0-71.78.1 2023-04-26 (1281) which uses kernel driver 515.105.01?
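To tell the two directions of skew apart when testing this, the version strings have to be compared numerically per component, because a plain string comparison would rank 515.86.01 above 515.105.01. A rough sketch follows, assuming GNU `sort -V` is available; the function name `skew` and its output labels are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: classify the skew between a userspace library
# version ($1) and a kernel driver version ($2).
skew() {
    userspace=$1
    driver=$2
    if [ "$userspace" = "$driver" ]; then
        echo match
        return 0
    fi
    # sort -V (GNU coreutils) orders dotted versions numerically,
    # so 515.105.01 sorts after 515.86.01.
    newest=$(printf '%s\n%s\n' "$userspace" "$driver" | sort -V | tail -n 1)
    if [ "$newest" = "$userspace" ]; then
        echo userspace-ahead
    else
        echo userspace-behind
    fi
}

# Example use (commented out so the helper can be sourced safely):
#   skew 515.86.01 515.105.01
```

This only labels the direction of the skew; whether the `userspace-behind` case actually works is the open question above.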

jocado commented 1 year ago

Hi @xnox

Sorry for the radio silence on this issue. For now, we are probably going to try the validation sets approach. We are investigating that currently.

I will close this for now :+1: