canonical / nvidia-assemble


Kernel update requires additional reboot #5

Open jocado opened 1 year ago

jocado commented 1 year ago

Hi @xnox

I have noticed an issue after a kernel update, whereby the kernel modules don't seem to match what the NVIDIA user-space libraries expect until the system is rebooted again [ following the initial system reboot triggered by the kernel snap update ].

This is partly related to the issue I reported here, where the kernel modules and user-space libs seem to need to be in sync: https://github.com/snapcore/nvidia-core22/issues/6

What I observe is the following:

Errors produced by the container toolkit are of the form:

2023-05-22T13:44:45Z docker.nvidia-container-toolkit[3608]: time="2023-05-22T13:44:45Z" level=error msg="failed to generate CDI spec: ERROR_LIB_RM_VERSION_MISMATCH"

Errors from docker run are of the form:

nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.

I can reproduce the issue just by installing an older revision of the kernel snap, rebooting, and then installing the newer revision again that matches the nvidia-core22 snap; I get the above errors until I do one final reboot.
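
In snap terms, the reproduction is roughly the following [ an illustrative sketch only; the revision selectors are placeholders and the exact invocations may differ per device ]:

# snap refresh pc-kernel --revision=<older-revision>
# reboot
# snap refresh pc-kernel --revision=<revision-matching-nvidia-core22>
# reboot
# docker run ... nvidia-smi    [ fails with the errors above until one more reboot ]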

The strange thing is that the module info seems to reflect the correct kernel versions. So, for example, with nvidia-core22=515.105.01+mesa22.2.5 and pc-kernel=5.15.0-71.78.1, after installing I will see what looks like the correct version info:

# modinfo nvidia |grep version
version:        515.105.01
srcversion:     5B96463A6EE6DE62E590DF0
vermagic:       5.15.0-71-generic SMP mod_unload modversions 
# modinfo nvidia_drm |grep version
version:        515.105.01
srcversion:     E6AA36496C051B52463BD24
vermagic:       5.15.0-71-generic SMP mod_unload modversions 
# modinfo nvidia_uvm |grep version
srcversion:     7EC9D25908B29657D3E6BEF
vermagic:       5.15.0-71-generic SMP mod_unload modversions 

...but it doesn't actually work until I do one final reboot.

I suspect the cause may be that nvidia-assemble doesn't run the BUILD script until after the reboot, so perhaps the modules need reloading after that, as they are already loaded during boot.
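
One way to confirm that [ a hypothetical check, not something the snap does itself ] is to compare the module file on disk with what the running kernel reports, since modinfo only reads the .ko on disk:

# cat /proc/driver/nvidia/version    [ version of the currently loaded module ]
# modinfo nvidia |grep ^version      [ version of the .ko file on disk ]

If those disagree, the loaded modules are stale even though modinfo looks right.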

What do you think?

Can you think of any way that the extra reboot could be avoided?

Thanks!

jocado commented 1 year ago

I tried forcibly unloading the kernel modules, then loading them again, and that does make it work without the extra reboot.

The only problem with that approach is that it's impossible to guarantee what will be using those modules at the time.
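
Roughly what I mean by unload/reload [ module set and ordering shown for illustration only; nvidia_drm in particular can be held busy by a display server ]:

# lsmod |grep nvidia              [ check what is currently using the modules ]
# modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia
# modprobe nvidia
# modprobe nvidia_modeset
# modprobe nvidia_drm
# modprobe nvidia_uvm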

jocado commented 1 year ago

I suppose one pragmatic, but not particularly elegant, way forward would be to add a config flag to nvidia-assemble that enables a reboot if the BUILD script actually runs [ i.e. new bits are compared and loaded ]. Something like snap set nvidia-assemble build.reboot=true. It's not pretty, but it allows one way forward.

This snap would need the shutdown interface in order to support it.
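
As a sketch of what that might look like [ entirely hypothetical; the variable name and wording are invented for illustration ], the wrapper around BUILD could end with something like:

#!/bin/sh
# modules_updated would be set earlier by the BUILD step when it has
# installed new modules over ones that are already loaded [ hypothetical ]
if [ "$(snapctl get build.reboot)" = "true" ] && [ "$modules_updated" = "yes" ]; then
    shutdown -r +1 "nvidia-assemble: rebooting to load the rebuilt NVIDIA modules"
fi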

xnox commented 1 year ago

So this can happen when:

On boot:

I will see how this can be resolved inside the kernel snap & nvidia-assemble to prevent this.

jocado commented 1 year ago

Please let me know if I can help in any way, even if it's just testing; I'm happy to do so.

xnox commented 1 year ago

I will try to move the module assembly from the nvidia-assemble snap into a kernel snap hook. That way only the driver built for the matching kernel revision can be loaded.

jocado commented 1 year ago

@xnox Great - that sounds like a good way forward if possible :+1:

Would that make the nvidia-assemble snap redundant?

xnox commented 1 year ago

> @xnox Great - that sounds like a good way forward if possible :+1:
>
> Would that make the nvidia-assemble snap redundant?

I wish, but I don't think so. mknod on every boot is still needed somehow.
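
For context, that per-boot step is roughly the following [ an illustrative sketch only, not the snap's actual script; nvidia-uvm gets a dynamic major number, so it has to be read from /proc/devices ]:

#!/bin/sh
# device nodes the NVIDIA user-space libraries expect to exist
mknod -m 666 /dev/nvidiactl c 195 255
mknod -m 666 /dev/nvidia0 c 195 0
# nvidia-uvm is registered with a dynamic major number
uvm_major="$(awk '$2 == "nvidia-uvm" {print $1}' /proc/devices)"
mknod -m 666 /dev/nvidia-uvm c "$uvm_major" 0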