Open jocado opened 1 year ago
I tried forcibly unloading the kernel modules, then loading again, and that does make it work without the extra reboot.
The only problem with that approach is, it's impossible to guarantee what will be using those modules at the time.
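For what it's worth, a rough check of whether anything is holding the modules can be made from the reference counts in /proc/modules. The sketch below is only an illustration: the module names are the usual NVIDIA ones and are an assumption here, `external_users` is a made-up helper, and subtracting one reference per dependent module is an approximation. A zero result only describes the moment of the check, so the race described above remains.

```python
# Assumed module names; adjust to what the driver snap actually ships.
NVIDIA_MODULES = ("nvidia_drm", "nvidia_modeset", "nvidia_uvm", "nvidia")


def external_users(proc_modules_text, modules=NVIDIA_MODULES):
    """Estimate, per module, references not explained by dependent modules.

    Each /proc/modules line looks like:
        name size refcount holders state address [taint]
    where holders is "-" or a comma-terminated list of dependent modules.
    """
    result = {}
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] not in modules:
            continue
        # The count is "-" on kernels built without module unloading.
        if not fields[2].isdigit():
            continue
        holders = [
            h for h in fields[3].split(",")
            if h and h != "-" and not h.startswith("[")
        ]
        # Approximation: assume each dependent module holds one reference.
        result[fields[0]] = int(fields[2]) - len(holders)
    return result
```

On a live system this could be called as `external_users(open("/proc/modules").read())`; any module with a value above zero is probably in use by something outside the driver stack.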
I suppose one pragmatic, but not particularly elegant/subtle, way forward would be to add a config flag to nvidia-assemble that enables a reboot option if the BUILD script is run [ new bits are compared and loaded ]. Something like snap set build.reboot=true. It's not pretty, but it allows one way forward.
This snap would need the shutdown interface in order to support it.
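To make the idea concrete, the decision could look roughly like the sketch below. Everything in it is hypothetical: the `build.reboot` option does not exist today, and `reboot_enabled` and `should_reboot` are illustrative names. It assumes the option value arrives as a string (for example from snapctl get in a hook) and treats an unset or unrecognised value as false.

```python
# Hypothetical: string values accepted as "enabled" for build.reboot.
TRUE_VALUES = {"true", "1", "yes", "on"}


def reboot_enabled(raw_value):
    """Interpret the raw option string; unset or unknown values mean False."""
    return raw_value.strip().lower() in TRUE_VALUES


def should_reboot(raw_value, modules_rebuilt):
    """Reboot only if the BUILD step actually ran and the flag is enabled."""
    return modules_rebuilt and reboot_enabled(raw_value)
```

Keeping the default off means existing installs would behave as they do now unless someone opts in.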
So this can happen when:
On boot:
I will see how this can be resolved inside the kernel snap & nvidia-assemble to prevent this.
Please let me know if I can help in any way, even if it's just testing, happy to do so.
I will try to move the module assembly from the nvidia-assemble snap into a kernel snap hook. This should ensure that only the matching driver from the matching kernel revision can be loaded.
@xnox Great - that sounds like a good way forward if possible :+1:
Would that make the nvidia assemble snap redundant ?
I wish, but I don't think so. mknod on every boot is still needed somehow.
Hi @xnox
I have noticed an issue after a kernel update, whereby the kernel modules don't seem matched to NVIDIA library expectations until the system is rebooted again [ following the initial system reboot triggered by the kernel snap update ].
This is partly related to the issue I reported here, where kernel modules and user space libs seem to need to be in sync: https://github.com/snapcore/nvidia-core22/issues/6
What I observe is the following:
The docker.nvidia-container-toolkit service fails to start. Also, docker run --gpus commands will fail.
Errors produced by the container toolkit are of the form:
Errors by the docker run are of the form:
I can reproduce the issue just by installing an older revision of the kernel snap, then after reboot installing the newer version again that matches the nvidia-core22 snap. I then get the above errors until I do one final reboot.
The strange thing is that the module info seems to reflect the correct kernel versions. So, for example, with nvidia-core22=515.105.01+mesa22.2.5 and pc-kernel=5.15.0-71.78.1, after installing I will see what looks like the correct version info:
...but it doesn't actually work until I do one final reboot.
I suspect the cause may be that nvidia-assemble doesn't run the BUILD script until after the reboot, so perhaps the modules need reloading after that, as they are already loaded during boot. What do you think?
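One way to test that suspicion would be to compare the version of the module that is actually running with the driver version implied by the nvidia-core22 snap. The sketch below is only an illustration: the function names are made up, it assumes the running version can be read from the "NVRM version" line of /proc/driver/nvidia/version, and it assumes the snap version has the form seen above (driver version before the "+").

```python
import re


def loaded_driver_version(proc_version_text):
    """Extract the first dotted version number on the 'NVRM version' line."""
    match = re.search(r"NVRM version:.*?(\d+\.\d+(?:\.\d+)*)", proc_version_text)
    return match.group(1) if match else None


def driver_version_from_snap(snap_version):
    """Assumed format: '<driver version>+<suffix>', e.g. '515.105.01+mesa22.2.5'."""
    return snap_version.split("+", 1)[0]


def versions_match(proc_version_text, snap_version):
    """True if the running module matches the snap's user space driver version."""
    loaded = loaded_driver_version(proc_version_text)
    return loaded is not None and loaded == driver_version_from_snap(snap_version)
```

If this reports a mismatch after the first reboot but a match after the second, that would point at stale modules being loaded before BUILD runs.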
Can you think of any way that the extra reboot could be avoided ?
Thanks!