Open SomeoneSerge opened 1 year ago
I've actually been looking for a solution to the "update-causes-version-mismatch" problem, to make it possible to backport nvidia driver updates.
What I've been considering is a variant of the /etc/static link dance, in combination with tmpfiles rules, to link the active userspace library to the loaded kernel module.
Like this:
/run/opengl-driver/lib/libcuda.so.1 -> /run/nvidia/current/lib/libcuda.so.1
/run/nvidia/current -> /run/nvidia/<version>
/run/nvidia/<version> -> /nix/store/nvidia-x11-<version>-<hash>
What this will require is:
Mirroring /nix/store/nvidia-x11-<version>-<hash> into /run/nvidia/current. This mirrored derivation then gets added to hardware.opengl.extraPackages.
Creating the /run/nvidia symlinks for both the version and current at boot. This can be expanded to a udev rule to support runtime upgrades of the nvidia driver.
Turning the /run/nvidia/current symlink into a gc root so it doesn't get garbage collected.
Note that /run/nvidia is a placeholder and could probably use a more unique nix-specific name.
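The link chain above can be tried out in a throwaway directory. The following is only a sketch under stated assumptions: the helper names (`link`, `fake_store_path`, `demo`), the `fakehash` store suffix, and the fake library files are all hypothetical, and real NixOS would do this through tmpfiles or udev rules rather than Python.

```python
import os
import tempfile


def link(target, link_path):
    """Create or atomically replace the symlink link_path -> target."""
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, link_path)  # rename(2) swaps the link itself


def fake_store_path(root, version):
    """Hypothetical stand-in for /nix/store/nvidia-x11-<version>-<hash>."""
    store = os.path.join(root, "nix/store", f"nvidia-x11-{version}-fakehash")
    os.makedirs(os.path.join(store, "lib"))
    with open(os.path.join(store, "lib/libcuda.so.1"), "w") as f:
        f.write(version)  # the "library" just records its own version
    return store


def demo():
    """Return the version seen through /run/opengl-driver at three points."""
    seen = []
    with tempfile.TemporaryDirectory() as root:
        run_nvidia = os.path.join(root, "run/nvidia")
        libcuda = os.path.join(root, "run/opengl-driver/lib/libcuda.so.1")

        def read():
            with open(libcuda) as f:
                return f.read()

        # Boot: register the driver, mark it current, expose libcuda.
        link(fake_store_path(root, "545.29.02"),
             os.path.join(run_nvidia, "545.29.02"))
        link("545.29.02", os.path.join(run_nvidia, "current"))
        link(os.path.join(run_nvidia, "current/lib/libcuda.so.1"), libcuda)
        seen.append(read())

        # A switch registers a newer driver but leaves "current" alone.
        link(fake_store_path(root, "545.29.06"),
             os.path.join(run_nvidia, "545.29.06"))
        seen.append(read())

        # Only a module reload (or reboot) repoints "current".
        link("545.29.06", os.path.join(run_nvidia, "current"))
        seen.append(read())
    return seen
```

Calling `demo()` shows that the path under `/run/opengl-driver` keeps resolving to the old driver until `current` is repointed, which is the property the indirection is meant to provide.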
@Kiskae I was actually thinking in a similar direction! Specifically, we could keep track of every deployed configuration's drivers by exposing the package derivation from nixos/modules/hardware/opengl.nix using systemPackages and pathsToLink:
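{ config, pkgs, ... }:
let
  # All userspace driver packages, merged under a single /drivers prefix
  package = pkgs.buildEnv {
    name = "drivers";
    paths = [ config.hardware.opengl.package ] ++ config.hardware.opengl.extraPackages;
    postBuild = ''
      # Move the merged tree to $out/drivers so that pathsToLink can select it;
      # the dot-directory keeps the glob from matching the destination itself
      mkdir "$out/.drivers"
      mv "$out"/* "$out/.drivers/"
      mv "$out/.drivers" "$out/drivers"
    '';
  };
in
{
  environment.systemPackages = [ package ];
  environment.pathsToLink = [ "/drivers" ];
}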
With this, we'd have access to (NB "booted") /run/booted-system/sw/drivers/lib/libcuda.so and to (NB "current") /run/current-system/sw/drivers/lib/{dri,gbm,...} (whatever breaks alacritty in the example above), which we could symlink to from /run/opengl-driver/lib. At a glance this feels brittle, with too many symlinks, but it's something we could definitely make work.
Observation: with this solution, people switching from hardware.opengl.enable = false to true won't be able to use CUDA apps without a reboot, because /run/booted-system will stay the same, i.e. it won't contain any libcuda.so.
The risk is that by tying the nvidia driver to booted-system, its own dependencies might become outdated compared to the currently active profile. As it currently exists, the only desync happens between kernelspace and userspace, an interface which is generally stable (unless you're named NVIDIA).
That is why I'm considering the symlink indirection, since it will allow updates to the nvidia driver closure as long as the nvidia driver remains on the same version. In addition, you could add a warning in switch-to-configuration if the nvidia driver no longer matches, to alert the user.
EDIT: I seemed to recall seeing a PR related to moving /run/opengl-driver into the system closure; it appears to be https://github.com/NixOS/nixpkgs/pull/158079
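A rough sketch of the kind of mismatch warning suggested above. The assumptions are loud ones: that the loaded module's version can be read from `/sys/module/nvidia/version`, and that the userspace version can be taken from the fully versioned `libcuda.so.<version>` file name, as in the `find` output elsewhere in this thread. All function names are hypothetical and nothing here is an existing NixOS hook.

```python
import re
from pathlib import Path

# Matches fully versioned names such as libcuda.so.535.86.05,
# but not the soname link libcuda.so.1.
_VERSIONED = re.compile(r"^libcuda\.so\.(\d+(?:\.\d+)+)$")


def userspace_driver_version(lib_names):
    """Return the driver version encoded in a libcuda file name, if any."""
    for name in lib_names:
        match = _VERSIONED.match(name)
        if match:
            return match.group(1)
    return None


def loaded_module_version(path="/sys/module/nvidia/version"):
    """Version of the loaded kernel module (path is an assumption)."""
    p = Path(path)
    return p.read_text().strip() if p.exists() else None


def mismatch_warning(kernel_version, lib_names):
    """Return a warning string if the two versions differ, else None."""
    user_version = userspace_driver_version(lib_names)
    if kernel_version is None or user_version is None:
        return None  # nothing loaded or nothing installed: nothing to compare
    if kernel_version == user_version:
        return None
    return (
        f"warning: loaded nvidia kernel module is {kernel_version}, "
        f"but the new userspace driver is {user_version}; "
        "driver libraries may fail until reboot"
    )
```

An activation step could call `mismatch_warning(loaded_module_version(), os.listdir(lib_dir))` and print the result when it is not `None`.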
@Kiskae maybe I didn't make myself clear, but I was trying to suggest that we'd have both /run/current-system/sw/drivers and /run/booted-system/sw/drivers: the former corresponds to the last switched-to configuration, and the latter corresponds to the configuration booted from. Then we'd make all of /run/opengl-driver/lib link to /run/current-system/sw/drivers (equivalent to what we do now), except for libcuda.so (and, maybe, libnvidia-ml.so; as much as would be required to make cuda work), which we'd point at the old/booted system instead.
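To make the split concrete, here is a small sketch of the link plan described above. The directory constants follow the paths named in this comment; the function name `plan_links` and the prefix list are assumptions for illustration only.

```python
import posixpath

BOOTED = "/run/booted-system/sw/drivers/lib"
CURRENT = "/run/current-system/sw/drivers/lib"
DRIVER_LINK = "/run/opengl-driver/lib"

# Libraries that should follow the booted kernel module (assumed list).
PINNED_PREFIXES = ("libcuda.so", "libnvidia-ml.so")


def plan_links(lib_names):
    """Map each /run/opengl-driver/lib entry to the system it should come from."""
    plan = {}
    for name in lib_names:
        # str.startswith accepts a tuple of prefixes
        base = BOOTED if name.startswith(PINNED_PREFIXES) else CURRENT
        plan[posixpath.join(DRIVER_LINK, name)] = posixpath.join(base, name)
    return plan
```

For example, `plan_links(["libcuda.so.1", "libGLX_nvidia.so.0"])` sends `libcuda.so.1` to the booted system and `libGLX_nvidia.so.0` to the current one.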
EDIT: RE: https://github.com/NixOS/nixpkgs/pull/158079
Wonderful! I forgot that wasn't just about naming the nixos option. So we might just merge that PR, and then make the indirection in addOpenGLRunpath.driverLink more granular on NixOS, by redirecting chosen libraries to /run/booted-system/drivers instead of /run/current-system/drivers.
I understood that part; what I'm talking about is the more complex libraries, like the vulkan driver, which depend on other libraries.
So if libnvidia-vulkan* depends on libgbm and is loaded from boot, but something else is linked with a newer version of mesa which has a different version of libgbm, then that can cause issues with dynamic loading.
Mind you, this exact thing would still happen in my solution when the version of the driver changes, but as long as the version remains the same, the nvidia driver closure can be updated in sync with the rest of the system.
Essentially there are two ways the driver can cause issues:
libnvidia-vulkan* ... is loaded from boot
But do we need to load libnvidia-vulkan* from "boot"? Why don't we load it from /run/current-system instead? This seems to have worked so far, and there are only two libraries (cuda, nvml) I know of by now that we might want to load from /run/booted-system.
But do we need to load libnvidia-vulkan* from "boot"?
Yup, same issue as libcuda: almost all driver libraries will crash if the kernel module doesn't match.
❯ find -L /run/opengl-driver/lib -name "lib*535*" -exec fgrep "API mismatch" {} +
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libcuda.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvcuvid.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-allocator.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-cfg.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-eglcore.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-glcore.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-glsi.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-ml.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-opencl.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/vdpau/libvdpau_nvidia.so.535.86.05: binary file matches
The nvidia vulkan driver is actually lib(GLX|EGL)_nvidia, which depends on libnvidia-e?glcore.
There's one more thing we've missed: nixos-rebuild switch doesn't actually break CUDA all that often (I think the heuristic is that libcuda.so needs to be at least as new as the kernel module, and it's usually OK if it's newer), but it currently does break nvidia-smi, which comes from nvidia_x11. E.g. right now I'm seeing:
❯ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 545.29
❯ nix run -f ./. --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" ]; cudaEnableForwardCompat = false; allowUnfree = true; }' -L cudaPackages.saxpy
Start
Runtime version: 11080
Driver version: 12030
Host memory initialized, copying to the device
Scheduled a cudaMemcpy, calling the kernel
Scheduled a kernel call
Max error: 0.000000
but it currently does break nvidia-smi which comes from the nvidia_x11. E.g. right now I'm seeing:
nvidia-smi ignores /run/opengl-driver and links directly to the associated library files at the moment. So that error is coming from the 'newer' libnvidia-ml.so.
Runtime version: 11080 Driver version: 12030
These probably refer to libcuda and libcudart, not the kernel drivers.
However, the most recent update was 545.29.02 -> 545.29.06, so it might very well be that the cuda driver is the same on these releases.
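For reading the two numbers in the saxpy output: CUDA encodes its version as a single integer, 1000 * major + 10 * minor. A minimal decoder follows (the function name is made up for this example):

```python
def decode_cuda_version(value):
    """Split a CUDA version integer (1000 * major + 10 * minor) into (major, minor)."""
    major = value // 1000
    minor = (value % 1000) // 10
    return (major, minor)
```

So "Runtime version: 11080" is CUDA 11.8 and "Driver version: 12030" is CUDA 12.3. Both are CUDA API versions, not NVIDIA driver release numbers such as 545.29.06.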
I know that cuda has official backwards- and forwards-support, but I believe that only exists between libcuda and the toolkit libraries, not between libcuda and the driver itself.
The driver itself definitely has version errors:
[145cab8] NVIDIA: failed to load the NVIDIA kernel module.\n
[145caf0] NVIDIA: could not create the device file %s\n
[145cb20] NVIDIA: could not open the device file %s (%s).\n
[145cb58] NVIDIA: API mismatch: the NVIDIA kernel module has version %s,\n
but this NVIDIA driver component has version %s. Please make\n
sure that the kernel module and all NVIDIA driver components\n
have the same version.\n
[145cc30] NVIDIA: API mismatch: this NVIDIA driver component has version\n
%s, but the NVIDIA kernel module's version does not match.\n
Please make sure that the kernel module and all NVIDIA driver\n
components have the same version.\n
[145cd10] NVIDIA: could not create file for device %u\n
These probably refer to libcuda and libcudart, not the kernel drivers.
libcuda: uh-huh, that's what I meant by the "userspace driver".
nvidia-smi ignores /run/opengl-driver and links directly to the associated library files at the moment.
Right, I recall seeing that. I suppose we should change that. Do you know any reason not to?
I know that cuda has official backwards- and forwards-support, but I believe that only exists between libcuda and the toolkit libraries, not between libcuda and the driver itself
There is some leeway for libcuda and the kernel module to diverge, which is why cudaPackages.cuda_compat exists, but they only test and officially support this for chosen platforms (Jetsons and datacenters). EDIT: I suppose we could expect some software blocks in nvidia_x11 as well.
which is why cudaPackages.cuda_compat exists
I didn't realize that is literally the cuda userspace libraries from a newer driver release. The documentation about compatibility is quite comprehensive: https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatibility-title
Issue description
We're linking both OpenGL and CUDA applications to libGL and to libcuda through an impure path, /run/opengl-driver/lib, deployed by NixOS. This path is substituted on nixos-rebuild switch together with the rest of the system, in which case the userspace drivers may diverge (e.g. after nix flake update or after updating the channels) from the respective kernel modules. In the case of libcuda, we want to keep using the driver from /run/booted-system rather than from /run/current-system, or the user may observe errors like: ... until they reboot.
mesa vs cuda
It may not be sufficient to move /run/opengl-driver/lib to /run/booted-system. From matrix:
how mesa breaks
I'm not sure if this is the kind of error K900 was warning about; I tried approximately the following sequence:
I'll update with a reproducible example later
Notify maintainers
@NixOS/cuda-maintainers