libcuda.so: driver mismatch on nixos-rebuild switch #255070

Open SomeoneSerge opened 1 year ago

SomeoneSerge commented 1 year ago

Issue description

We're linking both OpenGL and CUDA applications to libGL and libcuda through an impure path, /run/opengl-driver/lib, deployed by NixOS. This path is substituted on nixos-rebuild switch together with the rest of the system, at which point the userspace drivers may diverge from the respective kernel modules (e.g. after nix flake update or after updating the channels). In the case of libcuda, we want to keep using the driver from /run/booted-system rather than from /run/current-system, or the user may observe errors like:

❯ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
...
❯ python
>>> import torch
>>> torch.cuda.is_available()
CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at /build/source/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0

...until they reboot
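
For context, the NixOS graphics module deploys that impure path roughly like this (a simplified paraphrase of nixos/modules/hardware/opengl.nix, not a verbatim copy):

{ config, pkgs, ... }:
let
  # merged userspace driver environment: hardware.opengl.package (mesa by
  # default) plus whatever hardware.opengl.extraPackages adds (e.g. nvidia_x11)
  package = pkgs.buildEnv {
    name = "opengl-drivers";
    paths = [ config.hardware.opengl.package ] ++ config.hardware.opengl.extraPackages;
  };
in
{
  # the whole directory is one symlink into /nix/store, so every
  # nixos-rebuild switch atomically swaps the userspace drivers while the
  # kernel modules from the booted generation keep running
  systemd.tmpfiles.rules = [
    "L+ /run/opengl-driver - - - - ${package}"
  ];
}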

mesa vs cuda

It may not be sufficient to move /run/opengl-driver/lib to /run/booted-system. From matrix:

K900 ⚡️: Mesa needs libgbm to match the driver. And Nvidia needs the driver to match the kernelspace. But Mesa can't have a 1:1 compatibility with the kernelspace, because they don't own it.
Someone (UTC+3): "the driver" meaning the userspace bit?
K900 ⚡️: Yes. Actually, if we ever figure out dynamic GBM, we could have it use the booted drivers for everything. But then it would just as easily be able to use the new driver. So it's like... still weird.

how mesa breaks

I'm not sure if this is the kind of error K900 was warning about; I tried approximately the following sequence:

$ nix flake update
$ nixos-rebuild switch
$ # now /run/current-system and /run/booted-system are different,
$ # in particular nvidia-smi is broken and complains about the driver mismatch,
$ # but OpenGL apps still work correctly, e.g.:
$ kitty
$ # Now let's restore the old /run/opengl-driver/lib/libcuda.so:
$ sudo /run/booted-system/activate
$ # ...after which CUDA apps work again:
$ nvidia-smi
$ # ...but OpenGL apps are broken:
$ kitty
[258 18:52:45.973274] [glfw error 65543]: GLX: Failed to create context: BadValue (integer parameter out of range for operation)
[258 18:52:45.973294] Failed to create GLFW temp window! This usually happens because of old/broken OpenGL drivers. kitty requires working OpenGL 3.3 drivers.
$ # Now whatever the difference between `activate` and `switch-to-configuration`, recover the OpenGL apps too:
$ sudo /run/booted-system/bin/switch-to-configuration switch

I'll update with a reproducible example later

Notify maintainers

@NixOS/cuda-maintainers

Kiskae commented 1 year ago

I've actually been looking for a solution to the "update-causes-version-mismatch" problem, to make it possible to backport nvidia driver updates.

What I've been considering is a variant of the /etc/static link dance in combination with tmpfiles rules to link the active userspace library to the loaded kernel module.

Like this:

/run/opengl-driver/lib/libcuda.so.1 -> /run/nvidia/current/lib/libcuda.so.1
/run/nvidia/current -> /run/nvidia/<version>
/run/nvidia/<version> -> /nix/store/nvidia-x11-<version>-<hash>

What this will require is:

  1. A way to rewrite symlinks to a different base path, turning /nix/store/nvidia-x11-<version>-<hash> into /run/nvidia/current. This mirrored derivation then gets added to hardware.opengl.extraPackages.
  2. tmpfiles rules that set up the /run/nvidia symlinks, both the versioned one and current, at boot (see the sketch after this list). This can be expanded into a udev rule to support runtime upgrades of the nvidia driver.
  3. A way to register the store path behind /run/nvidia/current as a GC root so it doesn't get garbage collected.

Note that /run/nvidia is a placeholder and could probably use a more unique nix-specific name.
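
A minimal sketch of what item 2's boot-time links could look like as a NixOS module. Nothing here exists in nixpkgs today: /run/nvidia is the placeholder layout above, and config.hardware.nvidia.package is assumed to be the nvidia_x11 build; pinning current to the booted version across activations would need more care than shown, since NixOS typically re-runs tmpfiles on switch.

{ config, ... }:
let
  nvidia = config.hardware.nvidia.package;   # the nvidia_x11 build in the system closure
  version = nvidia.version;                  # e.g. "535.86.05"
in
{
  # tmpfiles.d(5) syntax: "d" creates the directory, "L+" (re)creates symlinks
  systemd.tmpfiles.rules = [
    "d /run/nvidia 0755 root root -"
    "L+ /run/nvidia/${version} - - - - ${nvidia}"
    "L+ /run/nvidia/current - - - - /run/nvidia/${version}"
  ];
}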

SomeoneSerge commented 1 year ago

@Kiskae I was actually thinking in a similar direction! Specifically, we could keep track of every deployed configuration's drivers by exposing the package derivation from nixos/modules/hardware/opengl.nix using systemPackages and pathsToLink:

{ config, pkgs, ... }:
let
  package = pkgs.buildEnv {
    name = "drivers";
    paths = [ config.hardware.opengl.package ] ++ config.hardware.opengl.extraPackages;
    postBuild = ''
      # nest everything under drivers/ so that environment.pathsToLink below
      # exposes it as /run/{current,booted}-system/sw/drivers
      cd "$out"
      mkdir .drivers
      mv * .drivers/
      mv .drivers drivers
    '';
  };
in
{
  environment.systemPackages = [ package ];
  environment.pathsToLink = [ "/drivers" ];
}

With this, we'd have access both to (NB "booted") /run/booted-system/sw/drivers/lib/libcuda.so and to (NB "current") /run/current-system/sw/drivers/lib/{dri,gbm,...} (whatever breaks kitty in the example above), which we could symlink to from /run/opengl-driver/lib. At a glance this feels brittle to me, with too many symlinks, but it's something we could definitely make work.

Observation: with this solution, people switching from hardware.opengl.enable = false to true won't be able to use CUDA apps without a reboot, because /run/booted-system will stay the same, i.e. it won't contain any libcuda.so.

Kiskae commented 1 year ago

The risk is that by tying the nvidia driver to booted-system, its own dependencies might become outdated compared to the current active profile. As it currently exists, the only desync happens between kernelspace and userspace, which is generally stable (unless you're named NVIDIA).

That is why I'm considering the symlink indirection, since it allows updating the nvidia driver's closure as long as the nvidia driver itself remains on the same version. In addition, you could add a warning in switch-to-configuration to alert the user if the nvidia driver no longer matches.
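
A hypothetical sketch of such a warning, written as an activation snippet rather than a change to switch-to-configuration itself; the version extraction from /proc/driver/nvidia/version and the use of config.hardware.nvidia.package are assumptions, not existing behaviour.

{ config, ... }:
{
  # hypothetical: compare the loaded kernel module's version against the
  # userspace driver in the new configuration and warn on mismatch
  system.activationScripts.nvidiaDriverMismatchWarning = ''
    if [ -r /proc/driver/nvidia/version ]; then
      loaded=$(grep -o '[0-9]\+\.[0-9.]\+' /proc/driver/nvidia/version | head -n1)
      configured="${config.hardware.nvidia.package.version}"
      if [ -n "$loaded" ] && [ "$loaded" != "$configured" ]; then
        echo "warning: NVIDIA kernel module $loaded vs userspace driver $configured; CUDA/NVML may fail until reboot" >&2
      fi
    fi
  '';
}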

EDIT: I seemed to recall seeing a PR related to moving /run/opengl-driver into the system closure; it appears to be https://github.com/NixOS/nixpkgs/pull/158079

SomeoneSerge commented 1 year ago

@Kiskae maybe I didn't make myself clear, but I was trying to suggest that we'd have both /run/current-system/sw/drivers and /run/booted-system/sw/drivers: the former corresponds to the last switched-to configuration, the latter to the configuration we booted from. Then we'd make all of /run/opengl-driver/lib link to /run/current-system/sw/drivers (equivalent to what we do now), except for libcuda.so (and maybe libnvidia-ml.so, as much as would be required to make CUDA work), which we'd point at the old/booted system instead.

EDIT: RE: https://github.com/NixOS/nixpkgs/pull/158079

Wonderful! I forgot that PR wasn't just about renaming the nixos option. So we might just merge it, and then make the indirection in addOpenGLRunpath.driverLink more granular on NixOS, redirecting chosen libraries to /run/booted-system/drivers instead of /run/current-system/drivers.
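
A very rough sketch of that split, reusing the drivers environment from the snippet above. It deliberately ignores the existing tmpfiles rule that makes /run/opengl-driver a single symlink, and the library list is only illustrative, so treat it as an assumption-laden illustration rather than a working module.

{
  # hypothetical: populate /run/opengl-driver/lib entry by entry, taking
  # everything from the current system except the libraries that must
  # match the booted kernel module
  system.activationScripts.openglDriverSplit = ''
    current=/run/current-system/sw/drivers/lib
    booted=/run/booted-system/sw/drivers/lib
    if [ -d "$current" ]; then
      mkdir -p /run/opengl-driver/lib
      for f in "$current"/*; do
        ln -sfn "$f" /run/opengl-driver/lib/
      done
      for lib in libcuda.so.1 libnvidia-ml.so.1; do
        if [ -e "$booted/$lib" ]; then
          ln -sfn "$booted/$lib" "/run/opengl-driver/lib/$lib"
        fi
      done
    fi
  '';
}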

Kiskae commented 1 year ago

I understood that part; what I'm talking about is the more complex libraries, like the vulkan driver, which depend on other libraries. So if libnvidia-vulkan* depends on libgbm and is loaded from the booted system, but something else is linked against a newer version of mesa with a different libgbm, then that can cause issues with dynamic loading.

Mind you, this exact thing would still happen in my solution when the driver version changes, but as long as the version remains the same, the nvidia driver closure can be updated in sync with the rest of the system.

Essentially there are two ways the driver can cause issues:

  1. If the driver is newer than the kernel module, it stops working.
  2. If the driver closure is older than the system closure, it might include conflicting dynamic dependencies.

SomeoneSerge commented 1 year ago

libnvidia-vulkan* ... is loaded from boot

But do we need to load libnvidia-vulkan* from "boot"? Why don't we load it from /run/current-system instead? That seems to have worked so far, and there are only two libraries (cuda, nvml) I know of so far that we might want to load from /run/booted-system.

Kiskae commented 1 year ago

But do we need to load libnvidia-vulkan* from "boot"?

Yup, same issue as libcuda: almost all driver libraries will crash if the kernel module doesn't match.

❯ find -L /run/opengl-driver/lib -name "lib*535*" -exec fgrep "API mismatch" {} +
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libcuda.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvcuvid.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-allocator.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-cfg.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-eglcore.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-glcore.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-glsi.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-ml.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-opencl.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/vdpau/libvdpau_nvidia.so.535.86.05: binary file matches

The nvidia vulkan driver is actually lib(GLX|EGL)_nvidia, which depends on libnvidia-e?glcore.

Atemu commented 11 months ago

Related: https://github.com/NixOS/nixpkgs/issues/269419

SomeoneSerge commented 11 months ago

There's one more thing we've missed: nixos-rebuild switch doesn't actually break CUDA all that often (I think the heuristic is that libcuda.so needs to be at least as new as the kernel module, and it's usually OK if it's newer), but it currently does break nvidia-smi, which comes from nvidia_x11. E.g. right now I'm seeing:

❯ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 545.29
❯ nix run -f ./. --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" ]; cudaEnableForwardCompat = false; allowUnfree = true; }' -L cudaPackages.saxpy
Start
Runtime version: 11080
Driver version: 12030
Host memory initialized, copying to the device
Scheduled a cudaMemcpy, calling the kernel
Scheduled a kernel call
Max error: 0.000000

Kiskae commented 11 months ago

but it currently does break nvidia-smi which comes from the nvidia_x11. E.g. right now I'm seeing:

nvidia-smi ignores /run/opengl-driver and links directly to the associated library files at the moment. So that error is coming from the 'newer' libnvidia-ml.so.

Runtime version: 11080
Driver version: 12030

These probably refer to libcuda and libcudart, not the kernel driver. However, the most recent update was 545.29.02 -> 545.29.06, so it might very well be that the cuda driver is the same in these releases.

I know that cuda has official backwards- and forwards-compatibility support, but I believe that only exists between libcuda and the toolkit libraries, not between libcuda and the driver itself.

The driver itself definitely performs version checks; these error strings are embedded in it:

  [145cab8]  NVIDIA: failed to load the NVIDIA kernel module.\n
  [145caf0]  NVIDIA: could not create the device file %s\n
  [145cb20]  NVIDIA: could not open the device file %s (%s).\n
  [145cb58]  NVIDIA: API mismatch: the NVIDIA kernel module has version %s,\n
            but this NVIDIA driver component has version %s.  Please make\n
            sure that the kernel module and all NVIDIA driver components\n
            have the same version.\n
  [145cc30]  NVIDIA: API mismatch: this NVIDIA driver component has version\n
            %s, but the NVIDIA kernel module's version does not match.\n
            Please make sure that the kernel module and all NVIDIA driver\n
            components have the same version.\n
  [145cd10]  NVIDIA: could not create file for device %u\n

SomeoneSerge commented 11 months ago

These probably refer to libcuda and libcudart, not the kernel drivers.

Yes: https://github.com/NixOS/nixpkgs/blob/1a6f704d3a05efba4f1f55f69f4bab5c188f8cc4/pkgs/development/cuda-modules/saxpy/saxpy.cu#L27-L31

libcuda

Uh-huh, that's what I meant by the "userspace driver"

nvidia-smi ignores /run/opengl-driver and links directly to the associated library files at the moment.

Right, I recall seeing that. I suppose we should change that. Do you know any reason not to?

I know that cuda has official backwards- and forwards-support, but I believe that only exists between libcuda and the toolkit libraries, not between libcuda and the driver itself

There is some leeway for libcuda and the kernel module to diverge, which is why cudaPackages.cuda_compat exists, but they only test and officially support this for chosen platforms (Jetsons and datacenters). EDIT: I suppose we could expect some software blocks in nvidia_x11 as well.

Kiskae commented 11 months ago

which is why cudaPackages.cuda_compat exists

I didn't realize that it is literally the cuda userspace libraries from a newer driver release. The documentation about compatibility is quite comprehensive: https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatibility-title