Yeshey / nixos-nvidia-vgpu

NixOS NVIDIA vGPU Module
MIT License
Support for Linux GRID guest drivers on NixOS #5

Open V3ntus opened 3 months ago

V3ntus commented 3 months ago

I'm installing this on a NixOS 24.11 guest under Proxmox which has a proper GRID setup going already. Log file:

warning: Git tree '/home/joe/repos/nixos' is dirty
trace: warning: system.stateVersion is not set, defaulting to 24.11. Read why this matters on https://nixos.org/manual/nixos/stable/options.html#opt-system.stateVersion.
error: builder for '/nix/store/aykbf37pd4fbcs980a8pkr72sa7w4pmy-NVIDIA-Linux-x86_64-535.161.07-merged-vgpu-kvm-patched-6.6.37.drv' failed with exit code 2;
       last 10 log lines:
       > Creating directory NVIDIA-Linux-x86_64-535.161.07-grid
       > Verifying archive integrity... OK
       > Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.161.07...
       > source root is NVIDIA-Linux-x86_64-535.161.07-grid
       > calling 'postUnpack' function hook '_updateSourceDateEpochFromSourceRoot'
       > setting SOURCE_DATE_EPOCH to timestamp 1708216066 of file NVIDIA-Linux-x86_64-535.161.07-grid/nvidia-bug-report.sh
       > Running phase: patchPhase
       > evaling implicit 'postPatch' string hook
       > sed: can't read nvidia-vgpud: No such file or directory
       > /nix/store/5r0df66ikad3xw06azlqvswcvncll8wa-stdenv-linux/setup: line 193: pop_var_context: head of shell_variables not a function context
       For full logs, run 'nix log /nix/store/aykbf37pd4fbcs980a8pkr72sa7w4pmy-NVIDIA-Linux-x86_64-535.161.07-merged-vgpu-kvm-patched-6.6.37.drv'.
error: 1 dependencies of derivation '/nix/store/l4xlzsw1l1brxqm01lv72m3x2fibln8g-etc.drv' failed to build
error (ignored): error: cannot unlink '/tmp/nix-build-mdevctl-1.2.0.drv-0/build/mdevctl-1.2.0.tar.gz/target/release/deps': Directory not empty
error: 1 dependencies of derivation '/nix/store/gsh80pjmblv7z6rydbp1ah8bkzcbrxl2-nixos-system-ai-24.11.20240708.655a58a.drv' failed to build

Config:

{
  hardware.nvidia.vgpu = {
    enable = true;
    # pinKernel = true;

    useMyDriver = {
      enable = true;
      name = "NVIDIA-Linux-x86_64-535.161.07-grid.run";
      sha256 = "sha256-o8dyPjc09cdigYWqkWJG6H/AP71bH65pfwFTS/7V9GM=";
      driver-version = "535.161.07";
      getFromRemote = pkgs.fetchurl {
        name = hardware.nvidia.vgpu.useMyDriver.name;
        url = "https://storage.googleapis.com/nvidia-drivers-us-public/GRID/vGPU16.4/NVIDIA-Linux-x86_64-535.161.07-grid.run";
        sha256 = hardware.nvidia.vgpu.useMyDriver.sha256;
      };
    };
  };
}

V3ntus commented 2 months ago

Instead of using this module, I resorted to a manual nvidiaPackages.mkDriver call for now. I think it works? The driver compiles and installs at least, and I can see the vGPU from inside the QEMU guest.

let
  nvidiaVersion = "535.161.07";
in {
  hardware.nvidia = {
    modesetting.enable = true;

    powerManagement.enable = false;
    powerManagement.finegrained = false;

    nvidiaSettings = false;

    # Explicitly use the GRID drivers from NVIDIA
    package = config.boot.kernelPackages.nvidiaPackages.mkDriver {
      version = nvidiaVersion;
      url = "https://storage.googleapis.com/nvidia-drivers-us-public/GRID/vGPU16.4/NVIDIA-Linux-x86_64-${nvidiaVersion}-grid.run";
      sha256_64bit = "sha256-o8dyPjc09cdigYWqkWJG6H/AP71bH65pfwFTS/7V9GM=";
      useSettings = false;
      usePersistenced = false;
    };
  };
}

V3ntus commented 2 months ago

I had issues with licensing. I run FastAPI-DLS on the Proxmox host and copied over the license it generated, but the guest never left the "unlicensed" state. I believe this is related to this issue, where nvidia-vgpud is missing, since that is a service registered by this repo.

mrzenc commented 2 months ago

This module is not intended for use in virtual machines, only on the host, to enable vGPU capabilities on consumer GPUs. That's why you receive such errors. The manual nvidiaPackages.mkDriver approach you mentioned should be used instead.

Also, the issues with FastAPI-DLS are not related to this module at all. I think you should ask for support in a more appropriate place.

Yeshey commented 2 months ago

@V3ntus hey, sorry for the late response. This module is indeed only meant for the host machine; we should probably make that clearer in the README. In the same vein, it sets up the FastAPI-DLS server, which you already have on Proxmox. I have only ever tried it with Windows guests. I imagine some more work would be needed in a Linux guest for it to pick up this licensing server, but I'm not aware of the process.

Did you manage to make it work with mkDriver? The vgpu community repo has these options that seem to compile the driver for a guest Linux machine:

# driver for linux vm
./patch.sh grid
# driver for linux vm functionally similar to grid one but using consumer .run as input
./patch.sh general

~It shouldn't be that hard to make it compile the driver with those options instead (famous last words), if you still need it I can try to make it compile with those options in a new branch and you could see if it works for you, bc i dont have a setup ready to test that, and maybe an option for guest drivers could be added~

So after talking with @mrzenc, ./patch.sh grid and ./patch.sh general are probably special patches that make the guest work with Q profiles (the profiles that give guests CUDA support). Otherwise, the normal GRID driver on the guest would be the way to go.

Yeshey commented 2 months ago

There doesn't seem to be any module for NixOS to install GRID drivers (drivers for Linux guest) that I can easily find.

I've been talking with @mrzenc and here's some stuff that they've uncovered:

The GRID driver's init scripts do some work that would have to be handled manually on NixOS, beyond just installing the driver with, for example, mkDriver. In particular, they install some services that are absent from the default driver.

├── init-scripts
│   ├── common.sh
│   ├── post-install
│   ├── pre-uninstall
│   ├── systemd
│   │   ├── nvidia-gridd.service
│   │   └── nvidia-topologyd.service
│   ├── sysv
│   │   ├── nvidia-gridd
│   │   └── nvidia-topologyd
│   └── upstart
│       ├── nvidia-gridd.conf
│       └── nvidia-topologyd.conf

Quoting @mrzenc:

I looked into the scripts under init-scripts and here is what I can say:

  1. It detects what service manager is installed. In the case of NixOS, it is always systemd.
  2. It installs nvidia-gridd service and nvidia-gridd binary. It also installs some extra files (gridd.conf.template and grid-proxy-credentials.sh) into /etc/nvidia most likely.
  3. If the system has a 64-bit non-ARM architecture (x86_64), then it also installs nvidia-topologyd service and binary.
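
On NixOS, the steps above might translate into something like the following module. This is only a sketch under assumptions, not a tested setup: `gridPackage` is a hypothetical package that ships a `bin/nvidia-gridd` binary (nothing in this thread confirms that the mkDriver output contains it), and the unit settings, the `/etc/nvidia/gridd.conf` path and the `FeatureType` value are assumed from a typical GRID guest install rather than taken from the init scripts themselves.

```nix
{ config, lib, ... }:

let
  # Hypothetical: a driver package that also provides bin/nvidia-gridd.
  # The stock mkDriver output is not known to include this binary.
  gridPackage = config.hardware.nvidia.package;
in
{
  # Assumed path and contents. The installer ships gridd.conf.template,
  # which would normally be copied to gridd.conf and edited by hand.
  environment.etc."nvidia/gridd.conf".text = ''
    # Assumption: FeatureType=1 requests a vGPU license
    FeatureType=1
  '';

  # Step 2 of the list above: register the nvidia-gridd service with systemd.
  systemd.services.nvidia-gridd = {
    description = "NVIDIA Grid Daemon";
    wantedBy = [ "multi-user.target" ];
    wants = [ "network-online.target" ];
    after = [ "network-online.target" ];
    serviceConfig = {
      Type = "forking"; # assumption about how the daemon starts
      ExecStart = "${lib.getBin gridPackage}/bin/nvidia-gridd"; # hypothetical path
    };
  };
}
```

An nvidia-topologyd service (step 3) could presumably be declared the same way on x86_64.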

It also does some DBus configuration for nvidia-gridd:

<busconfig>
  <type>system</type>
  <policy context="default">
    <allow own="nvidia.grid.server"/>
    <allow own="nvidia.grid.client"/>
    <allow send_requested_reply="true" send_type="method_return"/>
    <allow send_requested_reply="true" send_type="error"/>
    <allow receive_requested_reply="true" receive_type="method_return"/>
    <allow receive_requested_reply="true" receive_type="error"/>
    <allow send_destination="nvidia.grid.server"/>
    <allow receive_sender="nvidia.grid.client"/>
  </policy>
</busconfig>
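
One possible way to carry this policy over to NixOS is `services.dbus.packages`, which picks up policy files from a package's `share/dbus-1/system.d` directory. The sketch below is untested; the file name `nvidia-gridd.conf` and the variable name `gridDbusPolicy` are made up for illustration, and the policy body is copied from the snippet above with a standard DOCTYPE header added.

```nix
{ pkgs, ... }:

let
  # Hypothetical file name; policy content taken from the snippet above.
  gridDbusPolicy = pkgs.writeTextDir "share/dbus-1/system.d/nvidia-gridd.conf" ''
    <!DOCTYPE busconfig PUBLIC "-//freedesktop//DTD D-BUS Bus Configuration 1.0//EN"
     "http://www.freedesktop.org/standards/dbus/1.0/busconfig.dtd">
    <busconfig>
      <type>system</type>
      <policy context="default">
        <allow own="nvidia.grid.server"/>
        <allow own="nvidia.grid.client"/>
        <allow send_requested_reply="true" send_type="method_return"/>
        <allow send_requested_reply="true" send_type="error"/>
        <allow receive_requested_reply="true" receive_type="method_return"/>
        <allow receive_requested_reply="true" receive_type="error"/>
        <allow send_destination="nvidia.grid.server"/>
        <allow receive_sender="nvidia.grid.client"/>
      </policy>
    </busconfig>
  '';
in
{
  # Adds the policy file to the system bus configuration.
  services.dbus.packages = [ gridDbusPolicy ];
}
```

Whether nvidia-gridd then works on NixOS with only this policy in place has not been verified in this thread.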

V3ntus commented 2 months ago

The mkDriver approach did not work. As you pointed out, modifications would be needed because the GRID drivers include extra required components.

Yeshey commented 2 months ago

I'm changing this issue's title to turn it into a discussion on guest drivers for NixOS guests, as there doesn't seem to be any easy way to install the guest drivers on NixOS right now. I think those efforts would probably be better off in their own repository though, as they are a little outside the vGPU unlocking shenanigans that this repo is about and don't really depend on it. There doesn't seem to be much developer interest in it right now, but if that changes I'll mention it in the Linux guest section of the README.