NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

WSL NixOS `cdi generate` Error: failed to initialize dxcore context #452

Open Samiser opened 2 months ago

Samiser commented 2 months ago
❯ nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.15.0-rc.3

i'm attempting to use nvidia-ctk to generate a CDI spec in WSL running NixOS, but am getting the following error:

❯ nvidia-ctk cdi generate --nvidia-ctk-path /run/current-system/sw/bin/nvidia-ctk --ldconfig-path /run/current-system/sw/bin/ldconfig --mode wsl
INFO[0000] Selecting /dev/dxg as /dev/dxg
ERRO[0000] failed to generate CDI spec: failed to create edits common for entities: failed to create discoverer for WSL driver: failed to initialize dxcore: failed to initialize dxcore context

if i generate the CDI spec on a different VM and use that config directly (only changing the location of nvidia-ctk) then nvidia-ctk successfully finds the device and i can use it in containers:

nvidia-container-toolkit.json:

```
{
  "cdiVersion": "0.3.0",
  "containerEdits": {
    "hooks": [
      {
        "args": [
          "nvidia-ctk",
          "hook",
          "create-symlinks",
          "--link",
          "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/nvidia-smi::/usr/bin/nvidia-smi"
        ],
        "hookName": "createContainer",
        "path": "/run/current-system/sw/bin/nvidia-ctk"
      },
      {
        "args": [
          "nvidia-ctk",
          "hook",
          "update-ldcache",
          "--folder",
          "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69",
          "--folder",
          "/usr/lib/wsl/lib"
        ],
        "hookName": "createContainer",
        "path": "/run/current-system/sw/bin/nvidia-ctk"
      }
    ],
    "mounts": [
      {
        "containerPath": "/usr/lib/wsl/lib/libdxcore.so",
        "hostPath": "/usr/lib/wsl/lib/libdxcore.so",
        "options": [ "ro", "nosuid", "nodev", "bind" ]
      },
      {
        "containerPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libcuda.so.1.1",
        "hostPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libcuda.so.1.1",
        "options": [ "ro", "nosuid", "nodev", "bind" ]
      },
      {
        "containerPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libcuda_loader.so",
        "hostPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libcuda_loader.so",
        "options": [ "ro", "nosuid", "nodev", "bind" ]
      },
      {
        "containerPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libnvidia-ml.so.1",
        "hostPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libnvidia-ml.so.1",
        "options": [ "ro", "nosuid", "nodev", "bind" ]
      },
      {
        "containerPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libnvidia-ml_loader.so",
        "hostPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libnvidia-ml_loader.so",
        "options": [ "ro", "nosuid", "nodev", "bind" ]
      },
      {
        "containerPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libnvidia-ptxjitcompiler.so.1",
        "hostPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libnvidia-ptxjitcompiler.so.1",
        "options": [ "ro", "nosuid", "nodev", "bind" ]
      },
      {
        "containerPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/nvcubins.bin",
        "hostPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/nvcubins.bin",
        "options": [ "ro", "nosuid", "nodev", "bind" ]
      },
      {
        "containerPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/nvidia-smi",
        "hostPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/nvidia-smi",
        "options": [ "ro", "nosuid", "nodev", "bind" ]
      }
    ]
  },
  "devices": [
    {
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/dxg" }
        ]
      },
      "name": "all"
    }
  ],
  "kind": "nvidia.com/gpu"
}
```

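For readers matching this spec against the `nvidia-ctk cdi list` output further down: the reported name `nvidia.com/gpu=all` looks like the spec's `kind` joined to the `name` of each entry under `devices`. Here is a minimal Python sketch of that mapping, assuming a JSON-formatted spec. `qualified_device_names` is a hypothetical helper written for illustration and is not part of the toolkit.

```python
import json


def qualified_device_names(spec_text: str) -> list[str]:
    """Return '<kind>=<name>' for every device in a CDI spec given as JSON text."""
    spec = json.loads(spec_text)
    kind = spec["kind"]
    return [f"{kind}={device['name']}" for device in spec.get("devices", [])]


# Trimmed-down copy of the spec above: only the fields this helper reads.
EXAMPLE_SPEC = """
{
  "cdiVersion": "0.3.0",
  "devices": [
    {"containerEdits": {"deviceNodes": [{"path": "/dev/dxg"}]}, "name": "all"}
  ],
  "kind": "nvidia.com/gpu"
}
"""

if __name__ == "__main__":
    for name in qualified_device_names(EXAMPLE_SPEC):
        print(name)
```

A spec that parses but yields no names here would have an empty or missing `devices` list.
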
i've also tried populating every other flag with the locations of the files in /usr/lib/wsl/, but that didn't make a difference. i assume that's handled by --mode wsl

here's the relevant nix config if it helps (omitting the nixos-wsl import section):

{
  wsl.enable = true;

  environment.systemPackages = with pkgs; [ nvidia-container-toolkit ];

  virtualisation.podman.enable = true;
  virtualisation.containers.cdi.dynamic.nvidia.enable = true;

  programs.nix-ld.enable = true;

  environment.variables = lib.mkForce {
    NIX_LD_LIBRARY_PATH = "/usr/lib/wsl/lib/";
    NIX_LD = "${pkgs.glibc}/lib/ld-linux-x86-64.so.2";
  };
}

and here's the gpu working with the manual config:

❯ nvidia-ctk cdi list
INFO[0000] Found 1 CDI devices
nvidia.com/gpu=all

❯ podman run --device nvidia.com/gpu=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -benchmark -gpu
--- cut ---
> Compute 7.5 CUDA device: [NVIDIA GeForce RTX 2070]
36864 bodies, total time for 10 iterations: 65.098 ms
= 208.756 billion interactions per second
= 4175.121 single-precision GFLOP/s at 20 flops per interaction

let me know if there's any more information i can provide!

elezar commented 2 months ago

@Samiser could you run:

nvidia-ctk --debug cdi generate

I assume that the utility does not find libdxcore.so by itself, meaning that the mode needs to be explicitly set.

Note that we do use dlopen to load libdxcore.so, so you could try setting LD_PRELOAD=${PATH_TO_LIB}/libdxcore.so explicitly. This should help both the autodetection and the generation.
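
Whether a plain `dlopen` of `libdxcore.so` succeeds in a given shell can be checked outside of nvidia-ctk. The following is a rough Python sketch, not the toolkit's own code: `ctypes.CDLL` calls `dlopen` on Linux, `try_dlopen` is a hypothetical helper, and the absolute path is the `hostPath` from the CDI spec earlier in the thread. A name without a slash goes through the dynamic loader's search order, while a name containing a slash is opened as a path.

```python
import ctypes


def try_dlopen(name: str) -> tuple[bool, str]:
    """Attempt to load a shared library; return (loaded, error message)."""
    try:
        ctypes.CDLL(name)  # wraps dlopen() on Linux
    except OSError as err:
        return False, str(err)
    return True, ""


if __name__ == "__main__":
    # Bare name first (uses the loader search path), then the absolute path.
    for candidate in ("libdxcore.so", "/usr/lib/wsl/lib/libdxcore.so"):
        loaded, error = try_dlopen(candidate)
        print(f"{candidate}: {'loaded' if loaded else 'failed: ' + error}")
```

If the bare name fails while the absolute path loads, the library exists but is not on the loader's search path. The Python interpreter and the nvidia-ctk binary may be linked with different search paths, so treat the result as an indication only.
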

I would have to look at how to make this more robust.

Samiser commented 2 months ago

sure, here is the debug output both with --mode wsl and without:

~
❯ nvidia-ctk --debug cdi generate --mode wsl
DEBU[0000] Locating NVIDIA Container Toolkit CLI as nvidia-ctk
DEBU[0000] Locating "nvidia-ctk" in [/run/wrappers/bin /home/sam/.nix-profile/bin /nix/profile/bin /home/sam/.local/state/nix/profile/bin /etc/profiles/per-user/sam/bin /nix/var/nix/profiles/default/bin /run/current-system/sw/bin /home/sam/bin /usr/local/sbin /usr/local/bin /usr/sbin /usr/bin /sbin /bin]
DEBU[0000] Checking candidate '/run/current-system/sw/bin/nvidia-ctk'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Found nvidia-ctk candidates: [/run/current-system/sw/bin/nvidia-ctk]
DEBU[0000] Using NVIDIA Container Toolkit CLI path nvidia-ctk
DEBU[0000] Locating /dev/dxg
DEBU[0000] Locating "/dev/dxg" in [/ /dev]
DEBU[0000] Checking candidate '/dev/dxg'
DEBU[0000] Located /dev/dxg as [/dev/dxg]
INFO[0000] Selecting /dev/dxg as /dev/dxg
ERRO[0000] failed to generate CDI spec: failed to create edits common for entities: failed to create discoverer for WSL driver: failed to initialize dxcore: failed to initialize dxcore context

~
❯ nvidia-ctk --debug cdi generate
DEBU[0000] Locating NVIDIA Container Toolkit CLI as nvidia-ctk
DEBU[0000] Locating "nvidia-ctk" in [/run/wrappers/bin /home/sam/.nix-profile/bin /nix/profile/bin /home/sam/.local/state/nix/profile/bin /etc/profiles/per-user/sam/bin /nix/var/nix/profiles/default/bin /run/current-system/sw/bin /home/sam/bin /usr/local/sbin /usr/local/bin /usr/sbin /usr/bin /sbin /bin]
DEBU[0000] Checking candidate '/run/current-system/sw/bin/nvidia-ctk'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Found nvidia-ctk candidates: [/run/current-system/sw/bin/nvidia-ctk]
DEBU[0000] Using NVIDIA Container Toolkit CLI path nvidia-ctk
DEBU[0000] Is WSL-based system? false: could not load DXCore library: libdxcore.so: cannot open shared object file: No such file or directory
DEBU[0000] Is NVML-based system? false: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
DEBU[0000] Is Tegra-based system? false: /sys/devices/soc0/family file not found
INFO[0000] Auto-detected mode as "nvml"
ERRO[0000] failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_LIBRARY_NOT_FOUND
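
The debug output above suggests that auto-detection probes for WSL, then NVML, then Tegra, and reports "nvml" when none of the probes succeeds. The Python sketch below models only that observed behaviour. The order and the fallback are inferred from this log rather than from the toolkit source, and `resolve_mode` and the stand-in probes are hypothetical names.

```python
from typing import Callable, Sequence

# A probe reports whether its platform was detected, plus a human-readable reason.
Probe = Callable[[], tuple[bool, str]]


def resolve_mode(
    probes: Sequence[tuple[str, Probe]], fallback: str = "nvml"
) -> tuple[str, list[str]]:
    """Return the first mode whose probe succeeds, else `fallback`, plus a trace."""
    trace: list[str] = []
    for mode, probe in probes:
        detected, reason = probe()
        trace.append(f"{mode}: {'detected' if detected else 'not detected'} ({reason})")
        if detected:
            return mode, trace
    return fallback, trace


# Stand-in probes reproducing the situation in the log: nothing is detected.
FAILING_PROBES = [
    ("wsl", lambda: (False, "could not load DXCore library")),
    ("nvml", lambda: (False, "could not load NVML library")),
    ("tegra", lambda: (False, "/sys/devices/soc0/family file not found")),
]

if __name__ == "__main__":
    mode, trace = resolve_mode(FAILING_PROBES)
    print("\n".join(trace))
    print(f'Auto-detected mode as "{mode}"')
```

Under this model, the NVML error in the last log line is a consequence of the fallback: "nvml" is selected even though its own probe failed, so initialization fails right afterwards.
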

also, setting LD_PRELOAD doesn't seem to help:

~
❯ ls /usr/lib/wsl/lib/libdxcore.so
/usr/lib/wsl/lib/libdxcore.so

~
❯ LD_PRELOAD=/usr/lib/wsl/lib/libdxcore.so nvidia-ctk --debug cdi generate --mode wsl
DEBU[0000] Locating NVIDIA Container Toolkit CLI as nvidia-ctk
DEBU[0000] Locating "nvidia-ctk" in [/run/wrappers/bin /home/sam/.nix-profile/bin /nix/profile/bin /home/sam/.local/state/nix/profile/bin /etc/profiles/per-user/sam/bin /nix/var/nix/profiles/default/bin /run/current-system/sw/bin /home/sam/bin /usr/local/sbin /usr/local/bin /usr/sbin /usr/bin /sbin /bin]
DEBU[0000] Checking candidate '/run/current-system/sw/bin/nvidia-ctk'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Found nvidia-ctk candidates: [/run/current-system/sw/bin/nvidia-ctk]
DEBU[0000] Using NVIDIA Container Toolkit CLI path nvidia-ctk
DEBU[0000] Locating /dev/dxg
DEBU[0000] Locating "/dev/dxg" in [/ /dev]
DEBU[0000] Checking candidate '/dev/dxg'
DEBU[0000] Located /dev/dxg as [/dev/dxg]
INFO[0000] Selecting /dev/dxg as /dev/dxg
ERRO[0000] failed to generate CDI spec: failed to create edits common for entities: failed to create discoverer for WSL driver: failed to initialize dxcore: failed to initialize dxcore context

also using the flag --library-search-path doesn't seem to help either:

~
❯ nvidia-ctk --debug cdi generate --mode wsl --library-search-path /usr/lib/wsl/lib/
DEBU[0000] Locating NVIDIA Container Toolkit CLI as nvidia-ctk
DEBU[0000] Locating "nvidia-ctk" in [/run/wrappers/bin /home/sam/.nix-profile/bin /nix/profile/bin /home/sam/.local/state/nix/profile/bin /etc/profiles/per-user/sam/bin /nix/var/nix/profiles/default/bin /run/current-system/sw/bin /home/sam/bin /usr/local/sbin /usr/local/bin /usr/sbin /usr/bin /sbin /bin]
DEBU[0000] Checking candidate '/run/current-system/sw/bin/nvidia-ctk'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Found nvidia-ctk candidates: [/run/current-system/sw/bin/nvidia-ctk]
DEBU[0000] Using NVIDIA Container Toolkit CLI path nvidia-ctk
DEBU[0000] Locating /dev/dxg
DEBU[0000] Locating "/dev/dxg" in [/ /dev]
DEBU[0000] Checking candidate '/dev/dxg'
DEBU[0000] Located /dev/dxg as [/dev/dxg]
INFO[0000] Selecting /dev/dxg as /dev/dxg
ERRO[0000] failed to generate CDI spec: failed to create edits common for entities: failed to create discoverer for WSL driver: failed to initialize dxcore: failed to initialize dxcore context

loicreynier commented 1 month ago

As mentioned in https://github.com/NixOS/nixpkgs/pull/312253 and https://github.com/nix-community/NixOS-WSL/issues/433, you should either use the wsl.useWindowsDriver option from NixOS-WSL or use LD_LIBRARY_PATH=/usr/lib/wsl/lib when generating the CDI.
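
For anyone scripting the `LD_LIBRARY_PATH` variant, here is a small Python sketch of the invocation. It only uses flags that appear in this thread (`--mode wsl` and `--output`). The helper names are hypothetical, and the sketch assumes `nvidia-ctk` is on `PATH`. The variable is set in the child process environment because the dynamic loader reads it when the process starts.

```python
import os
import subprocess
from typing import Mapping

WSL_LIB_DIR = "/usr/lib/wsl/lib"


def env_with_library_path(extra_dir: str, base: Mapping[str, str]) -> dict[str, str]:
    """Copy `base` and prepend `extra_dir` to LD_LIBRARY_PATH."""
    env = dict(base)
    current = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = f"{extra_dir}:{current}" if current else extra_dir
    return env


def cdi_generate_command(output: str) -> list[str]:
    """Build the `nvidia-ctk cdi generate` command line for WSL mode."""
    return ["nvidia-ctk", "cdi", "generate", "--mode", "wsl", f"--output={output}"]


def generate_cdi_spec(output: str) -> subprocess.CompletedProcess:
    """Run the generator with the WSL library directory visible to the loader."""
    env = env_with_library_path(WSL_LIB_DIR, os.environ)
    return subprocess.run(cdi_generate_command(output), env=env, check=False)


if __name__ == "__main__":
    # Print what would be run; call generate_cdi_spec() to actually execute it.
    env = env_with_library_path(WSL_LIB_DIR, os.environ)
    print("LD_LIBRARY_PATH=" + env["LD_LIBRARY_PATH"])
    print(" ".join(cdi_generate_command("nvidia.yaml")))
```

Writing the spec to a system location such as `/etc/cdi` will usually require root.
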

scruel commented 1 week ago
➜ sudo LD_LIBRARY_PATH=/usr/lib/wsl/lib nvidia-ctk --debug cdi generate --mode wsl  --output=/etc/cdi/nvidia.yaml
DEBU[0000] Locating NVIDIA Container Toolkit CLI as nvidia-ctk
DEBU[0000] Checking candidate '/usr/bin/nvidia-ctk'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Found nvidia-ctk candidates: [/usr/bin/nvidia-ctk]
DEBU[0000] Using NVIDIA Container Toolkit CLI path /usr/bin/nvidia-ctk
DEBU[0000] Inferred output format as "yaml" from output file name
DEBU[0000] Locating /dev/dxg
DEBU[0000] Checking candidate '/dev/dxg'
DEBU[0000] Located /dev/dxg as [/dev/dxg]
INFO[0000] Selecting /dev/dxg as /dev/dxg
INFO[0000] Using WSL driver store paths: [/usr/lib/wsl/drivers/iigd_dch.inf_amd64_73655f941b1dd71f /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6]
WARN[0000] Found multiple driver store paths: [/usr/lib/wsl/drivers/iigd_dch.inf_amd64_73655f941b1dd71f /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6]
DEBU[0000] Using specified NVIDIA Container Toolkit CLI path /usr/bin/nvidia-ctk
DEBU[0000] Locating libcuda.so.1.1
DEBU[0000] Checking candidate '/usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libcuda.so.1.1'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Located libcuda.so.1.1 as [/usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libcuda.so.1.1]
INFO[0000] Selecting /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libcuda.so.1.1 as /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libcuda.so.1.1
DEBU[0000] Locating libcuda_loader.so
DEBU[0000] Checking candidate '/usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libcuda_loader.so'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Located libcuda_loader.so as [/usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libcuda_loader.so]
INFO[0000] Selecting /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libcuda_loader.so as /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libcuda_loader.so
DEBU[0000] Locating libnvidia-ptxjitcompiler.so.1
DEBU[0000] Checking candidate '/usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libnvidia-ptxjitcompiler.so.1'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Located libnvidia-ptxjitcompiler.so.1 as [/usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libnvidia-ptxjitcompiler.so.1]
INFO[0000] Selecting /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libnvidia-ptxjitcompiler.so.1 as /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libnvidia-ptxjitcompiler.so.1
DEBU[0000] Locating libnvidia-ml.so.1
DEBU[0000] Checking candidate '/usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libnvidia-ml.so.1'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Located libnvidia-ml.so.1 as [/usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libnvidia-ml.so.1]
INFO[0000] Selecting /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libnvidia-ml.so.1 as /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libnvidia-ml.so.1
DEBU[0000] Locating libnvidia-ml_loader.so
DEBU[0000] Checking candidate '/usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libnvidia-ml_loader.so'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Located libnvidia-ml_loader.so as [/usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libnvidia-ml_loader.so]
INFO[0000] Selecting /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libnvidia-ml_loader.so as /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/libnvidia-ml_loader.so
DEBU[0000] Locating libdxcore.so
DEBU[0000] Checking candidate '/usr/lib/wsl/lib/libdxcore.so'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Located libdxcore.so as [/usr/lib/wsl/lib/libdxcore.so]
INFO[0000] Selecting /usr/lib/wsl/lib/libdxcore.so as /usr/lib/wsl/lib/libdxcore.so
DEBU[0000] Locating nvcubins.bin
DEBU[0000] Checking candidate '/usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/nvcubins.bin'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Located nvcubins.bin as [/usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/nvcubins.bin]
INFO[0000] Selecting /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/nvcubins.bin as /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/nvcubins.bin
DEBU[0000] Locating nvidia-smi
DEBU[0000] Checking candidate '/usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/nvidia-smi'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Located nvidia-smi as [/usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/nvidia-smi]
INFO[0000] Selecting /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/nvidia-smi as /usr/lib/wsl/drivers/nvlti.inf_amd64_9a2c79b60d6607c6/nvidia-smi
DEBU[0000] returning cached mounts
DEBU[0000] returning cached mounts
INFO[0000] Generated CDI spec with version 0.3.0
➜ nvidia-ctk cdi list
No help topic for 'list'
➜ podman run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi
Error: stat nvidia.com/gpu=all: no such file or directory