zengzhengrong opened this issue 2 years ago
You seem to be running the device plugin under WSL2. This is not currently a supported use case of the device plugin. The specific reason is that device nodes on WSL2 and Linux systems are not the same and as such the CPU Manager Workaround (which includes the device nodes in the container being launched) does not work as expected.
All right. I followed this guide, https://docs.nvidia.com/cuda/wsl-user-guide/index.html#getting-started-with-cuda-on-wsl, to install CUDA on WSL. Looking at the known limitations, https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-apps, there is nothing that prevents installing k8s on WSL, and running a GPU workload with the ctr command works just as well.
@elezar would you guys put this on the roadmap? Our company is running Windows but we want to transition to Linux, so WSL2 seems like a natural choice. We run deep learning workloads that require CUDA support, and while Docker Desktop does support GPU workloads, it would be strange for this not to work in normal WSL2 containers as well.
Hi @elezar , in case it's unlikely to appear on the roadmap soon, could you please describe a rough plan of how the support should be added? And whether executing the plan would be doable by outside contributors? Thanks!
@patrykkaj I think that in theory this could be done by outside contributors and is simplified by the recent changes to support Tegra-based systems. What I can see happening here is that we detect a WSL2 system by checking for the WSL2-specific driver library (dxcore.so.1). Some things to note here: on WSL2 the GPU is exposed through the /dev/dxg device node and not /dev/nvidia* device nodes.
If you feel comfortable creating an MR against https://gitlab.com/nvidia/kubernetes/device-plugin that adds this functionality, we can work together on getting it in.
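A minimal sketch of such a detection check, using only the paths mentioned above (the isWSL2 helper and package layout are made up for illustration; this is not the plugin's actual detection code):

package main

import (
	"fmt"
	"os"
)

// isWSL2 reports whether the node looks like a WSL2 system by checking for
// the /dev/dxg device node. Checking that the WSL2 driver library
// (dxcore.so.1 / libdxcore.so) can be located would be an alternative or
// additional signal.
func isWSL2() bool {
	_, err := os.Stat("/dev/dxg")
	return err == nil
}

func main() {
	fmt.Println("WSL2 detected:", isWSL2())
}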
Hello,
I was interested in this, and I adapted the plugin to work.
I pushed my version to GitLab (https://gitlab.com/Vinrobot/nvidia-kubernetes-device-plugin/-/tree/features/wsl2) and it works on my machine.
I also had to modify NVIDIA/gpu-monitoring-tools (https://github.com/Vinrobot/nvidia-gpu-monitoring-tools/tree/features/wsl2) to also use /dev/dxg.
I can try to do a clean version, but I don't really know how to correctly check whether /dev/dxg is an NVIDIA GPU or an incompatible device. Does someone have a good idea?
@Vinrobot thanks for the work here. Some thoughts on this:
We recently moved away from nvidia-gpu-monitoring-tools and use bindings from go-nvml through go-nvlib instead.
I think the steps outlined in https://github.com/NVIDIA/k8s-device-plugin/issues/332#issuecomment-1262288033 should be considered as the starting point. Check if dxcore.so.1 is available and, if it is, assume a WSL2 system (one could also check for the existence of /dev/dxg here). In this case, create a wslDevice type that implements the deviceInfo interface and ensure that this gets instantiated when enumerating devices. This can then return 0 for the minor number and return the correct path.
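To make this concrete, a rough sketch of what such a type could look like (the deviceInfo method set shown here is an assumption for illustration; the real interface in the plugin may have a different shape):

package main

import "fmt"

// deviceInfo is a stand-in for the plugin's internal device abstraction.
// Its method set is assumed for this sketch.
type deviceInfo interface {
	GetUUID() (string, error)
	GetPaths() ([]string, error)
	GetMinorNumber() (int, error)
}

// wslDevice represents a GPU exposed through the WSL2 /dev/dxg device node.
type wslDevice struct {
	uuid string
}

// GetUUID returns the UUID reported for this device.
func (d wslDevice) GetUUID() (string, error) { return d.uuid, nil }

// GetPaths returns the WSL2 device node rather than /dev/nvidia*.
func (d wslDevice) GetPaths() ([]string, error) { return []string{"/dev/dxg"}, nil }

// GetMinorNumber returns 0, since per-GPU minor numbers do not apply on WSL2.
func (d wslDevice) GetMinorNumber() (int, error) { return 0, nil }

func main() {
	var d deviceInfo = wslDevice{uuid: "GPU-00000000-0000-0000-0000-000000000000"}
	paths, _ := d.GetPaths()
	fmt.Println(paths)
}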
With regards to the following:
I can try to do a clean version, but I don't really know how to correctly check if /dev/dxg is a NVIDIA GPU or an incompatible device, does someone have a good idea?
I don't think that this is required. If there are no NVIDIA GPUs available on the system then the NVML enumeration that is used to list the devices would not be expected to work. This should already be handled by the lower-level components of the NVIDIA container stack.
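For reference, the NVML enumeration in question is roughly the following (a minimal sketch that uses the go-nvml bindings directly; the plugin itself goes through go-nvlib, so the real code looks different):

package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	// If no NVIDIA GPU (or driver library) is available, Init fails and
	// there is nothing to advertise to the kubelet.
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("could not initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("could not get device count: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		device, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			log.Fatalf("could not get device %d: %v", i, nvml.ErrorString(ret))
		}
		uuid, _ := device.GetUUID()
		fmt.Printf("device %d: %s\n", i, uuid)
	}
}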
Hi @elezar, Thanks for the feedback.
I tried to make it work with the most recent version, but I got this error (on the pod)
Warning UnexpectedAdmissionError 30s kubelet Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to retrieve list of available devices: error creating nvml.Device 0: unsupported GPU device, which is unexpected
This is caused by this line in gpu-monitoring-tools (still used by gpuallocator).
As it's the same issue as before, I can re-use my custom version of gpu-monitoring-tools to make it work, but that's not the goal. Anyway, I will look into it tomorrow.
@Vinrobot yes, it is an issue that gpuallocator still uses gpu-monitoring-tools. It is on our roadmap to port it to the go-nvml bindings, but this is not yet complete.
The issue is the call to get an aligned allocation here. (You can confirm this by removing this section.)
If this does work, what we would need is a mechanism to disable this for WSL2 devices.
One option would be to add an AlignedAllocationSupported() bool function to the Devices and Device types. This could look something like:
// AlignedAllocationSupported checks whether all devices support an aligned allocation
func (ds Devices) AlignedAllocationSupported() bool {
	for _, d := range ds {
		if !d.AlignedAllocationSupported() {
			return false
		}
	}
	return true
}

// AlignedAllocationSupported checks whether the device supports an aligned allocation
func (d Device) AlignedAllocationSupported() bool {
	if d.IsMigDevice() {
		return false
	}
	for _, p := range d.Paths {
		if p == "/dev/dxg" {
			return false
		}
	}
	return true
}
(Note that this should still be discussed and could definitely be improved, but would be a good starting point).
Hi @elezar,
I'm also interested in running the device plugin with WSL2. I have created an MR https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291
Would be great to get those changes in.
Thanks @achim92 -- I will have a look at the MR.
Note that with the v1.13.0 release of the NVIDIA Container Toolkit we now support the generation of CDI specifications on WSL2 based systems. Support for consuming this and generating a spec for available devices was included in the v0.14.0 version of the device plugin. This was largely targeted at usage in the context of our GPU operator, but could be generalised to also support WSL2-based systems without requiring additional device plugin changes.
hi @elezar, Does v0.14.0 support adding GPU resources to Capacity and Allocatable? I'm using WSL2 + v0.14.0, and the device plugin logs are showing "No devices found. Waiting indefinitely."
I0515 07:23:12.247146 1 main.go:154] Starting FS watcher.
I0515 07:23:12.247248 1 main.go:161] Starting OS watcher.
I0515 07:23:12.248352 1 main.go:176] Starting Plugins.
I0515 07:23:12.248389 1 main.go:234] Loading configuration.
I0515 07:23:12.248530 1 main.go:242] Updating config with default resource matching patterns.
I0515 07:23:12.248786 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0515 07:23:12.248816 1 main.go:256] Retreiving plugins.
I0515 07:23:12.251257 1 factory.go:107] Detected NVML platform: found NVML library
I0515 07:23:12.251330 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0515 07:23:12.270094 1 main.go:287] No devices found. Waiting indefinitely.
Thanks @elezar, that would be even better without requiring additional device plugin changes.
I have generated the CDI spec with nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml:
cdiVersion: 0.3.0
containerEdits:
hooks:
- args:
- nvidia-ctk
- hook
- create-symlinks
- --link
- /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvidia-smi::/usr/bin/nvidia-smi
hookName: createContainer
path: /usr/bin/nvidia-ctk
- args:
- nvidia-ctk
- hook
- update-ldcache
- --folder
- /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4
- --folder
- /usr/lib/wsl/lib
hookName: createContainer
path: /usr/bin/nvidia-ctk
mounts:
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml_loader.so
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml_loader.so
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/lib/libdxcore.so
hostPath: /usr/lib/wsl/lib/libdxcore.so
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvcubins.bin
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvcubins.bin
options:
- ro
- nosuid
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvidia-smi
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvidia-smi
options:
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda.so.1.1
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda.so.1.1
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda_loader.so
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda_loader.so
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ptxjitcompiler.so.1
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ptxjitcompiler.so.1
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml.so.1
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml.so.1
options:
- ro
- nosuid
- nodev
- bind
devices:
- containerEdits:
deviceNodes:
- path: /dev/dxg
name: all
kind: nvidia.com/gpu
I also removed the NVIDIA Container Runtime hook under /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json.
How can I enable CDI to make it work? I'm using cri-o as container runtime, so CDI support should be enabled by default.
I0515 08:39:51.471150 1 main.go:154] Starting FS watcher.
I0515 08:39:51.471416 1 main.go:161] Starting OS watcher.
I0515 08:39:51.472727 1 main.go:176] Starting Plugins.
I0515 08:39:51.472771 1 main.go:234] Loading configuration.
I0515 08:39:51.473017 1 main.go:242] Updating config with default resource matching patterns.
I0515 08:39:51.473350 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0515 08:39:51.473380 1 main.go:256] Retreiving plugins.
W0515 08:39:51.473833 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0515 08:39:51.474021 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0515 08:39:51.474878 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0515 08:39:51.474918 1 factory.go:115] Incompatible platform detected
E0515 08:39:51.474925 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0515 08:39:51.474930 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0515 08:39:51.474934 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0515 08:39:51.474937 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0515 08:39:51.474946 1 main.go:287] No devices found. Waiting indefinitely.
@elezar could you please give some guidance here?
hi @elezar, Does v0.14.0 support adding GPU resources to Capacity and Allocatable? I'm using WSL2 + v0.14.0, and the device plugin logs are showing "No devices found. Waiting indefinitely."
Hi brother, I've encountered the same issue. Have you managed to solve it?
Note: We have https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 under review from @achim92 to allow the device plugin to work under WSL2. Testing of the changes there would be welcomed.
Hi @elezar ,
How can I test your changes? Do I need to create a new image and install the plugin into my k8s cluster using https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml as a template?
Thanks
@elezar We are also interested in this. I believe registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 would be the right image, right?
✔️ registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016
WSL version: 1.2.5.0
Kernel version: 5.15.90.1
WSLg version: 1.0.51
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.19044.3208
$ k3s --version
k3s version v1.26.4+k3s1 (8d0255af)
go version go1.19.8
Tue Jul 25 16:36:46 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.04 Driver Version: 536.25 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A2000 8GB Lap... On | 00000000:01:00.0 Off | N/A |
| N/A 46C P8 3W / 40W | 0MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
##deleted processes table##
I0725 06:26:03.108417 1 main.go:154] Starting FS watcher.
I0725 06:26:03.108468 1 main.go:161] Starting OS watcher.
I0725 06:26:03.108974 1 main.go:176] Starting Plugins.
I0725 06:26:03.108995 1 main.go:234] Loading configuration.
I0725 06:26:03.109063 1 main.go:242] Updating config with default resource matching patterns.
I0725 06:26:03.109205 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0725 06:26:03.109219 1 main.go:256] Retrieving plugins.
I0725 06:26:03.113336 1 factory.go:107] Detected NVML platform: found NVML library
I0725 06:26:03.113372 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0725 06:26:03.138677 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0725 06:26:03.139033 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0725 06:26:03.143248 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
Used the example from https://docs.k3s.io/advanced#nvidia-container-runtime-support
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6
> Compute 8.6 CUDA device: [NVIDIA RTX A2000 8GB Laptop GPU]
20480 bodies, total time for 10 iterations: 25.066 ms
= 167.327 billion interactions per second
= 3346.542 single-precision GFLOP/s at 20 flops per interaction
Stream closed EOF for default/nbody-gpu-benchmark (cuda-container)
Thank you @elezar . I hope this commit can be merged into this repo and published asap 🚀 !
@davidshen84 I can also confirm it works. However, we have to add some additional stuff:
$ touch /run/nvidia/validations/toolkit-ready
$ touch /run/nvidia/validations/driver-ready
$ mkdir -p /run/nvidia/driver/dev
$ ln -s /run/nvidia/driver/dev/dxg /dev/dxg
Annotate the WSL node:
nvidia.com/gpu-driver-upgrade-state: pod-restart-required
nvidia.com/gpu.count: '1'
nvidia.com/gpu.deploy.container-toolkit: 'true'
nvidia.com/gpu.deploy.dcgm: 'true'
nvidia.com/gpu.deploy.dcgm-exporter: 'true'
nvidia.com/gpu.deploy.device-plugin: 'true'
nvidia.com/gpu.deploy.driver: 'true'
nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
nvidia.com/gpu.deploy.node-status-exporter: 'true'
nvidia.com/gpu.deploy.nvsm: ''
nvidia.com/gpu.deploy.operands: 'true'
nvidia.com/gpu.deploy.operator-validator: 'true'
nvidia.com/gpu.present: 'true'
nvidia.com/device-plugin.config: 'RTX-4070-Ti'
Change device plugin in ClusterPolicy:
devicePlugin:
config:
name: time-slicing-config
enabled: true
env:
- name: PASS_DEVICE_SPECS
value: 'true'
- name: FAIL_ON_INIT_ERROR
value: 'true'
- name: DEVICE_LIST_STRATEGY
value: envvar
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
image: k8s-device-plugin
imagePullPolicy: IfNotPresent
repository: registry.gitlab.com/nvidia/kubernetes/device-plugin/staging
version: 8b416016
It should work for now:
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined. Default to use Ampere
GPU Device 0: "Ampere" with compute capability 8.9
> Compute 8.9 CUDA device: [NVIDIA GeForce RTX 4070 Ti]
61440 bodies, total time for 10 iterations: 34.665 ms
= 1088.943 billion interactions per second
= 21778.869 single-precision GFLOP/s at 20 flops per interaction
I created the RuntimeClass resource and added the runtimeClassName property to the pods.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia
handler: nvidia
I did not add those properties you mentioned. Why do I need them?
Thanks
@davidshen84 Because I used the gpu-operator for automatic GPU provision
Thanks for the tip!
I verified that the staging image registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 is truly working on WSL2.
Step 1: install a k3s cluster based on dockerd.
curl -sfL https://get.k3s.io | sh -s - --docker
Step 2: install the device plugin with the staging image.
# set RuntimeClass
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia
handler: docker
EOF
# install nvdp
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--namespace nvdp \
--create-namespace \
--set=runtimeClassName=nvidia \
--set=image.repository=registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin \
--set=image.tag=8b416016
Step 1: install a k3s cluster based on containerd.
curl -sfL https://get.k3s.io | sh -
Step 2: install the device plugin with the staging image.
# set RuntimeClass
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia
handler: nvidia # change the handler to `nvidia` for containerd
EOF
# install nvdp with the same steps as above.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
restartPolicy: Never
runtimeClassName: nvidia
containers:
- name: cuda-container
image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
resources:
limits:
nvidia.com/gpu: 1 # requesting 1 GPU
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
EOF
And the cuda-sample-vectoradd example works normally. Waiting for the next working release on WSL2 😃😃
Note: We have https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 under review from @achim92 to allow the device plugin to work under WSL2. Testing of the changes there would be welcomed.
Hi @elezar, I see this PR was merged in the upstream repository a while ago. What's the plan for publishing it on GitHub?
Hi @elezar,
I can confirm registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 is working for me, even though my GPU card is a Quadro P1000. :) I can move forward to test Koordinator.
itadmin@server:~/repos/k3s-on-wsl2$ cat /proc/version
Linux version 5.15.90.1-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Fri Jan 27 02:56:13 UTC 2023
itadmin@server:~/repos/k3s-on-wsl2$ sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Wed Aug 16 06:21:56 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.14 Driver Version: 528.86 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P1000 On | 00000000:01:00.0 On | N/A |
| 34% 39C P8 N/A / 47W | 1061MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 23 G /Xwayland N/A |
+-----------------------------------------------------------------------------+
itadmin@server:~/repos/k3s-on-wsl2$ sudo kubectl -n kube-system logs nvidia-device-plugin-daemonset-q642m
I0816 06:20:28.927429 1 main.go:154] Starting FS watcher.
I0816 06:20:28.927534 1 main.go:161] Starting OS watcher.
I0816 06:20:28.927691 1 main.go:176] Starting Plugins.
I0816 06:20:28.927698 1 main.go:234] Loading configuration.
I0816 06:20:28.927762 1 main.go:242] Updating config with default resource matching patterns.
I0816 06:20:28.927936 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0816 06:20:28.927960 1 main.go:256] Retrieving plugins.
I0816 06:20:28.930313 1 factory.go:107] Detected NVML platform: found NVML library
I0816 06:20:28.930362 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0816 06:20:28.947623 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0816 06:20:28.948059 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0816 06:20:28.949737 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
itadmin@server:~/repos/k3s-on-wsl2$ sudo kubectl get nodes -o yaml
apiVersion: v1
items:
- apiVersion: v1
kind: Node
metadata:
annotations:
etcd.k3s.cattle.io/node-address: 172.18.88.17
etcd.k3s.cattle.io/node-name: server-d622491e
flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"52:95:ba:16:e9:29"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 172.18.88.17
k3s.io/node-args: '["server","--cluster-init","true","--etcd-expose-metrics","true","--disable","traefik","--disable-cloud-controller","true","--docker","true","--kubelet-arg","node-status-update-frequency=4s","--kube-controller-manager-arg","node-monitor-period=2s","--kube-controller-manager-arg","node-monitor-grace-period=16s","--kube-apiserver-arg","default-not-ready-toleration-seconds=20","--kube-apiserver-arg","default-unreachable-toleration-seconds=20","--write-kubeconfig","/home/itadmin/.kube/config","--private-registry","/etc/rancher/k3s/registry.yaml","--flannel-iface","eth0","--bind-address","172.18.88.17","--https-listen-port","6443","--advertise-address","172.18.88.17","--log","/var/log/k3s-server.log"]'
k3s.io/node-config-hash: IDWWDZRIJO5DHZKGYYHONVZC2DN7TK7THKPSONCFR74ST4LAGNGQ====
k3s.io/node-env: '{"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/c26e7571d760c5f199d18efd197114f1ca4ab1e6ffe494f96feb65c87fcb8cf0"}'
node.alpha.kubernetes.io/ttl: "0"
volumes.kubernetes.io/controller-managed-attach-detach: "true"
creationTimestamp: "2023-08-16T05:47:03Z"
finalizers:
- wrangler.cattle.io/managed-etcd-controller
- wrangler.cattle.io/node
labels:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/os: linux
kubernetes.io/arch: amd64
kubernetes.io/hostname: server
kubernetes.io/os: linux
node-role.kubernetes.io/control-plane: "true"
node-role.kubernetes.io/etcd: "true"
node-role.kubernetes.io/master: "true"
name: server
resourceVersion: "8151"
uid: 04b6a572-830c-4102-a9a9-15265e4f6a15
spec:
podCIDR: 10.42.0.0/24
podCIDRs:
- 10.42.0.0/24
status:
addresses:
- address: 172.18.88.17
type: InternalIP
- address: server
type: Hostname
allocatable:
cpu: "4"
ephemeral-storage: "1027046117185"
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 32760580Ki
nvidia.com/gpu: "1"
pods: "110"
capacity:
cpu: "4"
ephemeral-storage: 1055762868Ki
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 32760580Ki
nvidia.com/gpu: "1"
pods: "110"
conditions:
- lastHeartbeatTime: "2023-08-16T06:20:34Z"
lastTransitionTime: "2023-08-16T05:47:03Z"
message: kubelet has sufficient memory available
reason: KubeletHasSufficientMemory
status: "False"
type: MemoryPressure
- lastHeartbeatTime: "2023-08-16T06:20:34Z"
lastTransitionTime: "2023-08-16T05:47:03Z"
message: kubelet has no disk pressure
reason: KubeletHasNoDiskPressure
status: "False"
type: DiskPressure
- lastHeartbeatTime: "2023-08-16T06:20:34Z"
lastTransitionTime: "2023-08-16T05:47:03Z"
message: kubelet has sufficient PID available
reason: KubeletHasSufficientPID
status: "False"
type: PIDPressure
- lastHeartbeatTime: "2023-08-16T06:20:34Z"
lastTransitionTime: "2023-08-16T05:47:07Z"
message: kubelet is posting ready status
reason: KubeletReady
status: "True"
type: Ready
daemonEndpoints:
kubeletEndpoint:
Port: 10250
images:
- names:
- nvcr.io/nvidia/tensorflow@sha256:7b74f2403f62032db8205cf228052b105bd94f2871e27c1f144c5145e6072984
- nvcr.io/nvidia/tensorflow:20.03-tf2-py3
sizeBytes: 7440987700
- names:
- 192.168.0.96:5000/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin@sha256:35ef4e7f7070e9ec0c9d9f9658200ce2dd61b53a436368e8ea45ec02ced78559
- 192.168.0.96:5000/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016
sizeBytes: 298298015
- names:
- 192.168.0.96:5000/nvidia/k8s-device-plugin@sha256:68fa1607030680a5430ee02cf4fdce040c99436d680ae24ba81ef5bbf4409e8e
- nvcr.io/nvidia/k8s-device-plugin@sha256:15c4280d13a61df703b12d1fd1b5b5eec4658157db3cb4b851d3259502310136
- 192.168.0.96:5000/nvidia/k8s-device-plugin:v0.14.1
- nvcr.io/nvidia/k8s-device-plugin:v0.14.1
sizeBytes: 298277535
- names:
- nvidia/cuda@sha256:4b0c83c0f2e66dc97b52f28c7acf94c1461bfa746d56a6f63c0fef5035590429
- nvidia/cuda:11.6.2-base-ubuntu20.04
sizeBytes: 153991389
- names:
- rancher/mirrored-metrics-server@sha256:16185c0d4d01f8919eca4779c69a374c184200cd9e6eded9ba53052fd73578df
- rancher/mirrored-metrics-server:v0.6.2
sizeBytes: 68892890
- names:
- rancher/mirrored-coredns-coredns@sha256:823626055cba80e2ad6ff26e18df206c7f26964c7cd81a8ef57b4dc16c0eec61
- rancher/mirrored-coredns-coredns:1.9.4
sizeBytes: 49802873
- names:
- rancher/local-path-provisioner@sha256:db1a3225290dd8be481a1965fc7040954d0aa0e1f86a77c92816d7c62a02ae5c
- rancher/local-path-provisioner:v0.0.23
sizeBytes: 37443889
- names:
- rancher/mirrored-pause@sha256:74c4244427b7312c5b901fe0f67cbc53683d06f4f24c6faee65d4182bf0fa893
- rancher/mirrored-pause:3.6
sizeBytes: 682696
nodeInfo:
architecture: amd64
bootID: de2732a0-17d9-4272-a205-7b9ac1103e2b
containerRuntimeVersion: docker://20.10.25
kernelVersion: 5.15.90.1-microsoft-standard-WSL2
kubeProxyVersion: v1.26.3+k3s1
kubeletVersion: v1.26.3+k3s1
machineID: 53da58bf9ac14c33847a4b6e1269419b
operatingSystem: linux
osImage: Ubuntu 22.04.3 LTS
systemUUID: 53da58bf9ac14c33847a4b6e1269419b
kind: List
metadata:
resourceVersion: ""
Tested and documented in qbo with: https://docs.qbo.io/#/ai_and_ml?id=kubeflow
Thanks to @achim92's contribution and @elezar's approval :)
Please note that on Linux the default Helm chart works in qbo and kind, so there is no need for this.
This fix also works for kind Kubernetes using accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml and:
extraMounts:
- hostPath: /dev/null
containerPath: /var/run/nvidia-container-devices/all
For more details, see:
https://github.com/kubernetes-sigs/kind/pull/3257#issuecomment-1607287275
The NVIDIA GPU operator requires a manual label, feature.node.kubernetes.io/pci-10de.present=true, for node-feature-discovery to add all the labels needed for the GPU operator to work. This applies only to kind and qbo; I am not sure why k8s requires more labels, as indicated here: https://github.com/NVIDIA/k8s-device-plugin/issues/332#issuecomment-1649468259
The label can be added as follows:
for i in $(kubectl get no --selector '!node-role.kubernetes.io/control-plane' -o json | jq -r '.items[].metadata.name'); do
kubectl label node $i feature.node.kubernetes.io/pci-10de.present=true
done
The reason is that WSL2 doesn't contain PCI info under /sys, so node-feature-discovery is unable to detect the GPU.
I believe the relevant code is here: node-feature-discovery/source/usb/utils.go:106
I believe node-feature-discovery is expecting something like the output below to build the 10de label:
lspci -nn |grep -i nvidia
0000:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile / Max-Q] [10de:2560] (rev a1)
0000:01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:228e] (rev a1)
I believe the right place to add this label is once the driver has been detected on the host. See here:
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/881/diffs
I'll add my comments there.
I built a new image based on https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 for testing purposes, but it also works with the one provided here: https://github.com/NVIDIA/k8s-device-plugin/issues/332#issuecomment-1649010456
git branch
* device-plugin-wsl2
I created a Docker image with changes similar to this:
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/881/diffs
Thank you for working on this, now that WSL2 supports systemd I think more people will be running k8s on Windows.
Can confirm registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 works on a kubeadm-deployed cluster with Driver Version: 551.23 and a 2080 Ti.
Just a general note: we will release a v0.15.0-rc.1 of the GPU Device Plugin in the next week or so, including these changes. That should then allow us to get more concrete feedback on the released version instead of relying on the SHA-tagged image.
Hi @elezar, any update on when v0.15.0-rc.1 is going to be out?
v0.15.0-rc1 successfully enabled my scenario today: https://github.com/mrjohnsonalexander/classic
TL;DR Stack notes
1. Issue or feature description
helm install nvidia-device-plugin
nvidia-device-plugin-ctr logs
When I use ctr to run a test, the GPU works fine.
3. Information to attach (optional if deemed irrelevant)
Common error checking:
[ ] The output of nvidia-smi -a on your host
[ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
[ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
Additional information that might help better understand your environment and reproduce the bug:
[ ] Any relevant kernel output lines from dmesg
[ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
[ ] NVIDIA container library version from nvidia-container-cli -V
containerd config containerd.toml