NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Getting GPU device minor number: Not Supported #332

Open zengzhengrong opened 2 years ago

zengzhengrong commented 2 years ago

1. Issue or feature description

helm install nvidia-device-plugin

 helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace nvidia-device-plugin --create-namespace --version 0.12.2 

nvidia-device-plugin-ctr logs

2022/09/06 15:24:00 Starting FS watcher.
2022/09/06 15:24:00 Starting OS watcher.
2022/09/06 15:24:00 Starting Plugins.
2022/09/06 15:24:00 Loading configuration.
2022/09/06 15:24:00 Initializing NVML.
2022/09/06 15:24:00 Updating config with default resource matching patterns.
2022/09/06 15:24:00 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "index"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2022/09/06 15:24:00 Retreiving plugins.
panic: Unable to load resource managers to manage plugin devices: error building device map: error building device map from config.resources: error building GPU device map: error building GPU Device: error getting device paths: error getting GPU device minor number: Not Supported

goroutine 1 [running]:
main.(*migStrategyNone).GetPlugins(0xc000010a30)
    /build/cmd/nvidia-device-plugin/mig-strategy.go:57 +0x1a5
main.startPlugins(0xc0000e5c58?, {0xc0001cc460, 0x9, 0xe}, 0x9?)
    /build/cmd/nvidia-device-plugin/main.go:247 +0x4bd
main.start(0x10d7b20?, {0xc0001cc460, 0x9, 0xe})
    /build/cmd/nvidia-device-plugin/main.go:147 +0x355
main.main.func1(0xc0001cc460?)
    /build/cmd/nvidia-device-plugin/main.go:43 +0x32
github.com/urfave/cli/v2.(*App).RunContext(0xc0001e8820, {0xca9328?, 0xc00003a050}, {0xc000032230, 0x1, 0x1})
    /build/vendor/github.com/urfave/cli/v2/app.go:322 +0x953
github.com/urfave/cli/v2.(*App).Run(...)
    /build/vendor/github.com/urfave/cli/v2/app.go:224
main.main()
    /build/cmd/nvidia-device-plugin/main.go:91 +0x665

When I use ctr to run a test GPU workload, it works fine:

ctr run --rm --gpus 0 nvcr.io/nvidia/k8s/cuda-sample:nbody test-gpu /tmp/nbody -benchmark

Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance) 
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Pascal" with compute capability 6.1

> Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1060 3GB]
9216 bodies, total time for 10 iterations: 7.467 ms
= 113.747 billion interactions per second
= 2274.931 single-precision GFLOP/s at 20 flops per interaction

3. Information to attach (optional if deemed irrelevant)

Common error checking:

Additional information that might help better understand your environment and reproduce the bug:

nvidia-container-cli list  
/dev/dxg
/usr/lib/wsl/drivers/nv_dispi.inf_amd64_47917a79b8c7fd22/nvidia-smi
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/libdxcore.so

containerd config (containerd.toml):

[plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvdia"
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
      no_pivot = false
      snapshotter = "overlayfs"

      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        base_runtime_spec = ""
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_root = ""
        runtime_type = "io.containerd.runtime.v1.linux"

        [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]
        Runtime = "nvidia-container-runtime"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvdia]
          base_runtime_spec = ""
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runtime.v1.linux"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvdia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            Runtime = "nvidia-container-runtime"
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = false

      [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
        base_runtime_spec = ""
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_root = ""
        runtime_type = ""

        [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime.options]

    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = "node"

    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = ""

      [plugins."io.containerd.grpc.v1.cri".registry.auths]

      [plugins."io.containerd.grpc.v1.cri".registry.configs]

      [plugins."io.containerd.grpc.v1.cri".registry.headers]

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.runtime.v1.linux"]
    no_shim = false
    runtime = "nvidia-container-runtime"
    runtime_root = ""
    shim = "containerd-shim"
    shim_debug = false
elezar commented 2 years ago

You seem to be running the device plugin under WSL2. This is not currently a supported use case of the device plugin. The specific reason is that device nodes on WSL2 and Linux systems are not the same and as such the CPU Manager Workaround (which includes the device nodes in the container being launched) does not work as expected.
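For context, the panic in the original report appears to come from the NVML query for the device minor number. A minimal sketch of that query using the github.com/NVIDIA/go-nvml bindings (an illustration, not the plugin's actual code) looks roughly like this; on native Linux it returns the N in /dev/nvidiaN, while on WSL2, where only /dev/dxg exists, the call returns ERROR_NOT_SUPPORTED:

package main

import (
    "fmt"
    "log"

    "github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
    if ret := nvml.Init(); ret != nvml.SUCCESS {
        log.Fatalf("failed to initialize NVML: %v", nvml.ErrorString(ret))
    }
    defer nvml.Shutdown()

    // Query the first GPU; the device plugin performs a similar query for
    // every device it enumerates.
    device, ret := nvml.DeviceGetHandleByIndex(0)
    if ret != nvml.SUCCESS {
        log.Fatalf("failed to get device 0: %v", nvml.ErrorString(ret))
    }

    minor, ret := device.GetMinorNumber()
    if ret == nvml.ERROR_NOT_SUPPORTED {
        // This is the "Not Supported" error surfaced in the panic above.
        fmt.Println("GPU minor number not supported (expected on WSL2)")
        return
    }
    if ret != nvml.SUCCESS {
        log.Fatalf("failed to get minor number: %v", nvml.ErrorString(ret))
    }
    fmt.Printf("GPU 0 minor number: %d (/dev/nvidia%d)\n", minor, minor)
}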

zengzhengrong commented 2 years ago

You seem to be running the device plugin under WSL2. This is not currently a supported use case of the device plugin. The specific reason is that device nodes on WSL2 and Linux systems are not the same and as such the CPU Manager Workaround (which includes the device nodes in the container being launched) does not work as expected.

All right. I followed this guide, https://docs.nvidia.com/cuda/wsl-user-guide/index.html#getting-started-with-cuda-on-wsl, to install CUDA on WSL. Looking at the known limitations (https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-apps), there is nothing that prevents installing k8s on WSL, and running GPU workloads with the ctr command works just fine.

wizpresso-steve-cy-fan commented 2 years ago

@elezar would you put this on the roadmap? Our company is running Windows but we want to transition to Linux, so WSL2 seems like a natural choice. We are running deep learning workloads that require CUDA support, and while Docker Desktop does support GPU workloads, it would be strange not to see this work in normal WSL2 containers as well.

patrykkaj commented 2 years ago

Hi @elezar, in case it's unlikely to appear on the roadmap soon, could you please describe a rough plan for how support should be added, and whether executing that plan would be doable by outside contributors? Thanks!

elezar commented 2 years ago

@patrykkaj I think that in theory this could be done by outside contributors and is simplified by the recent changes to support Tegra-based systems. What I can see happening here is that:

  1. we detect whether this is a WSL2 system (e.g. by checking for the presence of dxcore.so.1)
  2. modify / extend the NVML resource manager to create a device that does not require the device minor number.

Some things to note here:

If you feel comfortable creating an MR against https://gitlab.com/nvidia/kubernetes/device-plugin that adds this functionality, we can work together on getting it in.
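As a rough illustration of step 1 above (a sketch based on assumptions, not the plugin's actual detection logic), the check could be as simple as looking for the WSL2 device node or a dxcore library; the paths below are taken from the nvidia-container-cli output earlier in this thread:

package main

import (
    "fmt"
    "os"
)

// isWSL2 guesses whether this node is a WSL2 system by looking for the
// /dev/dxg device node or a libdxcore library on disk.
func isWSL2() bool {
    if _, err := os.Stat("/dev/dxg"); err == nil {
        return true
    }
    candidates := []string{
        "/usr/lib/wsl/lib/libdxcore.so",
        "/usr/lib/x86_64-linux-gnu/libdxcore.so",
    }
    for _, p := range candidates {
        if _, err := os.Stat(p); err == nil {
            return true
        }
    }
    return false
}

func main() {
    fmt.Println("WSL2 detected:", isWSL2())
}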

Vinrobot commented 2 years ago

Hello,

I was interested in this, and I adapted the plugin to work. I pushed my version to GitLab (https://gitlab.com/Vinrobot/nvidia-kubernetes-device-plugin/-/tree/features/wsl2) and it works on my machine. I also had to modify NVIDIA/gpu-monitoring-tools (https://github.com/Vinrobot/nvidia-gpu-monitoring-tools/tree/features/wsl2) to also use /dev/dxg.

I can try to do a clean version, but I don't really know how to correctly check whether /dev/dxg is an NVIDIA GPU or an incompatible device. Does someone have a good idea?

elezar commented 2 years ago

@Vinrobot thanks for the work here. Some thoughts on this:

We recently moved away from nvidia-gpu-monitoring-tools and use bindings from go-nvml through go-nvlib instead.

I think the steps outlined in https://github.com/NVIDIA/k8s-device-plugin/issues/332#issuecomment-1262288033 should be considered as the starting point. Check if dxcore.so.1 is available and, if it is, assume a WSL2 system (one could also check for the existence of /dev/dxg here). In this case, create a wslDevice that implements the deviceInfo interface and ensure that this gets instantiated when enumerating devices. It can then return 0 for the minor number and return the correct path.
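To make that idea concrete, a hypothetical sketch of such a wslDevice could look like the following; the deviceInfo interface shown here is an illustrative stand-in, not the plugin's real interface, and the method names are assumptions:

package main

import "fmt"

// deviceInfo is an illustrative stand-in for the plugin's device abstraction.
type deviceInfo interface {
    MinorNumber() (int, error)
    Paths() ([]string, error)
}

// wslDevice represents a GPU exposed through the single /dev/dxg node on WSL2.
type wslDevice struct {
    index int
}

// MinorNumber returns 0 unconditionally: there is no per-GPU /dev/nvidiaN
// node on WSL2 from which a minor number could be derived.
func (d wslDevice) MinorNumber() (int, error) { return 0, nil }

// Paths returns the WSL2 device node instead of /dev/nvidiaN.
func (d wslDevice) Paths() ([]string, error) { return []string{"/dev/dxg"}, nil }

func main() {
    var dev deviceInfo = wslDevice{index: 0}
    paths, _ := dev.Paths()
    minor, _ := dev.MinorNumber()
    fmt.Printf("device paths=%v minor=%d\n", paths, minor)
}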

With regards to the following:

I can try to do a clean version, but I don't really know how to correctly check whether /dev/dxg is an NVIDIA GPU or an incompatible device. Does someone have a good idea?

I don't think that this is required. If there are no NVIDIA GPUs available on the system then the NVML enumeration that is used to list the devices would not be expected to work. This should already be handled by the lower-level components of the NVIDIA container stack.

Vinrobot commented 2 years ago

Hi @elezar, Thanks for the feedback.

I tried to make it work with the most recent version, but I got this error (on the pod):

Warning  UnexpectedAdmissionError  30s   kubelet            Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to retrieve list of available devices: error creating nvml.Device 0: unsupported GPU device, which is unexpected

which is caused by this line in gpu-monitoring-tools (still used by gpuallocator).

As it's the same issue as before, I can re-use my custom version of gpu-monitoring-tools to make it work, but that's not the goal. Anyway, I will look into it tomorrow.

elezar commented 2 years ago

@Vinrobot yes, it is an issue that gpuallocator still uses gpu-monitoring-tools. It is on our roadmap to port it to the go-nvml bindings, but this is not yet complete.

The issue is the call to get an aligned allocation here. (You can confirm this by removing that section.)

If this does work, what we would need is a mechanism to disable this for WSL2 devices.

One option would be to add an AlignedAllocationSupported() bool function to the Devices and Device types. This could look something like:

// AlignedAllocationSupported checks whether all devices support an aligned allocation
func (ds Devices) AlignedAllocationSupported() bool {
    for _, d := range ds {
        if !d.AlignedAllocationSupported() {
            return false
        }
    }
    return true
}

// AlignedAllocationSupported checks whether the device supports an aligned allocation
func (d Device) AlignedAllocationSupported() bool {
    if d.IsMigDevice() {
        return false
    }

    // /dev/dxg is the single device node exposed on WSL2.
    for _, p := range d.Paths {
        if p == "/dev/dxg" {
            return false
        }
    }

    return true
}

(Note that this should still be discussed and could definitely be improved, but would be a good starting point).
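For illustration, a self-contained toy of how such a check could gate the preferred-allocation path might look like this; the Device/Devices types below only mirror the sketch above, and the real plugin and gpuallocator code paths differ:

package main

import "fmt"

type Device struct {
    ID    string
    Paths []string
}

type Devices []Device

// AlignedAllocationSupported is false as soon as one device is the WSL2
// /dev/dxg node, since the aligned-allocation policy relies on NVML topology
// queries that are unavailable there.
func (ds Devices) AlignedAllocationSupported() bool {
    for _, d := range ds {
        for _, p := range d.Paths {
            if p == "/dev/dxg" {
                return false
            }
        }
    }
    return true
}

// preferredAllocation falls back to a trivial "first N devices" choice when
// aligned allocation is not supported.
func preferredAllocation(available Devices, size int) Devices {
    if size > len(available) {
        size = len(available)
    }
    if !available.AlignedAllocationSupported() {
        return available[:size]
    }
    // On a native Linux node this is where gpuallocator's aligned policy
    // would be consulted; omitted in this toy.
    return available[:size]
}

func main() {
    devs := Devices{{ID: "GPU-0", Paths: []string{"/dev/dxg"}}}
    fmt.Println(preferredAllocation(devs, 1))
}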

achim92 commented 1 year ago

Hi @elezar,

I'm also interested in running the device plugin with WSL2. I have created an MR https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291

Would be great to get those changes in.

elezar commented 1 year ago

Thanks @achim92 -- I will have a look at the MR.

Note that with the v1.13.0 release of the NVIDIA Container Toolkit we now support the generation of CDI specifications on WSL2-based systems. Support for consuming these and generating a spec for the available devices was included in the v0.14.0 version of the device plugin. This was largely targeted at usage in the context of our GPU Operator, but could be generalised to also support WSL2-based systems without requiring additional device plugin changes.

leon96 commented 1 year ago

hi @elezar, Does v0.14.0 support adding GPU resources to Capacity and Allocatable? I'm using WSL2 + v0.14.0, and the device plugin logs are showing "No devices found. Waiting indefinitely."

I0515 07:23:12.247146       1 main.go:154] Starting FS watcher.
I0515 07:23:12.247248       1 main.go:161] Starting OS watcher.
I0515 07:23:12.248352       1 main.go:176] Starting Plugins.
I0515 07:23:12.248389       1 main.go:234] Loading configuration.
I0515 07:23:12.248530       1 main.go:242] Updating config with default resource matching patterns.
I0515 07:23:12.248786       1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0515 07:23:12.248816       1 main.go:256] Retreiving plugins.
I0515 07:23:12.251257       1 factory.go:107] Detected NVML platform: found NVML library
I0515 07:23:12.251330       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0515 07:23:12.270094       1 main.go:287] No devices found. Waiting indefinitely.
achim92 commented 1 year ago

Thanks @elezar,

That would be even better, without requiring additional device plugin changes.

I have generated a CDI spec with nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml:

cdiVersion: 0.3.0
containerEdits:
  hooks:
  - args:
    - nvidia-ctk
    - hook
    - create-symlinks
    - --link
    - /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvidia-smi::/usr/bin/nvidia-smi
    hookName: createContainer
    path: /usr/bin/nvidia-ctk
  - args:
    - nvidia-ctk
    - hook
    - update-ldcache
    - --folder
    - /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4
    - --folder
    - /usr/lib/wsl/lib
    hookName: createContainer
    path: /usr/bin/nvidia-ctk
  mounts:
  - containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml_loader.so
    hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml_loader.so
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/wsl/lib/libdxcore.so
    hostPath: /usr/lib/wsl/lib/libdxcore.so
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvcubins.bin
    hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvcubins.bin
    options:
    - ro
    - nosuid
    - bind
  - containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvidia-smi
    hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvidia-smi
    options:
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda.so.1.1
    hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda.so.1.1
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda_loader.so
    hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda_loader.so
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ptxjitcompiler.so.1
    hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ptxjitcompiler.so.1
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml.so.1
    hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml.so.1
    options:
    - ro
    - nosuid
    - nodev
    - bind
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/dxg
  name: all
kind: nvidia.com/gpu

I also removed the NVIDIA Container Runtime hook under /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json. How can I enable CDI to make it work? I'm using CRI-O as the container runtime, so CDI support should be enabled by default.

I0515 08:39:51.471150       1 main.go:154] Starting FS watcher.
I0515 08:39:51.471416       1 main.go:161] Starting OS watcher.
I0515 08:39:51.472727       1 main.go:176] Starting Plugins.
I0515 08:39:51.472771       1 main.go:234] Loading configuration.
I0515 08:39:51.473017       1 main.go:242] Updating config with default resource matching patterns.
I0515 08:39:51.473350       1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0515 08:39:51.473380       1 main.go:256] Retreiving plugins.
W0515 08:39:51.473833       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0515 08:39:51.474021       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0515 08:39:51.474878       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0515 08:39:51.474918       1 factory.go:115] Incompatible platform detected
E0515 08:39:51.474925       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0515 08:39:51.474930       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0515 08:39:51.474934       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0515 08:39:51.474937       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0515 08:39:51.474946       1 main.go:287] No devices found. Waiting indefinitely.
achim92 commented 1 year ago

@elezar could you please give some guidance here?

NikulausRui commented 1 year ago

hi @elezar, Does v0.14.0 support adding GPU resources to Capacity and Allocatable? I'm using WSL2 + v0.14.0, and the device plugin logs are showing "No devices found. Waiting indefinitely."

I0515 07:23:12.247146       1 main.go:154] Starting FS watcher.
I0515 07:23:12.247248       1 main.go:161] Starting OS watcher.
I0515 07:23:12.248352       1 main.go:176] Starting Plugins.
I0515 07:23:12.248389       1 main.go:234] Loading configuration.
I0515 07:23:12.248530       1 main.go:242] Updating config with default resource matching patterns.
I0515 07:23:12.248786       1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0515 07:23:12.248816       1 main.go:256] Retreiving plugins.
I0515 07:23:12.251257       1 factory.go:107] Detected NVML platform: found NVML library
I0515 07:23:12.251330       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0515 07:23:12.270094       1 main.go:287] No devices found. Waiting indefinitely.

Hi brother, I've encountered the same issue. Have you managed to solve it?

elezar commented 1 year ago

Note: We have https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 under review from @achim92 to allow the device plugin to work under WSL2. Testing of the changes there would be welcomed.

davidshen84 commented 1 year ago

Hi @elezar ,

How can I test your changes? Do I need to build a new image and install the plugin into my k8s cluster using https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml as a template?

Thanks

wizpresso-steve-cy-fan commented 1 year ago

@elezar We are also interested in this

wizpresso-steve-cy-fan commented 1 year ago

I believe registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 would be the right image, right?

davidshen84 commented 1 year ago

✔️ registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016

WSL environment

WSL version: 1.2.5.0
Kernel version: 5.15.90.1
WSLg version: 1.0.51
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.19044.3208

K8S Setup

$ k3s --version
k3s version v1.26.4+k3s1 (8d0255af)
go version go1.19.8

nvidia-smi output in WSL

Tue Jul 25 16:36:46 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.04              Driver Version: 536.25       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A2000 8GB Lap...    On  | 00000000:01:00.0 Off |                  N/A |
| N/A   46C    P8               3W /  40W |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

##deleted processes table##

nvidia-device-plugin daemonset pod log

I0725 06:26:03.108417       1 main.go:154] Starting FS watcher.
I0725 06:26:03.108468       1 main.go:161] Starting OS watcher.
I0725 06:26:03.108974       1 main.go:176] Starting Plugins.
I0725 06:26:03.108995       1 main.go:234] Loading configuration.
I0725 06:26:03.109063       1 main.go:242] Updating config with default resource matching patterns.
I0725 06:26:03.109205       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0725 06:26:03.109219       1 main.go:256] Retrieving plugins.
I0725 06:26:03.113336       1 factory.go:107] Detected NVML platform: found NVML library
I0725 06:26:03.113372       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0725 06:26:03.138677       1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0725 06:26:03.139033       1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0725 06:26:03.143248       1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet

Test GPU pod output

Used the example from https://docs.k3s.io/advanced#nvidia-container-runtime-support

Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
    -fullscreen       (run n-body simulation in fullscreen mode)
    -fp64             (use double precision floating point values for simulation)
    -hostmem          (stores simulation data in host memory)
    -benchmark        (run benchmark to measure performance) 
    -numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
    -device=<d>       (where d=0,1,2.... for the CUDA device to use)
    -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
    -compare          (compares simulation results running once on the default GPU and once on the CPU)
    -cpu              (run n-body simulation on the CPU)
    -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA RTX A2000 8GB Laptop GPU]
20480 bodies, total time for 10 iterations: 25.066 ms
= 167.327 billion interactions per second
= 3346.542 single-precision GFLOP/s at 20 flops per interaction
Stream closed EOF for default/nbody-gpu-benchmark (cuda-container)

Thank you @elezar . I hope this commit can be merged into this repo and published asap 🚀 !

wizpresso-steve-cy-fan commented 1 year ago

@davidshen84 I can also confirm it works. However, we have to add some additional stuff:

$ touch /run/nvidia/validations/toolkit-ready
$ touch /run/nvidia/validations/driver-ready
$ mkdir -p /run/nvidia/driver/dev
$ ln -s /run/nvidia/driver/dev/dxg /dev/dxg

Annotate the WSL node:

    nvidia.com/gpu-driver-upgrade-state: pod-restart-required
    nvidia.com/gpu.count: '1'
    nvidia.com/gpu.deploy.container-toolkit: 'true'
    nvidia.com/gpu.deploy.dcgm: 'true'
    nvidia.com/gpu.deploy.dcgm-exporter: 'true'
    nvidia.com/gpu.deploy.device-plugin: 'true'
    nvidia.com/gpu.deploy.driver: 'true'
    nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
    nvidia.com/gpu.deploy.node-status-exporter: 'true'
    nvidia.com/gpu.deploy.nvsm: ''
    nvidia.com/gpu.deploy.operands: 'true'
    nvidia.com/gpu.deploy.operator-validator: 'true'
    nvidia.com/gpu.present: 'true'
    nvidia.com/device-plugin.config: 'RTX-4070-Ti'

Change device plugin in ClusterPolicy:

  devicePlugin:
    config:
      name: time-slicing-config
    enabled: true
    env:
      - name: PASS_DEVICE_SPECS
        value: 'true'
      - name: FAIL_ON_INIT_ERROR
        value: 'true'
      - name: DEVICE_LIST_STRATEGY
        value: envvar
      - name: DEVICE_ID_STRATEGY
        value: uuid
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
    image: k8s-device-plugin
    imagePullPolicy: IfNotPresent
    repository: registry.gitlab.com/nvidia/kubernetes/device-plugin/staging
    version: 8b416016

It should work for now:


> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 8.9

> Compute 8.9 CUDA device: [NVIDIA GeForce RTX 4070 Ti]
61440 bodies, total time for 10 iterations: 34.665 ms
= 1088.943 billion interactions per second
= 21778.869 single-precision GFLOP/s at 20 flops per interaction
davidshen84 commented 1 year ago

I created the RuntimeClass resource and added the "runtimeClassName" property to the pods.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia

I did not add those properties you mentioned. Why do I need them?

Thanks

wizpresso-steve-cy-fan commented 1 year ago

@davidshen84 Because I used the gpu-operator for automatic GPU provisioning.

davidshen84 commented 1 year ago

Thanks for the tip!

msclock commented 1 year ago

I verified that the staging image registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 is indeed working on WSL2.

Based on dockerd

Step 1, install k3s cluster based on dockerd

curl -sfL https://get.k3s.io | sh -s - --docker

Step 2, install the device plugin with the staging image.

# set RuntimeClass
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: docker
EOF

# install nvdp
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --namespace nvdp \
    --create-namespace \
    --set=runtimeClassName=nvidia \
    --set=image.repository=registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin \
    --set=image.tag=8b416016

Based on containerd

Step 1, install k3s cluster based on containerd

curl -sfL https://get.k3s.io | sh -

Step 2, install the device plugin with the staging image.

# set RuntimeClass
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia # change the handler to `nvidia` for containerd
EOF

# install nvdp with the same steps as above.

Test with nvdp

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

The example cuda-sample-vectoradd workload runs normally. Waiting for the next working release on WSL2 😃

davidshen84 commented 1 year ago

Note: We have https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 under review from @achim92 to allow the device plugin to work under WSL2. Testing of the changes there would be welcomed.

Hi @elezar, I see this MR was merged in the upstream repository a while ago. What's the plan for publishing it on GitHub?

guhuajun commented 1 year ago

Hi @elezar,

I can confirm registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 is working for me, even though my GPU card is a Quadro P1000. :) I can move forward to test Koordinator.

itadmin@server:~/repos/k3s-on-wsl2$ cat /proc/version
Linux version 5.15.90.1-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Fri Jan 27 02:56:13 UTC 2023
itadmin@server:~/repos/k3s-on-wsl2$ sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Wed Aug 16 06:21:56 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.14   Driver Version: 528.86       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P1000        On   | 00000000:01:00.0  On |                  N/A |
| 34%   39C    P8    N/A /  47W |   1061MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A        23      G   /Xwayland                       N/A      |
+-----------------------------------------------------------------------------+
itadmin@server:~/repos/k3s-on-wsl2$ sudo kubectl -n kube-system logs nvidia-device-plugin-daemonset-q642m
I0816 06:20:28.927429       1 main.go:154] Starting FS watcher.
I0816 06:20:28.927534       1 main.go:161] Starting OS watcher.
I0816 06:20:28.927691       1 main.go:176] Starting Plugins.
I0816 06:20:28.927698       1 main.go:234] Loading configuration.
I0816 06:20:28.927762       1 main.go:242] Updating config with default resource matching patterns.
I0816 06:20:28.927936       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0816 06:20:28.927960       1 main.go:256] Retrieving plugins.
I0816 06:20:28.930313       1 factory.go:107] Detected NVML platform: found NVML library
I0816 06:20:28.930362       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0816 06:20:28.947623       1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0816 06:20:28.948059       1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0816 06:20:28.949737       1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
itadmin@server:~/repos/k3s-on-wsl2$ sudo kubectl get nodes -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Node
  metadata:
    annotations:
      etcd.k3s.cattle.io/node-address: 172.18.88.17
      etcd.k3s.cattle.io/node-name: server-d622491e
      flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"52:95:ba:16:e9:29"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 172.18.88.17
      k3s.io/node-args: '["server","--cluster-init","true","--etcd-expose-metrics","true","--disable","traefik","--disable-cloud-controller","true","--docker","true","--kubelet-arg","node-status-update-frequency=4s","--kube-controller-manager-arg","node-monitor-period=2s","--kube-controller-manager-arg","node-monitor-grace-period=16s","--kube-apiserver-arg","default-not-ready-toleration-seconds=20","--kube-apiserver-arg","default-unreachable-toleration-seconds=20","--write-kubeconfig","/home/itadmin/.kube/config","--private-registry","/etc/rancher/k3s/registry.yaml","--flannel-iface","eth0","--bind-address","172.18.88.17","--https-listen-port","6443","--advertise-address","172.18.88.17","--log","/var/log/k3s-server.log"]'
      k3s.io/node-config-hash: IDWWDZRIJO5DHZKGYYHONVZC2DN7TK7THKPSONCFR74ST4LAGNGQ====
      k3s.io/node-env: '{"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/c26e7571d760c5f199d18efd197114f1ca4ab1e6ffe494f96feb65c87fcb8cf0"}'
      node.alpha.kubernetes.io/ttl: "0"
      volumes.kubernetes.io/controller-managed-attach-detach: "true"
    creationTimestamp: "2023-08-16T05:47:03Z"
    finalizers:
    - wrangler.cattle.io/managed-etcd-controller
    - wrangler.cattle.io/node
    labels:
      beta.kubernetes.io/arch: amd64
      beta.kubernetes.io/os: linux
      kubernetes.io/arch: amd64
      kubernetes.io/hostname: server
      kubernetes.io/os: linux
      node-role.kubernetes.io/control-plane: "true"
      node-role.kubernetes.io/etcd: "true"
      node-role.kubernetes.io/master: "true"
    name: server
    resourceVersion: "8151"
    uid: 04b6a572-830c-4102-a9a9-15265e4f6a15
  spec:
    podCIDR: 10.42.0.0/24
    podCIDRs:
    - 10.42.0.0/24
  status:
    addresses:
    - address: 172.18.88.17
      type: InternalIP
    - address: server
      type: Hostname
    allocatable:
      cpu: "4"
      ephemeral-storage: "1027046117185"
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      memory: 32760580Ki
      nvidia.com/gpu: "1"
      pods: "110"
    capacity:
      cpu: "4"
      ephemeral-storage: 1055762868Ki
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      memory: 32760580Ki
      nvidia.com/gpu: "1"
      pods: "110"
    conditions:
    - lastHeartbeatTime: "2023-08-16T06:20:34Z"
      lastTransitionTime: "2023-08-16T05:47:03Z"
      message: kubelet has sufficient memory available
      reason: KubeletHasSufficientMemory
      status: "False"
      type: MemoryPressure
    - lastHeartbeatTime: "2023-08-16T06:20:34Z"
      lastTransitionTime: "2023-08-16T05:47:03Z"
      message: kubelet has no disk pressure
      reason: KubeletHasNoDiskPressure
      status: "False"
      type: DiskPressure
    - lastHeartbeatTime: "2023-08-16T06:20:34Z"
      lastTransitionTime: "2023-08-16T05:47:03Z"
      message: kubelet has sufficient PID available
      reason: KubeletHasSufficientPID
      status: "False"
      type: PIDPressure
    - lastHeartbeatTime: "2023-08-16T06:20:34Z"
      lastTransitionTime: "2023-08-16T05:47:07Z"
      message: kubelet is posting ready status
      reason: KubeletReady
      status: "True"
      type: Ready
    daemonEndpoints:
      kubeletEndpoint:
        Port: 10250
    images:
    - names:
      - nvcr.io/nvidia/tensorflow@sha256:7b74f2403f62032db8205cf228052b105bd94f2871e27c1f144c5145e6072984
      - nvcr.io/nvidia/tensorflow:20.03-tf2-py3
      sizeBytes: 7440987700
    - names:
      - 192.168.0.96:5000/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin@sha256:35ef4e7f7070e9ec0c9d9f9658200ce2dd61b53a436368e8ea45ec02ced78559
      - 192.168.0.96:5000/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016
      sizeBytes: 298298015
    - names:
      - 192.168.0.96:5000/nvidia/k8s-device-plugin@sha256:68fa1607030680a5430ee02cf4fdce040c99436d680ae24ba81ef5bbf4409e8e
      - nvcr.io/nvidia/k8s-device-plugin@sha256:15c4280d13a61df703b12d1fd1b5b5eec4658157db3cb4b851d3259502310136
      - 192.168.0.96:5000/nvidia/k8s-device-plugin:v0.14.1
      - nvcr.io/nvidia/k8s-device-plugin:v0.14.1
      sizeBytes: 298277535
    - names:
      - nvidia/cuda@sha256:4b0c83c0f2e66dc97b52f28c7acf94c1461bfa746d56a6f63c0fef5035590429
      - nvidia/cuda:11.6.2-base-ubuntu20.04
      sizeBytes: 153991389
    - names:
      - rancher/mirrored-metrics-server@sha256:16185c0d4d01f8919eca4779c69a374c184200cd9e6eded9ba53052fd73578df
      - rancher/mirrored-metrics-server:v0.6.2
      sizeBytes: 68892890
    - names:
      - rancher/mirrored-coredns-coredns@sha256:823626055cba80e2ad6ff26e18df206c7f26964c7cd81a8ef57b4dc16c0eec61
      - rancher/mirrored-coredns-coredns:1.9.4
      sizeBytes: 49802873
    - names:
      - rancher/local-path-provisioner@sha256:db1a3225290dd8be481a1965fc7040954d0aa0e1f86a77c92816d7c62a02ae5c
      - rancher/local-path-provisioner:v0.0.23
      sizeBytes: 37443889
    - names:
      - rancher/mirrored-pause@sha256:74c4244427b7312c5b901fe0f67cbc53683d06f4f24c6faee65d4182bf0fa893
      - rancher/mirrored-pause:3.6
      sizeBytes: 682696
    nodeInfo:
      architecture: amd64
      bootID: de2732a0-17d9-4272-a205-7b9ac1103e2b
      containerRuntimeVersion: docker://20.10.25
      kernelVersion: 5.15.90.1-microsoft-standard-WSL2
      kubeProxyVersion: v1.26.3+k3s1
      kubeletVersion: v1.26.3+k3s1
      machineID: 53da58bf9ac14c33847a4b6e1269419b
      operatingSystem: linux
      osImage: Ubuntu 22.04.3 LTS
      systemUUID: 53da58bf9ac14c33847a4b6e1269419b
kind: List
metadata:
  resourceVersion: ""
alexeadem commented 10 months ago

Tested and documented in qbo with:

https://docs.qbo.io/#/ai_and_ml?id=kubeflow

Thanks to @achim92's contribution and @elezar's approval :)

Please note that on Linux the default Helm chart works in qbo and kind, so there is no need for this.

This fix also works for kind Kubernetes, using accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml

and

 extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all

For more details, see:

https://github.com/kubernetes-sigs/kind/pull/3257#issuecomment-1607287275

A couple of notes for the gpu-operator

Labels

The NVIDIA GPU Operator requires a manual label, feature.node.kubernetes.io/pci-10de.present=true, for node-feature-discovery to add all the labels the GPU Operator needs in order to work. This applies only to kind and qbo; I'm not sure why k8s requires more labels, as indicated here: https://github.com/NVIDIA/k8s-device-plugin/issues/332#issuecomment-1649468259

The label can be added as follows:

for i in $(kubectl get no --selector '!node-role.kubernetes.io/control-plane' -o json | jq -r '.items[].metadata.name'); do
        kubectl label node $i feature.node.kubernetes.io/pci-10de.present=true
done

The reason is that WSL2 doesn't contain PCI info under /sys, so node-feature-discovery is unable to detect the GPU.

I believe the relevant code is here: node-feature-discovery/source/usb/utils.go:106

I believe node-feature-discovery expects something like the output below in order to build the 10de label:

lspci -nn |grep -i  nvidia
0000:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile / Max-Q] [10de:2560] (rev a1)
0000:01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:228e] (rev a1)
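As a small illustration of what that detection amounts to (a sketch based on assumptions, not node-feature-discovery's actual code), scanning sysfs for NVIDIA's PCI vendor ID 0x10de would look roughly like this; on WSL2 the directory is empty, so no pci-10de label is ever produced:

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

func main() {
    const pciRoot = "/sys/bus/pci/devices"

    entries, err := os.ReadDir(pciRoot)
    if err != nil || len(entries) == 0 {
        fmt.Println("no PCI devices visible under", pciRoot, "(typical for WSL2)")
        return
    }
    for _, e := range entries {
        vendor, err := os.ReadFile(filepath.Join(pciRoot, e.Name(), "vendor"))
        if err != nil {
            continue
        }
        if strings.TrimSpace(string(vendor)) == "0x10de" {
            // This is the condition that would justify the
            // feature.node.kubernetes.io/pci-10de.present=true label.
            fmt.Println("NVIDIA PCI device found:", e.Name())
        }
    }
}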

I believe the right place to add this label is once the driver has been detected on the host. See here:

https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/881/diffs

I'll add my comments there.

Docker Image for device-plugin

I built a new image based on https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 for testing purposes, but this also works with the image provided here: https://github.com/NVIDIA/k8s-device-plugin/issues/332#issuecomment-1649010456

git branch
* device-plugin-wsl2

device-plugin docker image

helm chart templates

Docker Image for gpu-operator

I created a docker image with changes similar to this:

https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/881/diffs

gpu-operator docker image

Docker Image for gpu-operator-validator

gpu-operator-validator image

Blogs on how to install: Nvidia GPU Operator + Kubeflow + Docker in Docker + cgroups v2 (In Linux and Windows WSL2)

Blog part 1

Blog part 2

pbasov commented 10 months ago

Thank you for working on this. Now that WSL2 supports systemd, I think more people will be running k8s on Windows. I can confirm registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 works on a kubeadm-deployed cluster with driver version 551.23 and a 2080 Ti.

elezar commented 10 months ago

Just a general note: we will release v0.15.0-rc.1 of the GPU device plugin in the next week or so, including these changes. That should then allow us to get more concrete feedback on the released version instead of relying on the SHA-tagged image.

alexeadem commented 9 months ago

Just a general note: we will release v0.15.0-rc.1 of the GPU device plugin in the next week or so, including these changes. That should then allow us to get more concrete feedback on the released version instead of relying on the SHA-tagged image.

Hi @elezar, any update on when v0.15.0-rc.1 is going to be out?

mrjohnsonalexander commented 5 months ago

v0.15.0-rc1 successfully enabled my scenario today: https://github.com/mrjohnsonalexander/classic

TL;DR Stack notes