zengzhengrong opened this issue 2 years ago
You seem to be running the device plugin under WSL2. This is not currently a supported use case of the device plugin. The specific reason is that device nodes on WSL2 and Linux systems are not the same and as such the CPU Manager Workaround (which includes the device nodes in the container being launched) does not work as expected.
All right. I followed this guide, https://docs.nvidia.com/cuda/wsl-user-guide/index.html#getting-started-with-cuda-on-wsl, to install CUDA on WSL. Looking at the known limitations, https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-apps, there is nothing that prevents installing k8s on WSL, and running a GPU workload with the ctr command works just as well.
@elezar would you guys put this on the roadmap? Our company is running Windows but we want to transition to Linux, so WSL2 seems like a natural choice. We run deep learning workloads that require CUDA support, and while Docker Desktop does support GPU workloads, it would be strange for this not to work in normal WSL2 containers as well.
Hi @elezar , in case it's unlikely to appear on the roadmap soon, could you please describe a rough plan of how the support should be added? And whether executing the plan would be doable by outside contributors? Thanks!
@patrykkaj I think that in theory this could be done by outside contributors and is simplified by the recent changes to support Tegra-based systems. What I can see happening here is that we detect a WSL2 system by checking for the WSL2-specific driver library (dxcore.so.1). Some things to note here: on WSL2 the GPU is exposed through the /dev/dxg device node and not /dev/nvidia* device nodes.
If you feel comfortable creating an MR against https://gitlab.com/nvidia/kubernetes/device-plugin that adds this functionality, we can work together on getting it in.
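A minimal sketch of such a detection check, using only the paths mentioned above (the isWSL2 helper and package layout are made up for illustration; this is not the plugin's actual detection code):

package main

import (
	"fmt"
	"os"
)

// isWSL2 reports whether the node looks like a WSL2 system by checking for
// the /dev/dxg device node. Checking that the WSL2 driver library
// (dxcore.so.1 / libdxcore.so) can be located would be an alternative or
// additional signal.
func isWSL2() bool {
	_, err := os.Stat("/dev/dxg")
	return err == nil
}

func main() {
	fmt.Println("WSL2 detected:", isWSL2())
}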
Hello,
I was interested in this, and I adapted the plugin to work.
I pushed my version to GitLab (https://gitlab.com/Vinrobot/nvidia-kubernetes-device-plugin/-/tree/features/wsl2) and it works on my machine.
I also had to modify NVIDIA/gpu-monitoring-tools (https://github.com/Vinrobot/nvidia-gpu-monitoring-tools/tree/features/wsl2) to also use /dev/dxg.
I can try to do a clean version, but I don't really know how to correctly check whether /dev/dxg is an NVIDIA GPU or an incompatible device. Does someone have a good idea?
@Vinrobot thanks for the work here. Some thoughts on this:
We recently moved away from nvidia-gpu-monitoring-tools and use bindings from go-nvml through go-nvlib instead.
I think the steps outlined in https://github.com/NVIDIA/k8s-device-plugin/issues/332#issuecomment-1262288033 should be considered as the starting point. Check if dxcore.so.1 is available and, if it is, assume a WSL2 system (one could also check for the existence of /dev/dxg here). In this case, create a wslDevice type that implements the deviceInfo interface and ensure that this gets instantiated when enumerating devices. This can then return 0 for the minor number and return the correct path.
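To make this concrete, a rough sketch of what such a type could look like (the deviceInfo method set shown here is an assumption for illustration; the real interface in the plugin may have a different shape):

package main

import "fmt"

// deviceInfo is a stand-in for the plugin's internal device abstraction.
// Its method set is assumed for this sketch.
type deviceInfo interface {
	GetUUID() (string, error)
	GetPaths() ([]string, error)
	GetMinorNumber() (int, error)
}

// wslDevice represents a GPU exposed through the WSL2 /dev/dxg device node.
type wslDevice struct {
	uuid string
}

// GetUUID returns the UUID reported for this device.
func (d wslDevice) GetUUID() (string, error) { return d.uuid, nil }

// GetPaths returns the WSL2 device node rather than /dev/nvidia*.
func (d wslDevice) GetPaths() ([]string, error) { return []string{"/dev/dxg"}, nil }

// GetMinorNumber returns 0, since per-GPU minor numbers do not apply on WSL2.
func (d wslDevice) GetMinorNumber() (int, error) { return 0, nil }

func main() {
	var d deviceInfo = wslDevice{uuid: "GPU-00000000-0000-0000-0000-000000000000"}
	paths, _ := d.GetPaths()
	fmt.Println(paths)
}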
With regards to the following:
I can try to do a clean version, but I don't really know how to correctly check if /dev/dxg is a NVIDIA GPU or an incompatible device, does someone have a good idea?
I don't think that this is required. If there are no NVIDIA GPUs available on the system then the NVML enumeration that is used to list the devices would not be expected to work. This should already be handled by the lower-level components of the NVIDIA container stack.
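For reference, the NVML enumeration in question is roughly the following (a minimal sketch that uses the go-nvml bindings directly; the plugin itself goes through go-nvlib, so the real code looks different):

package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	// If no NVIDIA GPU (or driver library) is available, Init fails and
	// there is nothing to advertise to the kubelet.
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("could not initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("could not get device count: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		device, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			log.Fatalf("could not get device %d: %v", i, nvml.ErrorString(ret))
		}
		uuid, _ := device.GetUUID()
		fmt.Printf("device %d: %s\n", i, uuid)
	}
}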
Hi @elezar, Thanks for the feedback.
I tried to make it work with the most recent version, but I got this error (on the pod)
Warning UnexpectedAdmissionError 30s kubelet Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to retrieve list of available devices: error creating nvml.Device 0: unsupported GPU device, which is unexpected
This is caused by this line in gpu-monitoring-tools (still used by gpuallocator).
As it's the same issue as before, I can re-use my custom version of gpu-monitoring-tools to make it work, but that's not the goal. Anyway, I will look into it tomorrow.
@Vinrobot yes, it is an issue that gpuallocator still uses gpu-monitoring-tools. It is on our roadmap to port it to the go-nvml bindings, but this is not yet complete.
The issue is the call to get an aligned allocation here. (You can confirm this by removing this section.)
If this does work, what we would need is a mechanism to disable this for WSL2 devices.
One option would be to add an AlignedAllocationSupported() bool function to the Devices and Device types. This could look something like:
// AlignedAllocationSupported checks whether all devices support an aligned allocation
func (ds Devices) AlignedAllocationSupported() bool {
	for _, d := range ds {
		if !d.AlignedAllocationSupported() {
			return false
		}
	}
	return true
}

// AlignedAllocationSupported checks whether the device supports an aligned allocation
func (d Device) AlignedAllocationSupported() bool {
	if d.IsMigDevice() {
		return false
	}
	for _, p := range d.Paths {
		if p == "/dev/dxg" {
			return false
		}
	}
	return true
}
(Note that this should still be discussed and could definitely be improved, but would be a good starting point).
Hi @elezar,
I'm also interested in running the device plugin with WSL2. I have created an MR https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291
Would be great to get those changes in.
Thanks @achim92 -- I will have a look at the MR.
Note that with the v1.13.0 release of the NVIDIA Container Toolkit we now support the generation of CDI specifications on WSL2 based systems. Support for consuming this and generating a spec for available devices was included in the v0.14.0 version of the device plugin. This was largely targeted at usage in the context of our GPU operator, but could be generalised to also support WSL2-based systems without requiring additional device plugin changes.
hi @elezar, Does v0.14.0 support adding GPU resources to Capacity and Allocatable? I'm using WSL2 + v0.14.0, and the device plugin logs are showing "No devices found. Waiting indefinitely."
I0515 07:23:12.247146 1 main.go:154] Starting FS watcher.
I0515 07:23:12.247248 1 main.go:161] Starting OS watcher.
I0515 07:23:12.248352 1 main.go:176] Starting Plugins.
I0515 07:23:12.248389 1 main.go:234] Loading configuration.
I0515 07:23:12.248530 1 main.go:242] Updating config with default resource matching patterns.
I0515 07:23:12.248786 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0515 07:23:12.248816 1 main.go:256] Retreiving plugins.
I0515 07:23:12.251257 1 factory.go:107] Detected NVML platform: found NVML library
I0515 07:23:12.251330 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0515 07:23:12.270094 1 main.go:287] No devices found. Waiting indefinitely.
Thanks @elezar, that would be even better without requiring additional device plugin changes.
I have generated the CDI spec with nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml:
cdiVersion: 0.3.0
containerEdits:
hooks:
- args:
- nvidia-ctk
- hook
- create-symlinks
- --link
- /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvidia-smi::/usr/bin/nvidia-smi
hookName: createContainer
path: /usr/bin/nvidia-ctk
- args:
- nvidia-ctk
- hook
- update-ldcache
- --folder
- /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4
- --folder
- /usr/lib/wsl/lib
hookName: createContainer
path: /usr/bin/nvidia-ctk
mounts:
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml_loader.so
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml_loader.so
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/lib/libdxcore.so
hostPath: /usr/lib/wsl/lib/libdxcore.so
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvcubins.bin
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvcubins.bin
options:
- ro
- nosuid
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvidia-smi
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvidia-smi
options:
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda.so.1.1
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda.so.1.1
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda_loader.so
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda_loader.so
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ptxjitcompiler.so.1
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ptxjitcompiler.so.1
options:
- ro
- nosuid
- nodev
- bind
- containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml.so.1
hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml.so.1
options:
- ro
- nosuid
- nodev
- bind
devices:
- containerEdits:
deviceNodes:
- path: /dev/dxg
name: all
kind: nvidia.com/gpu
I also removed the NVIDIA Container Runtime hook under /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json.
How can I enable CDI to make it work? I'm using cri-o as container runtime, so CDI support should be enabled by default.
I0515 08:39:51.471150 1 main.go:154] Starting FS watcher.
I0515 08:39:51.471416 1 main.go:161] Starting OS watcher.
I0515 08:39:51.472727 1 main.go:176] Starting Plugins.
I0515 08:39:51.472771 1 main.go:234] Loading configuration.
I0515 08:39:51.473017 1 main.go:242] Updating config with default resource matching patterns.
I0515 08:39:51.473350 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0515 08:39:51.473380 1 main.go:256] Retreiving plugins.
W0515 08:39:51.473833 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0515 08:39:51.474021 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0515 08:39:51.474878 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0515 08:39:51.474918 1 factory.go:115] Incompatible platform detected
E0515 08:39:51.474925 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0515 08:39:51.474930 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0515 08:39:51.474934 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0515 08:39:51.474937 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0515 08:39:51.474946 1 main.go:287] No devices found. Waiting indefinitely.
@elezar could you please give some guidance here?
hi @elezar, Does v0.14.0 support adding GPU resources to Capacity and Allocatable? I'm using WSL2 + v0.14.0, and the device plugin logs are showing "No devices found. Waiting indefinitely."
Hi brother, I've encountered the same issue. Have you managed to solve it?
Note: We have https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 under review from @achim92 to allow the device plugin to work under WSL2. Testing of the changes there would be welcomed.
Hi @elezar ,
How can I test your changes? Do I need to create a new image and install the plugin into my k8s cluster using https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml as a template?
Thanks
@elezar We are also interested in this. I believe registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 would be the right image, right?
✔️ registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016
WSL version: 1.2.5.0
Kernel version: 5.15.90.1
WSLg version: 1.0.51
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.19044.3208
$ k3s --version
k3s version v1.26.4+k3s1 (8d0255af)
go version go1.19.8
Tue Jul 25 16:36:46 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.04 Driver Version: 536.25 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A2000 8GB Lap... On | 00000000:01:00.0 Off | N/A |
| N/A 46C P8 3W / 40W | 0MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
##deleted processes table##
I0725 06:26:03.108417 1 main.go:154] Starting FS watcher.
I0725 06:26:03.108468 1 main.go:161] Starting OS watcher.
I0725 06:26:03.108974 1 main.go:176] Starting Plugins.
I0725 06:26:03.108995 1 main.go:234] Loading configuration.
I0725 06:26:03.109063 1 main.go:242] Updating config with default resource matching patterns.
I0725 06:26:03.109205 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0725 06:26:03.109219 1 main.go:256] Retrieving plugins.
I0725 06:26:03.113336 1 factory.go:107] Detected NVML platform: found NVML library
I0725 06:26:03.113372 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0725 06:26:03.138677 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0725 06:26:03.139033 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0725 06:26:03.143248 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
Used the example from https://docs.k3s.io/advanced#nvidia-container-runtime-support
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6
> Compute 8.6 CUDA device: [NVIDIA RTX A2000 8GB Laptop GPU]
20480 bodies, total time for 10 iterations: 25.066 ms
= 167.327 billion interactions per second
= 3346.542 single-precision GFLOP/s at 20 flops per interaction
Stream closed EOF for default/nbody-gpu-benchmark (cuda-container)
Thank you @elezar . I hope this commit can be merged into this repo and published asap 🚀 !
@davidshen84 I can also confirm it works. However, we have to add some additional stuff:
$ touch /run/nvidia/validations/toolkit-ready
$ touch /run/nvidia/validations/driver-ready
$ mkdir -p /run/nvidia/driver/dev
$ ln -s /run/nvidia/driver/dev/dxg /dev/dxg
Annotate the WSL node:
nvidia.com/gpu-driver-upgrade-state: pod-restart-required
nvidia.com/gpu.count: '1'
nvidia.com/gpu.deploy.container-toolkit: 'true'
nvidia.com/gpu.deploy.dcgm: 'true'
nvidia.com/gpu.deploy.dcgm-exporter: 'true'
nvidia.com/gpu.deploy.device-plugin: 'true'
nvidia.com/gpu.deploy.driver: 'true'
nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
nvidia.com/gpu.deploy.node-status-exporter: 'true'
nvidia.com/gpu.deploy.nvsm: ''
nvidia.com/gpu.deploy.operands: 'true'
nvidia.com/gpu.deploy.operator-validator: 'true'
nvidia.com/gpu.present: 'true'
nvidia.com/device-plugin.config: 'RTX-4070-Ti'
Change device plugin in ClusterPolicy:
devicePlugin:
config:
name: time-slicing-config
enabled: true
env:
- name: PASS_DEVICE_SPECS
value: 'true'
- name: FAIL_ON_INIT_ERROR
value: 'true'
- name: DEVICE_LIST_STRATEGY
value: envvar
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
image: k8s-device-plugin
imagePullPolicy: IfNotPresent
repository: registry.gitlab.com/nvidia/kubernetes/device-plugin/staging
version: 8b416016
It should work for now:
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined. Default to use Ampere
GPU Device 0: "Ampere" with compute capability 8.9
> Compute 8.9 CUDA device: [NVIDIA GeForce RTX 4070 Ti]
61440 bodies, total time for 10 iterations: 34.665 ms
= 1088.943 billion interactions per second
= 21778.869 single-precision GFLOP/s at 20 flops per interaction
I created the RuntimeClass resource and added the runtimeClassName property to the pods.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia
handler: nvidia
I did not add those properties you mentioned. Why do I need them?
Thanks
@davidshen84 Because I used the gpu-operator for automatic GPU provision
Thanks for the tip!
I verified that the staging image registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 is truly working on WSL2.
Step 1: install a k3s cluster based on dockerd.
curl -sfL https://get.k3s.io | sh -s - --docker
Step 2: install the device plugin with the staging image.
# set RuntimeClass
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia
handler: docker
EOF
# install nvdp
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--namespace nvdp \
--create-namespace \
--set=runtimeClassName=nvidia \
--set=image.repository=registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin \
--set=image.tag=8b416016
Step 1: install a k3s cluster based on containerd.
curl -sfL https://get.k3s.io | sh -
Step 2: install the device plugin with the staging image.
# set RuntimeClass
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia
handler: nvidia # change the handler to `nvidia` for containerd
EOF
# install nvdp with the same steps as above.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
restartPolicy: Never
runtimeClassName: nvidia
containers:
- name: cuda-container
image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
resources:
limits:
nvidia.com/gpu: 1 # requesting 1 GPU
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
EOF
And the cuda-sample-vectoradd example works normally. Waiting for the next working release on WSL2 😃😃
Note: We have https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 under review from @achim92 to allow the device plugin to work under WSL2. Testing of the changes there would be welcomed.
Hi @elezar, I see this PR was merged in the upstream repository a while ago. What's the plan for publishing it on GitHub?
Hi @elezar,
I can confirm registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 is working for me, even though my GPU card is a Quadro P1000. :) I can move forward to test Koordinator.
itadmin@server:~/repos/k3s-on-wsl2$ cat /proc/version
Linux version 5.15.90.1-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Fri Jan 27 02:56:13 UTC 2023
itadmin@server:~/repos/k3s-on-wsl2$ sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Wed Aug 16 06:21:56 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.14 Driver Version: 528.86 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P1000 On | 00000000:01:00.0 On | N/A |
| 34% 39C P8 N/A / 47W | 1061MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 23 G /Xwayland N/A |
+-----------------------------------------------------------------------------+
itadmin@server:~/repos/k3s-on-wsl2$ sudo kubectl -n kube-system logs nvidia-device-plugin-daemonset-q642m
I0816 06:20:28.927429 1 main.go:154] Starting FS watcher.
I0816 06:20:28.927534 1 main.go:161] Starting OS watcher.
I0816 06:20:28.927691 1 main.go:176] Starting Plugins.
I0816 06:20:28.927698 1 main.go:234] Loading configuration.
I0816 06:20:28.927762 1 main.go:242] Updating config with default resource matching patterns.
I0816 06:20:28.927936 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0816 06:20:28.927960 1 main.go:256] Retrieving plugins.
I0816 06:20:28.930313 1 factory.go:107] Detected NVML platform: found NVML library
I0816 06:20:28.930362 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0816 06:20:28.947623 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0816 06:20:28.948059 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0816 06:20:28.949737 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
itadmin@server:~/repos/k3s-on-wsl2$ sudo kubectl get nodes -o yaml
apiVersion: v1
items:
- apiVersion: v1
kind: Node
metadata:
annotations:
etcd.k3s.cattle.io/node-address: 172.18.88.17
etcd.k3s.cattle.io/node-name: server-d622491e
flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"52:95:ba:16:e9:29"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 172.18.88.17
k3s.io/node-args: '["server","--cluster-init","true","--etcd-expose-metrics","true","--disable","traefik","--disable-cloud-controller","true","--docker","true","--kubelet-arg","node-status-update-frequency=4s","--kube-controller-manager-arg","node-monitor-period=2s","--kube-controller-manager-arg","node-monitor-grace-period=16s","--kube-apiserver-arg","default-not-ready-toleration-seconds=20","--kube-apiserver-arg","default-unreachable-toleration-seconds=20","--write-kubeconfig","/home/itadmin/.kube/config","--private-registry","/etc/rancher/k3s/registry.yaml","--flannel-iface","eth0","--bind-address","172.18.88.17","--https-listen-port","6443","--advertise-address","172.18.88.17","--log","/var/log/k3s-server.log"]'
k3s.io/node-config-hash: IDWWDZRIJO5DHZKGYYHONVZC2DN7TK7THKPSONCFR74ST4LAGNGQ====
k3s.io/node-env: '{"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/c26e7571d760c5f199d18efd197114f1ca4ab1e6ffe494f96feb65c87fcb8cf0"}'
node.alpha.kubernetes.io/ttl: "0"
volumes.kubernetes.io/controller-managed-attach-detach: "true"
creationTimestamp: "2023-08-16T05:47:03Z"
finalizers:
- wrangler.cattle.io/managed-etcd-controller
- wrangler.cattle.io/node
labels:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/os: linux
kubernetes.io/arch: amd64
kubernetes.io/hostname: server
kubernetes.io/os: linux
node-role.kubernetes.io/control-plane: "true"
node-role.kubernetes.io/etcd: "true"
node-role.kubernetes.io/master: "true"
name: server
resourceVersion: "8151"
uid: 04b6a572-830c-4102-a9a9-15265e4f6a15
spec:
podCIDR: 10.42.0.0/24
podCIDRs:
- 10.42.0.0/24
status:
addresses:
- address: 172.18.88.17
type: InternalIP
- address: server
type: Hostname
allocatable:
cpu: "4"
ephemeral-storage: "1027046117185"
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 32760580Ki
nvidia.com/gpu: "1"
pods: "110"
capacity:
cpu: "4"
ephemeral-storage: 1055762868Ki
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 32760580Ki
nvidia.com/gpu: "1"
pods: "110"
conditions:
- lastHeartbeatTime: "2023-08-16T06:20:34Z"
lastTransitionTime: "2023-08-16T05:47:03Z"
message: kubelet has sufficient memory available
reason: KubeletHasSufficientMemory
status: "False"
type: MemoryPressure
- lastHeartbeatTime: "2023-08-16T06:20:34Z"
lastTransitionTime: "2023-08-16T05:47:03Z"
message: kubelet has no disk pressure
reason: KubeletHasNoDiskPressure
status: "False"
type: DiskPressure
- lastHeartbeatTime: "2023-08-16T06:20:34Z"
lastTransitionTime: "2023-08-16T05:47:03Z"
message: kubelet has sufficient PID available
reason: KubeletHasSufficientPID
status: "False"
type: PIDPressure
- lastHeartbeatTime: "2023-08-16T06:20:34Z"
lastTransitionTime: "2023-08-16T05:47:07Z"
message: kubelet is posting ready status
reason: KubeletReady
status: "True"
type: Ready
daemonEndpoints:
kubeletEndpoint:
Port: 10250
images:
- names:
- nvcr.io/nvidia/tensorflow@sha256:7b74f2403f62032db8205cf228052b105bd94f2871e27c1f144c5145e6072984
- nvcr.io/nvidia/tensorflow:20.03-tf2-py3
sizeBytes: 7440987700
- names:
- 192.168.0.96:5000/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin@sha256:35ef4e7f7070e9ec0c9d9f9658200ce2dd61b53a436368e8ea45ec02ced78559
- 192.168.0.96:5000/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016
sizeBytes: 298298015
- names:
- 192.168.0.96:5000/nvidia/k8s-device-plugin@sha256:68fa1607030680a5430ee02cf4fdce040c99436d680ae24ba81ef5bbf4409e8e
- nvcr.io/nvidia/k8s-device-plugin@sha256:15c4280d13a61df703b12d1fd1b5b5eec4658157db3cb4b851d3259502310136
- 192.168.0.96:5000/nvidia/k8s-device-plugin:v0.14.1
- nvcr.io/nvidia/k8s-device-plugin:v0.14.1
sizeBytes: 298277535
- names:
- nvidia/cuda@sha256:4b0c83c0f2e66dc97b52f28c7acf94c1461bfa746d56a6f63c0fef5035590429
- nvidia/cuda:11.6.2-base-ubuntu20.04
sizeBytes: 153991389
- names:
- rancher/mirrored-metrics-server@sha256:16185c0d4d01f8919eca4779c69a374c184200cd9e6eded9ba53052fd73578df
- rancher/mirrored-metrics-server:v0.6.2
sizeBytes: 68892890
- names:
- rancher/mirrored-coredns-coredns@sha256:823626055cba80e2ad6ff26e18df206c7f26964c7cd81a8ef57b4dc16c0eec61
- rancher/mirrored-coredns-coredns:1.9.4
sizeBytes: 49802873
- names:
- rancher/local-path-provisioner@sha256:db1a3225290dd8be481a1965fc7040954d0aa0e1f86a77c92816d7c62a02ae5c
- rancher/local-path-provisioner:v0.0.23
sizeBytes: 37443889
- names:
- rancher/mirrored-pause@sha256:74c4244427b7312c5b901fe0f67cbc53683d06f4f24c6faee65d4182bf0fa893
- rancher/mirrored-pause:3.6
sizeBytes: 682696
nodeInfo:
architecture: amd64
bootID: de2732a0-17d9-4272-a205-7b9ac1103e2b
containerRuntimeVersion: docker://20.10.25
kernelVersion: 5.15.90.1-microsoft-standard-WSL2
kubeProxyVersion: v1.26.3+k3s1
kubeletVersion: v1.26.3+k3s1
machineID: 53da58bf9ac14c33847a4b6e1269419b
operatingSystem: linux
osImage: Ubuntu 22.04.3 LTS
systemUUID: 53da58bf9ac14c33847a4b6e1269419b
kind: List
metadata:
resourceVersion: ""
Tested and documented in qbo with: https://docs.qbo.io/#/ai_and_ml?id=kubeflow
Thanks to @achim92's contribution and @elezar's approval :)
Please note that on Linux the default Helm chart works in qbo and kind, so there is no need for this.
This fix also works for kind Kubernetes using accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml and:
extraMounts:
- hostPath: /dev/null
containerPath: /var/run/nvidia-container-devices/all
For more details, see:
https://github.com/kubernetes-sigs/kind/pull/3257#issuecomment-1607287275
The NVIDIA GPU operator requires a manual label, feature.node.kubernetes.io/pci-10de.present=true, for node-feature-discovery to add all the labels needed for the GPU operator to work. This applies only to kind and qbo; I am not sure why k8s requires more labels, as indicated here: https://github.com/NVIDIA/k8s-device-plugin/issues/332#issuecomment-1649468259
The label can be added as follows:
for i in $(kubectl get no --selector '!node-role.kubernetes.io/control-plane' -o json | jq -r '.items[].metadata.name'); do
kubectl label node $i feature.node.kubernetes.io/pci-10de.present=true
done
The reason is that WSL2 doesn't contain PCI info under /sys, so node-feature-discovery is unable to detect the GPU.
I believe the relevant code is here: node-feature-discovery/source/usb/utils.go:106
I believe node-feature-discovery is expecting something like the output below to build the 10de label:
lspci -nn |grep -i nvidia
0000:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile / Max-Q] [10de:2560] (rev a1)
0000:01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:228e] (rev a1)
I believe the right place to add this label is once the driver has been detected on the host. See here:
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/881/diffs
I'll add my comments there.
I built a new image based on https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 for testing purposes, but it also works with the one provided here: https://github.com/NVIDIA/k8s-device-plugin/issues/332#issuecomment-1649010456
git branch
* device-plugin-wsl2
I created a Docker image with changes similar to this:
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/881/diffs
Thank you for working on this, now that WSL2 supports systemd I think more people will be running k8s on Windows.
Can confirm registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 works on a kubeadm-deployed cluster with Driver Version: 551.23 and a 2080 Ti.
Just a general note: we will release a v0.15.0-rc.1 of the GPU Device Plugin in the next week or so, including these changes. That should then allow us to get more concrete feedback on the released version instead of relying on the SHA-tagged image.
Hi @elezar, any update on when v0.15.0-rc.1 is going to be out?
v0.15.0-rc1 successfully enabled my scenario today: https://github.com/mrjohnsonalexander/classic
TL;DR Stack notes
1. Issue or feature description
helm install nvidia-device-plugin
nvidia-device-plugin-ctr logs
When I use ctr to run a test, the GPU works fine.
3. Information to attach (optional if deemed irrelevant)
Common error checking:
[ ] The output of nvidia-smi -a on your host
[ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
[ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
Additional information that might help better understand your environment and reproduce the bug:
[ ] Any relevant kernel output lines from dmesg
[ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
[ ] NVIDIA container library version from nvidia-container-cli -V
containerd config containerd.toml