NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Multiple device types detected: #736

Open sipvoip opened 10 months ago

sipvoip commented 10 months ago

[root@gpu-feature-discovery-sjzg4 /]# gpu-feature-discovery mixed
I1022 16:36:51.576911 137 main.go:122] Starting OS watcher.
I1022 16:36:51.577238 137 main.go:127] Loading configuration.
I1022 16:36:51.577541 137 main.go:139] Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "gdsEnabled": null,
    "mofedEnabled": null,
    "gfd": {
      "oneshot": false,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
      "machineTypeFile": "/sys/class/dmi/id/product_name"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I1022 16:36:51.577912 137 factory.go:48] Detected NVML platform: found NVML library
I1022 16:36:51.577942 137 factory.go:48] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I1022 16:36:51.577952 137 factory.go:64] Using NVML manager
I1022 16:36:51.577959 137 main.go:144] Start running
W1022 16:36:51.602083 137 mig-strategy.go:151] Multiple device types detected: [NVIDIA GeForce RTX 3080 NVIDIA GeForce RTX 3090 NVIDIA GeForce RTX 4090]
I1022 16:36:51.606246 137 main.go:187] Creating Labels
2023/10/22 16:36:51 Writing labels to output file /etc/kubernetes/node-feature-discovery/features.d/gfd
I1022 16:36:51.606418 137 main.go:197] Sleeping for 60000000000

Only the last GPU is showing up, the 4090.

root@kubernetes0: more /etc/kubernetes/node-feature-discovery/features.d/gfd
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.count=1
nvidia.com/gpu.family=ampere
nvidia.com/gpu.machine=Standard-PC-(i440FX-+-PIIX,-1996)
nvidia.com/cuda.driver.minor=113
nvidia.com/gfd.timestamp=1697987627
nvidia.com/gpu.replicas=1
nvidia.com/gpu.memory=24564
nvidia.com/cuda.runtime.minor=2
nvidia.com/cuda.driver.rev=01
nvidia.com/mig.capable=false
nvidia.com/gpu.product=NVIDIA-GeForce-RTX-4090
nvidia.com/gpu.compute.minor=9
nvidia.com/cuda.driver.major=535
nvidia.com/cuda.runtime.major=12

How do I get all 3 GPUs to be discovered?
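For anyone wanting to check this on their own node, here is a minimal sketch that parses a GFD features file into a dictionary so the reported count and product can be compared with what the hardware actually has. It assumes only the `key=value` per-line format shown above; the function name `parse_gfd_labels` and the shortened sample are made up for illustration and are not part of GFD.

```python
def parse_gfd_labels(text):
    """Parse 'key=value' lines (the format of the features.d/gfd file above)."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        # Split on the first '=' only, in case a value contains '='.
        key, value = line.split("=", 1)
        labels[key] = value
    return labels


# Shortened sample taken from the file contents posted above.
sample = """\
nvidia.com/gpu.count=1
nvidia.com/gpu.memory=24564
nvidia.com/gpu.product=NVIDIA-GeForce-RTX-4090
"""

labels = parse_gfd_labels(sample)
gpu_count = int(labels["nvidia.com/gpu.count"])
print(gpu_count, labels["nvidia.com/gpu.product"])
```

On the node itself, the text would come from reading `/etc/kubernetes/node-feature-discovery/features.d/gfd` instead of the inline sample.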

deanpeterson commented 7 months ago

I'm getting the same issue. I have a 3060 and a 3090 in one of my nodes and only the 3090 is showing even though lspci shows both. Have you figured out a workaround yet?

elezar commented 7 months ago

The labels exposed by GFD are node-level labels and we don't have the granularity to map these to common nvidia.com/gpu labels at present.
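To illustrate the granularity point, here is a rough sketch of why a flat set of node-level labels cannot describe several different products at once: a label key holds exactly one value per node. This is not GFD's actual code; `build_node_labels` is a hypothetical helper, and the "last device wins" behaviour is an assumption chosen only to match what was reported above. It also does not attempt to explain why `nvidia.com/gpu.count` shows 1.

```python
def build_node_labels(device_names):
    """Hypothetical: collapse per-device names into one node-level label map."""
    labels = {}
    for name in device_names:
        # One key, one value: each device overwrites the previous entry
        # (assumed ordering, for illustration only).
        labels["nvidia.com/gpu.product"] = name.replace(" ", "-")
    return labels


# Device names as listed in the "Multiple device types detected" warning.
devices = [
    "NVIDIA GeForce RTX 3080",
    "NVIDIA GeForce RTX 3090",
    "NVIDIA GeForce RTX 4090",
]

node_labels = build_node_labels(devices)
print(node_labels)
```

Whatever the real selection rule is, the map ends up with a single `nvidia.com/gpu.product` entry, so two of the three products are not represented.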

@deanpeterson and @sipvoip what is it that you're trying to do with the labels?

deanpeterson commented 7 months ago

@elezar I'm using OpenShift AI with the Ray.io components to create distributed workloads. I have one dual-4090 machine that sees both video cards because they are the same model. But I have another node that has a 3060 and a 3090. When I spin up Ray.io workers, they have to match. So if I ask for workers with 2 GPUs each, then to get 2 workers I need both the 3060 and the 3090 to be recognized by the NVIDIA GPU Operator. This was working on my EPYC machine, but that node was unstable, so I replaced it with a dual-Xeon machine. For some reason, the dual-Xeon machine sees both video cards, but the NVIDIA GPU Operator is not creating labels for both the 3060 and the 3090 and shows nvidia.com/gpu.count=1, even though the GPU discovery pod logs this:

W0213 05:32:27.592511 1 mig-strategy.go:151] Multiple device types detected: [NVIDIA GeForce RTX 3060 NVIDIA GeForce RTX 3090]
I0213 05:32:27.600682 1 main.go:187] Creating Labels
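A quick way to compare this warning with the `nvidia.com/gpu.count` label is to pull the device names out of the log line. The sketch below assumes the exact message format shown above and that every device name starts with `NVIDIA ` (the names are space-separated with no other delimiter); `detected_device_types` is a hypothetical helper, not something GFD provides.

```python
import re


def detected_device_types(log_line):
    """Extract device names from a 'Multiple device types detected' warning."""
    match = re.search(r"Multiple device types detected: \[(.*?)\]", log_line)
    if match is None:
        return []
    # Split on the "NVIDIA " prefix: each name runs until the next
    # " NVIDIA " or the end of the bracketed list.
    return re.findall(r"NVIDIA .+?(?= NVIDIA |$)", match.group(1))


line = (
    "W0213 05:32:27.592511 1 mig-strategy.go:151] Multiple device types "
    "detected: [NVIDIA GeForce RTX 3060 NVIDIA GeForce RTX 3090]"
)

detected = detected_device_types(line)
print(len(detected), detected)
```

If the number of names found here is larger than the value of `nvidia.com/gpu.count`, the node is in the situation described in this issue.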

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

ThomasDravigney commented 5 days ago

Hi, any update on that? I'm having the exact same issue. Thanks!