NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

skipping allocation for unsuitable selected node #150

Closed KeKouShi closed 1 month ago

KeKouShi commented 2 months ago

After the machine reboots, the plugin no longer works. The nvidia-dra-k8s-dra-driver-controller logs show:

```
I0805 07:44:25.860219 1 controller.go:674] "resource controller: pending pod claims" key="schedulingCtx:gpu-test2/pod" claims=[{PodClaimName:shared-gpu Claim:&ResourceClaim{ObjectMeta:{pod-shared-gpu gpu-test2 b6ad3eb0-9f75-4423-afc0-88ae05e66d91 12608805 0 2024-08-05 07:37:24 +0000 UTC <nil> <nil> map[] map[] [{v1 Pod pod a7453c51-dad9-4948-81ff-898003d9748e 0xc0005fc0c8 0xc0005fc0c9}] [] [{kube-controller-manager Update resource.k8s.io/v1alpha2 2024-08-05 07:37:24 +0000 UTC FieldsV1 {"f:metadata":{"f:ownerReferences":{".":{},"k:{\"uid\":\"a7453c51-dad9-4948-81ff-898003d9748e\"}":{}}},"f:spec":{"f:allocationMode":{},"f:resourceClassName":{}}} }]},Spec:ResourceClaimSpec{ResourceClassName:gpu.nvidia.com,ParametersRef:nil,AllocationMode:WaitForFirstConsumer,},Status:ResourceClaimStatus{DriverName:,Allocation:nil,ReservedFor:[]ResourceClaimConsumerReference{},DeallocationRequested:false,},} Class:&ResourceClass{ObjectMeta:{gpu.nvidia.com 4ac681d4-a424-4bee-a543-7c396d1b3a50 12604039 0 2024-08-05 06:38:45 +0000 UTC <nil> <nil> map[app.kubernetes.io/managed-by:Helm] map[meta.helm.sh/release-name:nvidia-dra meta.helm.sh/release-namespace:nvidia-dra-driver] [] [] [{helm Update resource.k8s.io/v1alpha2 2024-08-05 06:38:45 +0000 UTC FieldsV1 {"f:driverName":{},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}}} }]},DriverName:gpu.resource.nvidia.com,ParametersRef:nil,SuitableNodes:nil,} ClaimParameters:0xc000592468 ClassParameters:0xc0004a0138 UnsuitableNodes:[machine]}] selectedNode="machine"
I0805 07:44:25.860257 1 controller.go:685] "resource controller: skipping allocation for unsuitable selected node" key="schedulingCtx:gpu-test2/pod" node="machine"
I0805 07:44:25.860325 1 controller.go:342] "resource controller: recheck periodically" key="schedulingCtx:gpu-test2/pod"
I0805 07:44:55.860961 1 controller.go:332] "resource controller: processing" key="schedulingCtx:gpu-test2/pod"
```

The kube-scheduler logs show:

```
E0805 07:37:25.747771 1 framework.go:1165] "Failed running Reserve plugin" err="waiting for resource driver to allocate resource" plugin="DynamicResources" pod="gpu-test2/pod"
E0805 07:37:25.747912 1 schedule_one.go:883] "Error scheduling pod; retrying" err="running Reserve plugin \"DynamicResources\": waiting for resource driver to allocate resource" pod="gpu-test2/pod"
```
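For reference, the manifests behind that claim look roughly like this (reconstructed from the fields visible in the log above; the Pod's container spec here is illustrative and may differ from the actual setup):

```yaml
# Reconstruction from the log above (resource.k8s.io/v1alpha2, pre-1.31 DRA).
# Per the log, the ResourceClass was installed by the Helm chart, and the
# claim "pod-shared-gpu" was generated by kube-controller-manager from a
# ResourceClaimTemplate when the pod was created.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
metadata:
  name: gpu.nvidia.com
driverName: gpu.resource.nvidia.com
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: shared-gpu
  namespace: gpu-test2
spec:
  spec:
    resourceClassName: gpu.nvidia.com
    allocationMode: WaitForFirstConsumer
---
apiVersion: v1
kind: Pod
metadata:
  name: pod
  namespace: gpu-test2
spec:
  resourceClaims:
  - name: shared-gpu            # yields the generated claim "pod-shared-gpu"
    source:
      resourceClaimTemplateName: shared-gpu
  containers:
  - name: ctr                   # illustrative container; not from the log
    image: ubuntu:22.04
    command: ["sleep", "infinity"]
    resources:
      claims:
      - name: shared-gpu
```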

lengrongfu commented 2 months ago

How did you install the GPU driver? Via gpu-operator, or a native install?

KeKouShi commented 2 months ago

hi, the GPU info is as follows:

```
Thu Aug  8 10:10:53 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:03:00.0 Off |                  Off |
| 30%   25C    P8              30W / 450W |     21MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1194      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
```

klueska commented 2 months ago

What is the question here?

klueska commented 1 month ago

The code has been updated to adhere to the latest DRA APIs in Kubernetes v1.31.

This is a major overhaul of the code base, including a change so that resources are now advertised directly to the in-tree scheduler for allocation rather than being allocated by a custom controller.
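For anyone migrating, here is a minimal sketch of what a claim looks like against the v1.31 structured-parameters API (resource.k8s.io/v1alpha3). The request name and container details below are illustrative; check the updated examples in this repo for the exact manifests:

```yaml
# Sketch only, assuming Kubernetes v1.31 with the DynamicResourceAllocation
# feature gate enabled. The driver now publishes its devices as ResourceSlice
# objects, and the in-tree scheduler allocates them directly; there is no
# custom allocation controller anymore.
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  name: shared-gpu
  namespace: gpu-test2
spec:
  spec:
    devices:
      requests:
      - name: gpu                        # illustrative request name
        deviceClassName: gpu.nvidia.com  # DeviceClass installed by the driver
---
apiVersion: v1
kind: Pod
metadata:
  name: pod
  namespace: gpu-test2
spec:
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: shared-gpu  # the "source:" wrapper was dropped in v1.31
  containers:
  - name: ctr                              # illustrative container
    image: ubuntu:22.04
    command: ["sleep", "infinity"]
    resources:
      claims:
      - name: shared-gpu
```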

Closing this issue for now, as it appears to involve code paths that no longer exist. Please try updating to the latest version and let me know if you still have issues.