Closed: KeKouShi closed this issue 1 month ago
What is your GPU driver install type? Did you install it via the gpu-operator or natively?
Hi, the GPU info is as follows:
Thu Aug 8 10:10:53 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:03:00.0 Off | Off |
| 30% 25C P8 30W / 450W | 21MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1194      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
What is the question here?
The code has been updated to adhere to the latest DRA APIs in Kubernetes v1.31.
This is a major overhaul of the code base, including a change to have resources now advertised directly to the in-tree scheduler for allocation rather than allocated by a custom controller.
Closing this issue for now, as it seems related to code paths that no longer exist. Please try updating to the latest and let me know if you still have issues.
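To illustrate the overhaul: under the updated model the in-tree scheduler allocates devices itself from structured parameters, instead of waiting on the driver's custom controller. A minimal sketch of a claim under the newer API might look like the following (the `resource.k8s.io/v1alpha3` group version is an assumption based on the DRA API shipped with Kubernetes v1.31, and the `gpu.nvidia.com` class name is taken from the logs in this issue):

```yaml
# Sketch only: assumes the resource.k8s.io/v1alpha3 DRA API from Kubernetes
# v1.31 and a DeviceClass named gpu.nvidia.com (the class name visible in
# the controller log below). Not a definitive manifest for this driver.
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: shared-gpu
  namespace: gpu-test2
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
```

With structured parameters the scheduler reads the advertised device inventory and performs the allocation directly, so the old controller-side allocation path (and its "waiting for resource driver" stall) no longer applies.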
After the machine is rebooted, the plugin stops working. The nvidia-dra-k8s-dra-driver-controller logs show:
I0805 07:44:25.860219 1 controller.go:674] "resource controller: pending pod claims" key="schedulingCtx:gpu-test2/pod" claims=[{PodClaimName:shared-gpu Claim:&ResourceClaim{ObjectMeta:{pod-shared-gpu gpu-test2 b6ad3eb0-9f75-4423-afc0-88ae05e66d91 12608805 0 2024-08-05 07:37:24 +0000 UTC <nil> <nil> map[] map[] [{v1 Pod pod a7453c51-dad9-4948-81ff-898003d9748e 0xc0005fc0c8 0xc0005fc0c9}] [] [{kube-controller-manager Update resource.k8s.io/v1alpha2 2024-08-05 07:37:24 +0000 UTC FieldsV1 {"f:metadata":{"f:ownerReferences":{".":{},"k:{\"uid\":\"a7453c51-dad9-4948-81ff-898003d9748e\"}":{}}},"f:spec":{"f:allocationMode":{},"f:resourceClassName":{}}} }]},Spec:ResourceClaimSpec{ResourceClassName:gpu.nvidia.com,ParametersRef:nil,AllocationMode:WaitForFirstConsumer,},Status:ResourceClaimStatus{DriverName:,Allocation:nil,ReservedFor:[]ResourceClaimConsumerReference{},DeallocationRequested:false,},} Class:&ResourceClass{ObjectMeta:{gpu.nvidia.com 4ac681d4-a424-4bee-a543-7c396d1b3a50 12604039 0 2024-08-05 06:38:45 +0000 UTC <nil> <nil> map[app.kubernetes.io/managed-by:Helm] map[meta.helm.sh/release-name:nvidia-dra meta.helm.sh/release-namespace:nvidia-dra-driver] [] [] [{helm Update resource.k8s.io/v1alpha2 2024-08-05 06:38:45 +0000 UTC FieldsV1 {"f:driverName":{},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}}} }]},DriverName:gpu.resource.nvidia.com,ParametersRef:nil,SuitableNodes:nil,} ClaimParameters:0xc000592468 ClassParameters:0xc0004a0138 UnsuitableNodes:[machine]}] selectedNode="machine"
I0805 07:44:25.860257 1 controller.go:685] "resource controller: skipping allocation for unsuitable selected node" key="schedulingCtx:gpu-test2/pod" node="machine"
I0805 07:44:25.860325 1 controller.go:342] "resource controller: recheck periodically" key="schedulingCtx:gpu-test2/pod"
I0805 07:44:55.860961 1 controller.go:332] "resource controller: processing" key="schedulingCtx:gpu-test2/pod"
The kube-scheduler logs show:
E0805 07:37:25.747771 1 framework.go:1165] "Failed running Reserve plugin" err="waiting for resource driver to allocate resource" plugin="DynamicResources" pod="gpu-test2/pod"
E0805 07:37:25.747912 1 schedule_one.go:883] "Error scheduling pod; retrying" err="running Reserve plugin \"DynamicResources\": waiting for resource driver to allocate resource" pod="gpu-test2/pod"
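The two failures line up: the scheduler's DynamicResources Reserve plugin blocks on the claim, while the controller keeps marking the selected node as unsuitable and rechecking, so allocation never completes. As a quick triage sketch (the log excerpt is inlined below purely for illustration; against a live cluster you would pipe the controller pod's `kubectl logs` output into the same grep), searching for the "unsuitable" message shows where the loop stalls:

```shell
# Triage sketch: count how many times the controller declared the selected
# node unsuitable. The log text here is copied from the issue above; this is
# an illustration, not a definitive diagnostic procedure.
log='I0805 07:44:25.860257 1 controller.go:685] "resource controller: skipping allocation for unsuitable selected node" key="schedulingCtx:gpu-test2/pod" node="machine"
I0805 07:44:25.860325 1 controller.go:342] "resource controller: recheck periodically" key="schedulingCtx:gpu-test2/pod"'
printf '%s\n' "$log" | grep -c 'unsuitable selected node'
```

A nonzero count here, paired with the scheduler's "waiting for resource driver to allocate resource" error, points at the old controller-side allocation path rather than at the GPU itself.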