Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

Scheduling fails, logs show "Node lock not set" #338

Closed lianziqt closed 3 months ago

lianziqt commented 3 months ago

1. Issue or feature description

I have some questions about using HAMi. After installing HAMi on a node with GPUs, every attempt to start a deployment fails: the pod status shows UnexpectedAdmissionError, and `kubectl logs -f -n kube-system hami-device-plugin-tlxs8 -c device-plugin` shows "Node lock not set" in the HAMi device plugin logs. Details below:

I0529 12:23:21.541285 3260597 register.go:131] MemoryScaling= 1 registeredmem= 24258
I0529 12:23:21.584937 3260597 register.go:83] "current card has not established numa topology" gpu row info=["GPU0"," X ","SYS","0-55","N/A"] index=0
I0529 12:23:21.584960 3260597 register.go:159] nvml registered device id=1, memory=24258, type=NVIDIA A30, numa=0
I0529 12:23:21.585025 3260597 register.go:131] MemoryScaling= 1 registeredmem= 24258
I0529 12:23:21.632135 3260597 register.go:83] "current card has not established numa topology" gpu row info=["GPU1","SYS"," X ","0-55","N/A"] index=1
I0529 12:23:21.632153 3260597 register.go:159] nvml registered device id=2, memory=24258, type=NVIDIA A30, numa=0
I0529 12:23:21.632165 3260597 register.go:166] "start working on the devices" devices=[{"Index":0,"Id":"GPU-14637886-1d37-61c4-a71e-fb5ca495ef65","Count":10,"Devmem":24258,"Devcore":100,"Type":"NVIDIA-NVIDIA A30","Numa":0,"Health":true},{"Index":0,"Id":"GPU-4edca201-9195-cf92-98c7-e7c2d627742b","Count":10,"Devmem":24258,"Devcore":100,"Type":"NVIDIA-NVIDIA A30","Numa":0,"Health":true}]
I0529 12:23:21.635578 3260597 util.go:128] Encoded node Devices: GPU-14637886-1d37-61c4-a71e-fb5ca495ef65,10,24258,100,NVIDIA-NVIDIA A30,0,true:GPU-4edca201-9195-cf92-98c7-e7c2d627742b,10,24258,100,NVIDIA-NVIDIA A30,0,true:
I0529 12:23:21.635602 3260597 register.go:176] patch node with the following annos map[hami.io/node-handshake:Reported 2024-05-29 12:23:21.635589358 +0000 UTC m=+6684.633978899 hami.io/node-nvidia-register:GPU-14637886-1d37-61c4-a71e-fb5ca495ef65,10,24258,100,NVIDIA-NVIDIA A30,0,true:GPU-4edca201-9195-cf92-98c7-e7c2d627742b,10,24258,100,NVIDIA-NVIDIA A30,0,true:]
I0529 12:23:21.645169 3260597 register.go:196] Successfully registered annotation. Next check in 30s seconds...
I0529 12:23:21.879271 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-4],}]
I0529 12:23:22.013396 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0529 12:23:22.479542 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-4edca201-9195-cf92-98c7-e7c2d627742b-2],}]
I0529 12:23:22.615092 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0529 12:23:23.083824 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-2],}]
I0529 12:23:23.219052 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0529 12:23:23.679685 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-4edca201-9195-cf92-98c7-e7c2d627742b-9],}]
I0529 12:23:23.814882 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0529 12:23:24.279823 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-2],}]
I0529 12:23:24.419197 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0529 12:23:24.879061 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-4edca201-9195-cf92-98c7-e7c2d627742b-7],}]
I0529 12:23:25.017744 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0529 12:23:25.479292 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-3],}]
I0529 12:23:25.621761 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
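
As the register lines above show, the plugin patches the node annotations (hami.io/node-handshake, hami.io/node-nvidia-register) every 30s. The current handshake value can be read straight off the node object (annotation name taken from the log line above; substitute your node name):

kubectl get node 192.168.0.5 -o jsonpath='{.metadata.annotations.hami\.io/node-handshake}'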

2. Steps to reproduce the issue

  1. Followed the installation guide on the project homepage until `kubectl get pods -n kube-system` showed:
    NAME                                       READY   STATUS    RESTARTS   AGE
    (......)
    hami-device-plugin-tlxs8                   2/2     Running   0          47h
    hami-scheduler-68cd4d985b-jrkh5            2/2     Running   0          47h
    (......)

    Running `kubectl describe node 192.168.0.5` (192.168.0.5 is the node name) also shows the expected GPU count (2 × 10):

    
    Capacity:
    cpu:                56
    ephemeral-storage:  154502324Ki
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             482970120Ki
    nvidia.com/gpu:     20
    pods:               110
    Allocatable:
    cpu:                54
    ephemeral-storage:  140241857915
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             479312392Ki
    nvidia.com/gpu:     20
    pods:               110
2. Deploy a deployment requesting `nvidia.com/gpu: 1`. It fails every time: the pod status shows UnexpectedAdmissionError, and `kubectl logs -f -n kube-system hami-device-plugin-tlxs8 -c device-plugin` shows "Node lock not set" in the HAMi device plugin logs:

I0531 09:46:29.283565 3260597 register.go:131] MemoryScaling= 1 registeredmem= 24258
I0531 09:46:29.325813 3260597 register.go:83] "current card has not established numa topology" gpu row info=["GPU0"," X ","SYS","0-55","N/A"] index=0
I0531 09:46:29.325865 3260597 register.go:159] nvml registered device id=1, memory=24258, type=NVIDIA A30, numa=0
I0531 09:46:29.326031 3260597 register.go:131] MemoryScaling= 1 registeredmem= 24258
I0531 09:46:29.336827 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:29.373783 3260597 register.go:83] "current card has not established numa topology" gpu row info=["GPU1","SYS"," X ","0-55","N/A"] index=1
I0531 09:46:29.373804 3260597 register.go:159] nvml registered device id=2, memory=24258, type=NVIDIA A30, numa=0
I0531 09:46:29.373819 3260597 register.go:166] "start working on the devices" devices=[{"Index":0,"Id":"GPU-14637886-1d37-61c4-a71e-fb5ca495ef65","Count":10,"Devmem":24258,"Devcore":100,"Type":"NVIDIA-NVIDIA A30","Numa":0,"Health":true},{"Index":0,"Id":"GPU-4edca201-9195-cf92-98c7-e7c2d627742b","Count":10,"Devmem":24258,"Devcore":100,"Type":"NVIDIA-NVIDIA A30","Numa":0,"Health":true}]
I0531 09:46:29.376915 3260597 util.go:128] Encoded node Devices: GPU-14637886-1d37-61c4-a71e-fb5ca495ef65,10,24258,100,NVIDIA-NVIDIA A30,0,true:GPU-4edca201-9195-cf92-98c7-e7c2d627742b,10,24258,100,NVIDIA-NVIDIA A30,0,true:
I0531 09:46:29.376943 3260597 register.go:176] patch node with the following annos map[hami.io/node-handshake:Reported 2024-05-31 09:46:29.376930027 +0000 UTC m=+170072.375319578 hami.io/node-nvidia-register:GPU-14637886-1d37-61c4-a71e-fb5ca495ef65,10,24258,100,NVIDIA-NVIDIA A30,0,true:GPU-4edca201-9195-cf92-98c7-e7c2d627742b,10,24258,100,NVIDIA-NVIDIA A30,0,true:]
I0531 09:46:29.383974 3260597 register.go:196] Successfully registered annotation. Next check in 30s seconds...
I0531 09:46:29.794603 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-3],}]
I0531 09:46:29.924314 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:30.397174 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-4edca201-9195-cf92-98c7-e7c2d627742b-8],}]
I0531 09:46:30.534171 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:30.996050 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-4],}]
I0531 09:46:31.128008 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:31.596925 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-9],}]
I0531 09:46:31.728817 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:32.195546 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-4],}]
I0531 09:46:32.327801 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:32.797415 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-6],}]
I0531 09:46:32.929690 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:33.395637 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-4edca201-9195-cf92-98c7-e7c2d627742b-4],}]
I0531 09:46:33.533039 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:33.995417 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-7],}]
I0531 09:46:34.128932 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:34.796967 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-4edca201-9195-cf92-98c7-e7c2d627742b-8],}]
I0531 09:46:34.929314 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:35.398659 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-14637886-1d37-61c4-a71e-fb5ca495ef65-1],}]
I0531 09:46:35.534839 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:35.995026 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-4edca201-9195-cf92-98c7-e7c2d627742b-5],}]
I0531 09:46:36.133031 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"
I0531 09:46:36.996326 3260597 server.go:290] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-4edca201-9195-cf92-98c7-e7c2d627742b-1],}]
I0531 09:46:37.132472 3260597 nodelock.go:73] "Node lock not set" node="192.168.0.5"

Running `kubectl logs -f -n kube-system hami-scheduler-68cd4d985b-jrkh5 vgpu-scheduler-extender` shows the following:

I0531 09:44:59.082870 1 util.go:128] Encoded node Devices: GPU-14637886-1d37-61c4-a71e-fb5ca495ef65,10,24258,100,NVIDIA-NVIDIA A30,0,true:GPU-4edca201-9195-cf92-98c7-e7c2d627742b,10,24258,100,NVIDIA-NVIDIA A30,0,true:
W0531 09:44:59.082892 1 scheduler.go:325] get node 192.168.0.238 device error, node 192.168.0.238 not found
W0531 09:44:59.082898 1 scheduler.go:325] get node 192.168.0.237 device error, node 192.168.0.237 not found
W0531 09:44:59.082902 1 scheduler.go:325] get node 192.168.0.32 device error, node 192.168.0.32 not found
W0531 09:44:59.083412 1 scheduler.go:325] get node 192.168.0.238 device error, node 192.168.0.238 not found
W0531 09:44:59.083426 1 scheduler.go:325] get node 192.168.0.237 device error, node 192.168.0.237 not found
W0531 09:44:59.083430 1 scheduler.go:325] get node 192.168.0.32 device error, node 192.168.0.32 not found
I0531 09:45:00.124588 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:00.125001 1 webhook.go:63] Processing admission hook for pod datafinder/, UID: 85ab3c40-28fa-4585-b63c-0fce5b15f15d
I0531 09:45:00.125016 1 webhook.go:84] Processing admission hook for pod datafinder/, UID: 85ab3c40-28fa-4585-b63c-0fce5b15f15d - Allowing admission for pod: no resource found
I0531 09:45:00.818481 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:00.818883 1 webhook.go:63] Processing admission hook for pod datafinder/, UID: 5f7610c6-8cf4-4fa1-8e99-3ea18af2a3ef
I0531 09:45:00.818899 1 webhook.go:84] Processing admission hook for pod datafinder/, UID: 5f7610c6-8cf4-4fa1-8e99-3ea18af2a3ef - Allowing admission for pod: no resource found
I0531 09:45:01.672039 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:01.672443 1 webhook.go:63] Processing admission hook for pod datarangers/, UID: c1df9597-831e-443e-b03b-d4e8ac4a44ea
I0531 09:45:01.672489 1 webhook.go:84] Processing admission hook for pod datarangers/, UID: c1df9597-831e-443e-b03b-d4e8ac4a44ea - Allowing admission for pod: no resource found
I0531 09:45:01.917897 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:01.918296 1 webhook.go:63] Processing admission hook for pod datarangers/, UID: 5c8aad5d-0233-4f23-b700-43eca87a7b96
I0531 09:45:01.918312 1 webhook.go:84] Processing admission hook for pod datarangers/, UID: 5c8aad5d-0233-4f23-b700-43eca87a7b96 - Allowing admission for pod: no resource found
I0531 09:45:02.772161 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:02.772517 1 webhook.go:63] Processing admission hook for pod minio/, UID: 586170e7-8fd5-4050-8967-944b1ac4a674
I0531 09:45:02.772533 1 webhook.go:84] Processing admission hook for pod minio/, UID: 586170e7-8fd5-4050-8967-944b1ac4a674 - Allowing admission for pod: no resource found
I0531 09:45:03.728881 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:03.729449 1 webhook.go:63] Processing admission hook for pod volc-content-bigdata/, UID: e11888b8-5b9f-47c4-b10f-5eae51b5a092
I0531 09:45:03.729465 1 webhook.go:84] Processing admission hook for pod volc-content-bigdata/, UID: e11888b8-5b9f-47c4-b10f-5eae51b5a092 - Allowing admission for pod: no resource found
W0531 09:45:12.202457 1 scheduler.go:325] get node 192.168.0.32 device error, node 192.168.0.32 not found
W0531 09:45:12.202482 1 scheduler.go:325] get node 192.168.0.238 device error, node 192.168.0.238 not found
W0531 09:45:12.202486 1 scheduler.go:325] get node 192.168.0.237 device error, node 192.168.0.237 not found
I0531 09:45:14.218465 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:14.218859 1 webhook.go:63] Processing admission hook for pod datafinder/, UID: f346beef-d152-4d32-8f27-4270bd1dc722
I0531 09:45:14.218874 1 webhook.go:84] Processing admission hook for pod datafinder/, UID: f346beef-d152-4d32-8f27-4270bd1dc722 - Allowing admission for pod: no resource found
I0531 09:45:14.234467 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:14.234804 1 webhook.go:63] Processing admission hook for pod datafinder/, UID: 443242f6-5a7c-49a6-9743-6d22c2595ada
I0531 09:45:14.234820 1 webhook.go:84] Processing admission hook for pod datafinder/, UID: 443242f6-5a7c-49a6-9743-6d22c2595ada - Allowing admission for pod: no resource found
I0531 09:45:14.372544 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:14.372916 1 webhook.go:63] Processing admission hook for pod datafinder/, UID: 0651ce95-f174-4461-ac10-4ef7ec39363b
I0531 09:45:14.372932 1 webhook.go:84] Processing admission hook for pod datafinder/, UID: 0651ce95-f174-4461-ac10-4ef7ec39363b - Allowing admission for pod: no resource found
I0531 09:45:14.722750 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:14.723151 1 webhook.go:63] Processing admission hook for pod datafinder/, UID: 977422db-afec-49b3-81b8-0435048c5b0f
I0531 09:45:14.723165 1 webhook.go:84] Processing admission hook for pod datafinder/, UID: 977422db-afec-49b3-81b8-0435048c5b0f - Allowing admission for pod: no resource found
W0531 09:45:27.202206 1 scheduler.go:325] get node 192.168.0.238 device error, node 192.168.0.238 not found
W0531 09:45:27.202223 1 scheduler.go:325] get node 192.168.0.237 device error, node 192.168.0.237 not found
W0531 09:45:27.202228 1 scheduler.go:325] get node 192.168.0.32 device error, node 192.168.0.32 not found
I0531 09:45:27.759291 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:27.759805 1 webhook.go:63] Processing admission hook for pod volc-content-mng/, UID: 21941509-f7f1-4f86-90c8-6545d6af79f0
W0531 09:45:27.759818 1 webhook.go:69] Processing admission hook for pod volc-content-mng/, UID: 21941509-f7f1-4f86-90c8-6545d6af79f0 - Denying admission as container pdf-model-deploy-main is privileged
I0531 09:45:27.759823 1 webhook.go:84] Processing admission hook for pod volc-content-mng/, UID: 21941509-f7f1-4f86-90c8-6545d6af79f0 - Allowing admission for pod: no resource found
I0531 09:45:27.885142 1 route.go:131] Start to handle webhook request on /webhook
I0531 09:45:27.885640 1 webhook.go:63] Processing admission hook for pod volc-content-mng/, UID: 9081502d-73ca-4b9e-bd35-20f60632fe55
W0531 09:45:27.885652 1 webhook.go:69] Processing admission hook for pod volc-content-mng/, UID: 9081502d-73ca-4b9e-bd35-20f60632fe55 - Denying admission as container pdf-model-deploy-main is privileged
I0531 09:45:27.885656 1 webhook.go:84] Processing admission hook for pod volc-content-mng/, UID: 9081502d-73ca-4b9e-bd35-20f60632fe55 - Allowing admission for pod: no resource found

3. Information to [attach](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/) (optional if deemed irrelevant)

Common error checking:
- [ ] The output of `nvidia-smi -a` on your host

==============NVSMI LOG==============

Timestamp : Fri May 31 17:54:34 2024
Driver Version : 470.57.02
CUDA Version : 11.4

Attached GPUs : 2

GPU 00000000:65:01.0
    Product Name : NVIDIA A30
    Product Brand : NVIDIA
    Display Mode : Enabled
    Display Active : Disabled
    Persistence Mode : Disabled
    MIG Mode
        Current : Disabled
        Pending : Disabled
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 1324021108933
    GPU UUID : GPU-14637886-1d37-61c4-a71e-fb5ca495ef65
    Minor Number : 1
    VBIOS Version : 92.00.66.00.03
    MultiGPU Board : No
    Board ID : 0x6501
    GPU Part Number : 900-21001-0040-000
    Module ID : 0
    Inforom Version
        Image Version : 1001.0205.00.02
        OEM Object : 2.0
        ECC Object : 6.16
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GSP Firmware Version : N/A
    GPU Virtualization Mode
        Virtualization Mode : Pass-Through
        Host VGPU Mode : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x65
        Device : 0x01
        Domain : 0x0000
        Device Id : 0x20B710DE
        Bus Id : 00000000:65:01.0
        Sub System Id : 0x153210DE
        GPU Link Info
            PCIe Generation
                Max : 4
                Current : 4
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : N/A
    Performance State : P0
    Clocks Throttle Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 24258 MiB
        Used : 3 MiB
        Free : 24255 MiB
    BAR1 Memory Usage
        Total : 32768 MiB
        Used : 2 MiB
        Free : 32766 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : Disabled
        Pending : Disabled
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Remapped Rows
        Correctable Error : 1
        Uncorrectable Error : 0
        Pending : No
        Remapping Failure Occurred : No
        Bank Remap Availability Histogram
            Max : 383 bank(s)
            High : 1 bank(s)
            Partial : 0 bank(s)
            Low : 0 bank(s)
            None : 0 bank(s)
    Temperature
        GPU Current Temp : 27 C
        GPU Shutdown Temp : 100 C
        GPU Slowdown Temp : 97 C
        GPU Max Operating Temp : 90 C
        GPU Target Temperature : N/A
        Memory Current Temp : 28 C
        Memory Max Operating Temp : 95 C
    Power Readings
        Power Management : Supported
        Power Draw : 26.95 W
        Power Limit : 165.00 W
        Default Power Limit : 165.00 W
        Enforced Power Limit : 165.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 165.00 W
    Clocks
        Graphics : 210 MHz
        SM : 210 MHz
        Memory : 1215 MHz
        Video : 585 MHz
    Applications Clocks
        Graphics : 930 MHz
        Memory : 1215 MHz
    Default Applications Clocks
        Graphics : 930 MHz
        Memory : 1215 MHz
    Max Clocks
        Graphics : 1440 MHz
        SM : 1440 MHz
        Memory : 1215 MHz
        Video : 1305 MHz
    Max Customer Boost Clocks
        Graphics : 1440 MHz
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Voltage
        Graphics : 675.000 mV
    Processes : None

GPU 00000000:67:01.0
    Product Name : NVIDIA A30
    Product Brand : NVIDIA
    Display Mode : Enabled
    Display Active : Disabled
    Persistence Mode : Disabled
    MIG Mode
        Current : Disabled
        Pending : Disabled
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 1324021109136
    GPU UUID : GPU-4edca201-9195-cf92-98c7-e7c2d627742b
    Minor Number : 0
    VBIOS Version : 92.00.66.00.03
    MultiGPU Board : No
    Board ID : 0x6701
    GPU Part Number : 900-21001-0040-000
    Module ID : 0
    Inforom Version
        Image Version : 1001.0205.00.02
        OEM Object : 2.0
        ECC Object : 6.16
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GSP Firmware Version : N/A
    GPU Virtualization Mode
        Virtualization Mode : Pass-Through
        Host VGPU Mode : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x67
        Device : 0x01
        Domain : 0x0000
        Device Id : 0x20B710DE
        Bus Id : 00000000:67:01.0
        Sub System Id : 0x153210DE
        GPU Link Info
            PCIe Generation
                Max : 4
                Current : 4
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : N/A
    Performance State : P0
    Clocks Throttle Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 24258 MiB
        Used : 3 MiB
        Free : 24255 MiB
    BAR1 Memory Usage
        Total : 32768 MiB
        Used : 2 MiB
        Free : 32766 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : Disabled
        Pending : Disabled
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Remapped Rows
        Correctable Error : 0
        Uncorrectable Error : 0
        Pending : No
        Remapping Failure Occurred : No
        Bank Remap Availability Histogram
            Max : 384 bank(s)
            High : 0 bank(s)
            Partial : 0 bank(s)
            Low : 0 bank(s)
            None : 0 bank(s)
    Temperature
        GPU Current Temp : 29 C
        GPU Shutdown Temp : 100 C
        GPU Slowdown Temp : 97 C
        GPU Max Operating Temp : 90 C
        GPU Target Temperature : N/A
        Memory Current Temp : 30 C
        Memory Max Operating Temp : 95 C
    Power Readings
        Power Management : Supported
        Power Draw : 28.68 W
        Power Limit : 165.00 W
        Default Power Limit : 165.00 W
        Enforced Power Limit : 165.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 165.00 W
    Clocks
        Graphics : 210 MHz
        SM : 210 MHz
        Memory : 1215 MHz
        Video : 585 MHz
    Applications Clocks
        Graphics : 930 MHz
        Memory : 1215 MHz
    Default Applications Clocks
        Graphics : 930 MHz
        Memory : 1215 MHz
    Max Clocks
        Graphics : 1440 MHz
        SM : 1440 MHz
        Memory : 1215 MHz
        Video : 1305 MHz
    Max Customer Boost Clocks
        Graphics : 1440 MHz
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Voltage
        Graphics : 687.500 mV
    Processes : None


- [ ] Your docker or containerd configuration file (e.g: `/etc/docker/daemon.json`)

{ "default-runtime": "nvidia", "default-ulimits": { "core": { "Hard": 0, "Name": "core", "Soft": 0 } }, "icc": true, "insecure-registries": [ "dockerhub.vpc.com", "dockerhub.vpc.com:5000" ], "iptables": true, "live-restore": true, "log-driver": "json-file", "log-opts": { "max-file": "10", "max-size": "50m", "mode": "non-blocking" }, "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } }, "storage-driver": "overlay2" }

- [ ] The kubelet logs on the node (e.g: `sudo journalctl -r -u kubelet`)

Additional information that might help better understand your environment and reproduce the bug:
- [ ] Docker version from `docker version`

Client: Docker Engine - Community
 Version:           20.10.17
 API version:       1.41
 Go version:        go1.17.11
 Git commit:        100c701
 Built:             Mon Jun 6 23:05:12 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.17
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.11
  Git commit:       a89b842
  Built:            Mon Jun 6 23:03:33 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.6
  GitCommit:        10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 nvidia:
  Version:          1.1.2
  GitCommit:        v1.1.2-0-ga916309
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

- [ ] Docker command, image and tag used
- [ ] Kernel version from `uname -a`

Linux iv-yd1gn44wzkqc6il2ciqz 3.10.0-1160.102.1.el7.x86_64 #1 SMP Tue Oct 17 15:42:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

lianziqt commented 3 months ago

I read the code in pkg/util/nodelock/nodelock.go. What seems odd is that in the hami-device-plugin logs, the only lock-related message is "Node lock not set" — none of the other lock-related log lines ever appear.
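
For reference, the lock that pkg/util/nodelock/nodelock.go manages is stored as an annotation on the node object, so it can be inspected (and, if stale, removed) from the CLI. The annotation name below (hami.io/mutex.lock) is what the nodelock source uses at the time of writing — verify it against the version you have installed:

kubectl get node 192.168.0.5 -o jsonpath='{.metadata.annotations.hami\.io/mutex\.lock}'

# a trailing '-' deletes the annotation, i.e. clears a stale lock
kubectl annotate node 192.168.0.5 hami.io/mutex.lock-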

lianziqt commented 3 months ago

kubectl describe pods pdf-model-deploy-fdfc65bfc-xg9sp -n volc-content-mng

Name:           pdf-model-deploy-fdfc65bfc-xg9sp
Namespace:      volc-content-mng
Priority:       0
Node:           192.168.0.5/
Start Time:     Tue, 04 Jun 2024 16:17:59 +0800
Labels:         app=pdf-model-deploy
                pod-template-hash=fdfc65bfc
Annotations:    consul.register/enabled: true
                consul.register/enabled.podip: true
                consul.register/port.9899: caijing.algo.pdf_onnx_model
Status:         Failed
Reason:         UnexpectedAdmissionError
Message:        Pod Allocate failed due to rpc error: code = Unknown desc = no binding pod found on node 192.168.0.5, which is unexpected
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/pdf-model-deploy-fdfc65bfc
Containers:
  pdf-model-deploy-main:
    Image:      dockerhub.vpc.com:5000/volc_cms/volc.content.pdf_model_deploy:1.0.0.16
    Port:       9899/TCP
    Host Port:  0/TCP
    Command:
      bash
      /opt/tiger/caijing_pdf_model_deploy/deploy_pdf_model_service.sh
      v2
    Limits:
      cpu:             4
      memory:          12Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          4Gi
      nvidia.com/gpu:  1
    Environment:
      OnPremise:                  1
      IS_ON_PREMISE:              true
      HOST_IP:                     (v1:status.hostIP)
      MY_HOST_IP:                  (v1:status.hostIP)
      RUNTIME_IDC_NAME:           pri
      POD_IP:                      (v1:status.podIP)
      CONSUL_HTTP_HOST:            (v1:status.hostIP)
      PORT:                       9899
      USE_P2P:                    false
      ENABLE_MPS:                 false
      GPUS:                       1
      USE_MULTI_PROCESS:          false
      NUM_WORKERS_PER_GROUP:      1
      CONF:                       /opt/tiger/caijing_pdf_model_deploy/conf/rpc.conf
      LOG4J_CONF:                 /opt/tiger/caijing_pdf_model_deploy/conf/log4j.xml
      LOG4J_CHILD_CONF:           /opt/tiger/caijing_pdf_model_deploy/conf/log4j_child.xml
      ENABLE_TRACE:               false
      ENABLE_DYNAMIC_CONFIG:      false
      ENABLE_REPORT_QPS:          false
      DYNAMIC_GPU_NUM:            1
      QS_RUNTIME_ENABLE_WARM_UP:  0
    Mounts:
      /opt/tmp/sock from sock (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-9t8sb (ro)
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      pdf-model-deploy-config
    Optional:  false
  sock:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/tmp/sock
    HostPathType:  
  default-token-9t8sb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-9t8sb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  machine=gpu
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                    Age   From               Message
  ----     ------                    ----  ----               -------
  Normal   Scheduled                 31s   default-scheduler  Successfully assigned volc-content-mng/pdf-model-deploy-fdfc65bfc-xg9sp to 192.168.0.5
  Warning  UnexpectedAdmissionError  31s   kubelet            Allocate failed due to rpc error: code = Unknown desc = no binding pod found on node 192.168.0.5, which is unexpected
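
Note the From column in the events above: the pod was bound by default-scheduler rather than by the HAMi scheduler, which lines up with the device plugin's "no binding pod found" error — a pod the HAMi scheduler never handled has no binding annotation to look up. Which scheduler was assigned can be confirmed with:

kubectl get pod pdf-model-deploy-fdfc65bfc-xg9sp -n volc-content-mng -o jsonpath='{.spec.schedulerName}'
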
lianziqt commented 3 months ago

kubectl describe nodes 192.168.0.5

Name:               192.168.0.5
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    elasticsearch=true
                    gpu=on
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=192.168.0.5
                    kubernetes.io/os=linux
                    machine=gpu
                    node-role.kubernetes.io/worker=true
                    shard-replica=1-1
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"5a:2a:50:05:fb:ea"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.0.5
                    hami.io/node-handshake: Requesting_2024.06.04 08:23:02
                    hami.io/node-handshake-dcu: Requesting_2024.05.28 13:01:11
                    hami.io/node-nvidia-register:
                      GPU-14637886-1d37-61c4-a71e-fb5ca495ef65,10,24258,100,NVIDIA-NVIDIA A30,0,true:GPU-4edca201-9195-cf92-98c7-e7c2d627742b,10,24258,100,NVIDI...
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 192.168.0.5/24
                    projectcalico.org/IPv4IPIPTunnelAddr: 10.42.3.1
                    rke.cattle.io/external-ip: 192.168.0.5
                    rke.cattle.io/internal-ip: 192.168.0.5
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 01 Apr 2024 16:06:14 +0800
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  192.168.0.5
  AcquireTime:     <unset>
  RenewTime:       Tue, 04 Jun 2024 16:23:22 +0800
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Mon, 01 Apr 2024 16:06:38 +0800   Mon, 01 Apr 2024 16:06:38 +0800   FlannelIsUp                  Flannel is running on this node
  MemoryPressure       False   Tue, 04 Jun 2024 16:19:33 +0800   Mon, 01 Apr 2024 16:06:14 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 04 Jun 2024 16:19:33 +0800   Mon, 01 Apr 2024 16:06:14 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 04 Jun 2024 16:19:33 +0800   Mon, 01 Apr 2024 16:06:14 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 04 Jun 2024 16:19:33 +0800   Mon, 01 Apr 2024 16:06:45 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  192.168.0.5
  Hostname:    192.168.0.5
Capacity:
  cpu:                56
  ephemeral-storage:  154502324Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             482970120Ki
  nvidia.com/gpu:     20
  pods:               110
Allocatable:
  cpu:                54
  ephemeral-storage:  140241857915
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             479312392Ki
  nvidia.com/gpu:     20
  pods:               110
System Info:
  Machine ID:                 
  System UUID:                000C65FA-CB33-0000-0009-80D401443FEB
  Boot ID:                    3741823c-3c9c-49ad-a9ef-dd3e77e5cb59
  Kernel Version:             3.10.0-1160.102.1.el7.x86_64
  OS Image:                   CentOS Linux 7 (Core)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.17
  Kubelet Version:            v1.19.16
  Kube-Proxy Version:         v1.19.16
PodCIDR:                      10.42.3.0/24
PodCIDRs:                     10.42.3.0/24
archlitchi commented 3 months ago

Pods with `privileged: true` can't be scheduled, because there is no way to limit their GPU resources.
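
If the container does not strictly need privileged mode, dropping the flag should let the webhook admit and schedule the pod again. A minimal sketch, assuming `privileged` is set in the securityContext of the first container of the deployment shown earlier:

kubectl patch deployment pdf-model-deploy -n volc-content-mng --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/securityContext/privileged"}]'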