After deploying HAMi, restarting the k8s cluster loses the previously configured vGPU split count

stevechan1993 commented 1 week ago

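What happened: I deployed HAMi following the official documentation and split one RTX 3090 card into 10 vGPUs. After deployment everything tested fine, and multiple jobs could use the GPU for computation at the same time. After the k8s cluster was restarted, however, the GPU count reverted to the number of physical GPUs, even though both HAMi pods were in the Running state.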

What you expected to happen:

How to reproduce it (as minimally and precisely as possible): Deploy HAMi, then restart the entire k8s cluster.

Anything else we need to know?:

The output of nvidia-smi -a on your host

    ==============NVSMI LOG==============

    Timestamp : Wed Nov 6 11:28:02 2024
    Driver Version : 555.42.06
    CUDA Version : 12.5

    Attached GPUs : 1
    GPU 00000000:03:00.0
        Product Name : NVIDIA GeForce RTX 3090
        Product Brand : GeForce
        Product Architecture : Ampere
        Display Mode : Enabled
        Display Active : Enabled
        Persistence Mode : Disabled
        Addressing Mode : None
        MIG Mode
            Current : N/A
            Pending : N/A
        Accounting Mode : Disabled
        Accounting Mode Buffer Size : 4000
        Driver Model
            Current : N/A
            Pending : N/A
        Serial Number : N/A
        GPU UUID : GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282
        Minor Number : 0
        VBIOS Version : 94.02.42.80.B8
        MultiGPU Board : No
        Board ID : 0x300
        Board Part Number : N/A
        GPU Part Number : 2204-300-A1
        FRU Part Number : N/A
        Module ID : 1
        Inforom Version
            Image Version : G001.0000.03.03
            OEM Object : 2.0
            ECC Object : N/A
            Power Management Object : N/A
        Inforom BBX Object Flush
            Latest Timestamp : N/A
            Latest Duration : N/A
        GPU Operation Mode
            Current : N/A
            Pending : N/A
        GPU C2C Mode : N/A
        GPU Virtualization Mode
            Virtualization Mode : None
            Host VGPU Mode : N/A
            vGPU Heterogeneous Mode : N/A
        GPU Reset Status
            Reset Required : No
            Drain and Reset Recommended : N/A
        GSP Firmware Version : 555.42.06
        IBMNPU
            Relaxed Ordering Mode : N/A
        PCI
            Bus : 0x03
            Device : 0x00
            Domain : 0x0000
            Base Classcode : 0x3
            Sub Classcode : 0x0
            Device Id : 0x220410DE
            Bus Id : 00000000:03:00.0
            Sub System Id : 0x145410DE
            GPU Link Info
                PCIe Generation
                    Max : 3
                    Current : 1
                    Device Current : 1
                    Device Max : 4
                    Host Max : 3
                Link Width
                    Max : 16x
                    Current : 16x
            Bridge Chip
                Type : N/A
                Firmware : N/A
            Replays Since Reset : 0
            Replay Number Rollovers : 0
            Tx Throughput : 1 KB/s
            Rx Throughput : 0 KB/s
            Atomic Caps Inbound : N/A
            Atomic Caps Outbound : N/A
        Fan Speed : 0 %
        Performance State : P8
        Clocks Event Reasons
            Idle : Active
            Applications Clocks Setting : Not Active
            SW Power Cap : Not Active
            HW Slowdown : Not Active
                HW Thermal Slowdown : Not Active
                HW Power Brake Slowdown : Not Active
            Sync Boost : Not Active
            SW Thermal Slowdown : Not Active
            Display Clock Setting : Not Active
        Sparse Operation Mode : N/A
        FB Memory Usage
            Total : 24576 MiB
            Reserved : 416 MiB
            Used : 977 MiB
            Free : 23185 MiB
        BAR1 Memory Usage
            Total : 256 MiB
            Used : 12 MiB
            Free : 244 MiB
        Conf Compute Protected Memory Usage
            Total : 0 MiB
            Used : 0 MiB
            Free : 0 MiB
        Compute Mode : Default
        Utilization
            Gpu : 0 %
            Memory : 2 %
            Encoder : 0 %
            Decoder : 0 %
            JPEG : 0 %
            OFA : 0 %
        Encoder Stats
            Active Sessions : 0
            Average FPS : 0
            Average Latency : 0
        FBC Stats
            Active Sessions : 0
            Average FPS : 0
            Average Latency : 0
        ECC Mode
            Current : N/A
            Pending : N/A
        ECC Errors
            Volatile
                SRAM Correctable : N/A
                SRAM Uncorrectable Parity : N/A
                SRAM Uncorrectable SEC-DED : N/A
                DRAM Correctable : N/A
                DRAM Uncorrectable : N/A
            Aggregate
                SRAM Correctable : N/A
                SRAM Uncorrectable Parity : N/A
                SRAM Uncorrectable SEC-DED : N/A
                DRAM Correctable : N/A
                DRAM Uncorrectable : N/A
                SRAM Threshold Exceeded : N/A
            Aggregate Uncorrectable SRAM Sources
                SRAM L2 : N/A
                SRAM SM : N/A
                SRAM Microcontroller : N/A
                SRAM PCIE : N/A
                SRAM Other : N/A
        Retired Pages
            Single Bit ECC : N/A
            Double Bit ECC : N/A
            Pending Page Blacklist : N/A
        Remapped Rows : N/A
        Temperature
            GPU Current Temp : 35 C
            GPU T.Limit Temp : N/A
            GPU Shutdown Temp : 98 C
            GPU Slowdown Temp : 95 C
            GPU Max Operating Temp : 93 C
            GPU Target Temperature : 83 C
            Memory Current Temp : N/A
            Memory Max Operating Temp : N/A
        GPU Power Readings
            Power Draw : 20.40 W
            Current Power Limit : 420.00 W
            Requested Power Limit : 420.00 W
            Default Power Limit : 420.00 W
            Min Power Limit : 100.00 W
            Max Power Limit : 450.00 W
        GPU Memory Power Readings
            Power Draw : N/A
        Module Power Readings
            Power Draw : N/A
            Current Power Limit : N/A
            Requested Power Limit : N/A
            Default Power Limit : N/A
            Min Power Limit : N/A
            Max Power Limit : N/A
        Clocks
            Graphics : 0 MHz
            SM : 0 MHz
            Memory : 405 MHz
            Video : 555 MHz
        Applications Clocks
            Graphics : N/A
            Memory : N/A
        Default Applications Clocks
            Graphics : N/A
            Memory : N/A
        Deferred Clocks
            Memory : N/A
        Max Clocks
            Graphics : 2160 MHz
            SM : 2160 MHz
            Memory : 9751 MHz
            Video : 1950 MHz
        Max Customer Boost Clocks
            Graphics : N/A
        Clock Policy
            Auto Boost : N/A
            Auto Boost Default : N/A
        Voltage
            Graphics : 0.000 mV
        Fabric
            State : N/A
            Status : N/A
            CliqueId : N/A
            ClusterUUID : N/A
            Health
                Bandwidth : N/A
        Processes
            GPU instance ID : N/A
            Compute instance ID : N/A
            Process ID : 1461
                Type : G
                Name : /usr/lib/xorg/Xorg
                Used GPU Memory : 214 MiB
            GPU instance ID : N/A
            Compute instance ID : N/A
            Process ID : 1622
                Type : C+G
                Name : /usr/libexec/gnome-remote-desktop-daemon
                Used GPU Memory : 258 MiB
            GPU instance ID : N/A
            Compute instance ID : N/A
            Process ID : 1692
                Type : G
                Name : /usr/bin/gnome-shell
                Used GPU Memory : 76 MiB
            GPU instance ID : N/A
            Compute instance ID : N/A
            Process ID : 2034
                Type : G
                Name : /usr/local/sunlogin/bin/sunloginclient --cmd=autorun
                Used GPU Memory : 13 MiB
            GPU instance ID : N/A
            Compute instance ID : N/A
            Process ID : 2308
                Type : G
                Name : /usr/local/sunlogin/bin/sunloginclient --type=zygote --no-sandbox --lang=en-US --locales-dir-path=/usr/local/sunlogin/res --log-file=/usr/local/sunlogin/bin/debug.log --resources-dir-path=/usr/local/sunlogin/res --user-agent=SLRC/15.2.0.62802 (Linux,x64,Person,loginver=10,appname=sunloginRemoteClient) Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36
                Used GPU Memory : 4 MiB
            GPU instance ID : N/A
            Compute instance ID : N/A
            Process ID : 2799
                Type : G
                Name : /proc/self/exe
                Used GPU Memory : 4 MiB
            GPU instance ID : N/A
            Compute instance ID : N/A
            Process ID : 2961
                Type : C
                Name : python3
                Used GPU Memory : 358 MiB
        Capabilities
            EGM : disabled

Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
The hami-device-plugin container logs

    "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} }, "ResourceName": "nvidia.com/gpu", "DebugMode": null }
    I1106 07:11:17.270016 28547 main.go:272] Retrieving plugins.
    I1106 07:11:17.270786 28547 factory.go:107] Detected NVML platform: found NVML library
    I1106 07:11:17.270821 28547 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
    I1106 07:11:17.302135 28547 server.go:185] Starting GRPC server for 'nvidia.com/gpu'
    I1106 07:11:17.303025 28547 server.go:133] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
    I1106 07:11:17.304961 28547 server.go:141] Registered device plugin for 'nvidia.com/gpu' with Kubelet
    I1106 07:11:17.304996 28547 register.go:187] Starting WatchAndRegister
    I1106 07:11:17.319023 28547 register.go:132] MemoryScaling= 1 registeredmem= 24576
    I1106 07:11:17.405555 28547 register.go:160] nvml registered device id=1, memory=24576, type=NVIDIA GeForce RTX 3090, numa=0
    I1106 07:11:17.405687 28547 register.go:167] "start working on the devices" devices=[{"id":"GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282","count":15,"devmem":24576,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 3090","health":true}]
    I1106 07:11:17.453840 28547 util.go:163] Encoded node Devices: GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:
    I1106 07:11:17.453876 28547 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2024-11-06 07:11:17.453859779 +0000 UTC m=+0.198469450 hami.io/node-nvidia-register:GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:]
    I1106 07:11:17.478966 28547 register.go:197] Successfully registered annotation. Next check in 30s seconds...
    I1106 07:11:47.480078 28547 register.go:132] MemoryScaling= 1 registeredmem= 24576
    I1106 07:11:47.548145 28547 register.go:160] nvml registered device id=1, memory=24576, type=NVIDIA GeForce RTX 3090, numa=0
    I1106 07:11:47.548181 28547 register.go:167] "start working on the devices" devices=[{"id":"GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282","count":15,"devmem":24576,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 3090","health":true}]
    I1106 07:11:47.556014 28547 util.go:163] Encoded node Devices: GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:
    I1106 07:11:47.556046 28547 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2024-11-06 07:11:47.556026421 +0000 UTC m=+30.300636097 hami.io/node-nvidia-register:GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:]
    I1106 07:11:47.576159 28547 register.go:197] Successfully registered annotation. Next check in 30s seconds...
    I1106 07:12:17.576726 28547 register.go:132] MemoryScaling= 1 registeredmem= 24576

The hami-scheduler container logs
The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
Any relevant kernel output lines from dmesg

Environment:

HAMi version: v2.4.0
nvidia driver or other AI device driver version: 555.42.06
Docker version from docker version: containerd 1.6.28
Docker command, image and tag used
Kernel version from uname -a
Others:

Nimbus318 commented 1 week ago

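Please share the startup command of the device-plugin container in the hami-device-plugin DaemonSet, along with the value you deployed with for this setting in values: https://github.com/Project-HAMi/HAMi/blob/6df0d0bfa1e212a4f4db3169a3356ff19a470092/charts/hami/values.yaml#L118
If it is 1, change it to 10 and check whether you get the expected 10-way split. Once that works, restart and see whether your problem reproduces.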

stevechan1993 commented 1 week ago

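Hello, the startup command is as follows: At first I used the default 10-way split, which stopped working after a restart. I suspected the default parameter was being lost, so I later set it to 15 explicitly when installing with helm, but the split still stopped working after the cluster was restarted.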

Nimbus318 commented 1 week ago

Do you still have the environment in its failed state? If so, please post the node's annotations.

Nimbus318 commented 1 week ago

Judging from the dp log you posted at the very beginning:

    The hami-device-plugin container logs
    "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} }, "ResourceName": "nvidia.com/gpu", "DebugMode": null }
    I1106 07:11:17.270016 28547 main.go:272] Retrieving plugins.
    I1106 07:11:17.270786 28547 factory.go:107] Detected NVML platform: found NVML library
    I1106 07:11:17.270821 28547 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
    I1106 07:11:17.302135 28547 server.go:185] Starting GRPC server for 'nvidia.com/gpu'
    I1106 07:11:17.303025 28547 server.go:133] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
    I1106 07:11:17.304961 28547 server.go:141] Registered device plugin for 'nvidia.com/gpu' with Kubelet
    I1106 07:11:17.304996 28547 register.go:187] Starting WatchAndRegister
    I1106 07:11:17.319023 28547 register.go:132] MemoryScaling= 1 registeredmem= 24576
    I1106 07:11:17.405555 28547 register.go:160] nvml registered device id=1, memory=24576, type=NVIDIA GeForce RTX 3090, numa=0
    I1106 07:11:17.405687 28547 register.go:167] "start working on the devices" devices=[{"id":"GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282","count":15,"devmem":24576,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 3090","health":true}]
    I1106 07:11:17.453840 28547 util.go:163] Encoded node Devices: GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:
    I1106 07:11:17.453876 28547 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2024-11-06 07:11:17.453859779 +0000 UTC m=+0.198469450 hami.io/node-nvidia-register:GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:]
    I1106 07:11:17.478966 28547 register.go:197] Successfully registered annotation. Next check in 30s seconds...
    I1106 07:11:47.480078 28547 register.go:132] MemoryScaling= 1 registeredmem= 24576
    I1106 07:11:47.548145 28547 register.go:160] nvml registered device id=1, memory=24576, type=NVIDIA GeForce RTX 3090, numa=0
    I1106 07:11:47.548181 28547 register.go:167] "start working on the devices" devices=[{"id":"GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282","count":15,"devmem":24576,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 3090","health":true}]
    I1106 07:11:47.556014 28547 util.go:163] Encoded node Devices: GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:
    I1106 07:11:47.556046 28547 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2024-11-06 07:11:47.556026421 +0000 UTC m=+30.300636097 hami.io/node-nvidia-register:GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:]
    I1106 07:11:47.576159 28547 register.go:197] Successfully registered annotation. Next check in 30s seconds...
    I1106 07:12:17.576726 28547 register.go:132] MemoryScaling= 1 registeredmem= 24576

the device registration on this node is entirely as expected, with a split limit of 15. My initial understanding of "stops working" was that in [{"id":"GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282","count":15,"devmem":24576,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 3090","health":true}] the "count":15 had become "count":1, so that multiple pods could no longer share the card. It seems I have not really understood what the failure actually looks like. Can only one compute job use the GPU? Is the whole card allocated directly? Do new pods stay Pending? Or something else?
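
For anyone who wants to check the registered split count without reading the log by eye, the encoded value of the hami.io/node-nvidia-register annotation quoted above can be unpacked with a few lines of Python. This is only a sketch: the field order (UUID, count, devmem, devcore, type, numa, health) is inferred from the log lines in this thread, the sixth field being the NUMA node is a guess based on the numa=0 log entry, and the names RegisteredDevice and parse_node_register are made up for illustration. Other HAMi versions may encode the annotation differently.

```python
from dataclasses import dataclass


@dataclass
class RegisteredDevice:
    uuid: str
    count: int      # split limit registered for this card
    devmem: int     # device memory, as reported in the log (24576 for the 3090)
    devcore: int    # core percentage
    type: str
    numa: int       # ASSUMPTION: sixth field is the NUMA node
    health: bool


def parse_node_register(value: str) -> list[RegisteredDevice]:
    """Parse an encoded device list such as
    'GPU-...,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:'.

    Entries are assumed to be terminated by ':' and fields separated by ','.
    """
    devices = []
    for entry in value.split(":"):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty piece after the trailing ':'
        fields = entry.split(",")
        if len(fields) != 7:
            raise ValueError(f"unexpected field count in entry: {entry!r}")
        uuid, count, devmem, devcore, dev_type, numa, health = fields
        devices.append(
            RegisteredDevice(
                uuid=uuid,
                count=int(count),
                devmem=int(devmem),
                devcore=int(devcore),
                type=dev_type,
                numa=int(numa),
                health=(health == "true"),
            )
        )
    return devices


if __name__ == "__main__":
    sample = (
        "GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,"
        "NVIDIA-NVIDIA GeForce RTX 3090,0,true:"
    )
    for dev in parse_node_register(sample):
        print(dev.uuid, "count =", dev.count)
```

If the count printed here still matches the configured split while the node's allocatable nvidia.com/gpu does not, the annotation itself is fine and the discrepancy lies elsewhere.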

stevechan1993 commented 6 days ago

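Here are the node's annotations: The failure looks like this: after installing HAMi I successfully split the card into 15 vGPUs and then started 10 pods, each of which used one vGPU. After the cluster was restarted, all 10 pods that were using a vGPU went to Failed. Checking the node's GPU count with the command sudo kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" shows that it has gone back to 1.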
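
The same check can be scripted against the JSON form of the node list. The sketch below assumes the input is the output of kubectl get nodes -o json and reads status.allocatable["nvidia.com/gpu"] for each node; the function name allocatable_gpus is made up for illustration, and a node without the resource is reported as 0.

```python
import json
import sys

GPU_RESOURCE = "nvidia.com/gpu"


def allocatable_gpus(nodes_json: str, resource: str = GPU_RESOURCE) -> dict[str, int]:
    """Map node name -> allocatable count of `resource`.

    `nodes_json` is expected to be the output of `kubectl get nodes -o json`.
    """
    data = json.loads(nodes_json)
    result = {}
    for node in data.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Extended resources are reported as integer strings, e.g. "15".
        result[name] = int(allocatable.get(resource, "0"))
    return result


if __name__ == "__main__":
    # Usage: kubectl get nodes -o json | python3 this_script.py
    for node_name, gpus in allocatable_gpus(sys.stdin.read()).items():
        print(f"{node_name}\t{gpus}")
```

With a 15-way split on a single card, the expectation in this thread is that the node reports 15; a value of 1 after the restart is the symptom described above.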

stevechan1993 commented 4 days ago

Is it the case that nvidia-k8s-device-plugin must not be installed before installing HAMi?

Nimbus318 commented 3 days ago

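@stevechan1993 Yes. By default, HAMi's dp and nvidia-k8s-device-plugin use the same resource type, nvidia.com/gpu. That makes migrating from nvidia-k8s-device-plugin fairly smooth, but when both dps are present at the same time, one can overwrite the other.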
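
A quick way to spot this situation is to list every DaemonSet that looks like a device plugin. The sketch below assumes its input is the output of kubectl get daemonsets -A -o json and simply matches on the substring "device-plugin" in the DaemonSet name. That matching rule is a heuristic chosen for illustration, not something either project documents, and names depend on how each plugin was installed, so treat the result as a hint rather than proof.

```python
import json
import sys


def find_device_plugin_daemonsets(daemonsets_json: str, needle: str = "device-plugin") -> list[str]:
    """Return 'namespace/name' for every DaemonSet whose name contains `needle`.

    `daemonsets_json` is expected to be the output of
    `kubectl get daemonsets -A -o json`.
    """
    data = json.loads(daemonsets_json)
    found = []
    for item in data.get("items", []):
        meta = item.get("metadata", {})
        name = meta.get("name", "")
        if needle in name:
            found.append(f"{meta.get('namespace', '')}/{name}")
    return sorted(found)


if __name__ == "__main__":
    # Usage: kubectl get daemonsets -A -o json | python3 this_script.py
    matches = find_device_plugin_daemonsets(sys.stdin.read())
    for match in matches:
        print(match)
    if len(matches) > 1:
        print("More than one device plugin DaemonSet found; "
              "check whether they advertise the same resource name.")
```

If both the HAMi device plugin and another NVIDIA device plugin show up, removing the latter (or giving one of them a different resource name) avoids the conflict described above.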

stevechan1993 commented 3 days ago

Thank you very much. I have verified that the problem no longer occurs.

Project-HAMi / HAMi

After deploying HAMi, restarting the k8s cluster loses the previously configured vGPU split count #595