Closed stevechan1993 closed 2 days ago
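Please post the startup command of the device-plugin container in the hami-device-plugin ds (DaemonSet), and the value this setting had in your values file at deployment time: https://github.com/Project-HAMi/HAMi/blob/6df0d0bfa1e212a4f4db3169a3356ff19a470092/charts/hami/values.yaml#L118
If it is 1, change it to 10 and check whether you get the expected 10-way split. Once that works, restart and see whether you can reproduce the problem.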
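Hello, the startup command is as follows:
At first I used the default 10-way split, and it stopped working after a restart. I suspected the default parameter was being lost, so I then set it explicitly to 15 when installing with Helm, but the split still stopped working after the cluster was restarted.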
Do you still have the environment in its failed state? If so, please post the node's annotations.
From the dp (device plugin) log you posted at the very beginning:
The hami-device-plugin container logs
"name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} }, "ResourceName": "nvidia.com/gpu", "DebugMode": null }
I1106 07:11:17.270016 28547 main.go:272] Retrieving plugins.
I1106 07:11:17.270786 28547 factory.go:107] Detected NVML platform: found NVML library
I1106 07:11:17.270821 28547 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I1106 07:11:17.302135 28547 server.go:185] Starting GRPC server for 'nvidia.com/gpu'
I1106 07:11:17.303025 28547 server.go:133] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1106 07:11:17.304961 28547 server.go:141] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I1106 07:11:17.304996 28547 register.go:187] Starting WatchAndRegister
I1106 07:11:17.319023 28547 register.go:132] MemoryScaling= 1 registeredmem= 24576
I1106 07:11:17.405555 28547 register.go:160] nvml registered device id=1, memory=24576, type=NVIDIA GeForce RTX 3090, numa=0
I1106 07:11:17.405687 28547 register.go:167] "start working on the devices" devices=[{"id":"GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282","count":15,"devmem":24576,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 3090","health":true}]
I1106 07:11:17.453840 28547 util.go:163] Encoded node Devices: GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:
I1106 07:11:17.453876 28547 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2024-11-06 07:11:17.453859779 +0000 UTC m=+0.198469450 hami.io/node-nvidia-register:GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:]
I1106 07:11:17.478966 28547 register.go:197] Successfully registered annotation. Next check in 30s seconds...
I1106 07:11:47.480078 28547 register.go:132] MemoryScaling= 1 registeredmem= 24576
I1106 07:11:47.548145 28547 register.go:160] nvml registered device id=1, memory=24576, type=NVIDIA GeForce RTX 3090, numa=0
I1106 07:11:47.548181 28547 register.go:167] "start working on the devices" devices=[{"id":"GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282","count":15,"devmem":24576,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 3090","health":true}]
I1106 07:11:47.556014 28547 util.go:163] Encoded node Devices: GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:
I1106 07:11:47.556046 28547 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2024-11-06 07:11:47.556026421 +0000 UTC m=+30.300636097 hami.io/node-nvidia-register:GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:]
I1106 07:11:47.576159 28547 register.go:197] Successfully registered annotation. Next check in 30s seconds...
I1106 07:12:17.576726 28547 register.go:132] MemoryScaling= 1 registeredmem= 24576
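the device registration on this node all looks as expected, with a split limit of 15. What I first understood by "stopped working" was that in this entry
[{"id":"GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282","count":15,"devmem":24576,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 3090","health":true}]
the "count":15 had become "count":1, so that multiple pods could no longer share the GPU. It seems I have not really understood what the failure looks like. Can only one compute task use the GPU? Is the whole card being allocated directly? Are new pods stuck in Pending? Or is it something else?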
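To check the `count` field without reading the raw log, the value of the `hami.io/node-nvidia-register` annotation can be split into fields. The sketch below is only an illustration: the field order (id, count, devmem, devcore, type, numa, health) is inferred from the log lines quoted in this thread, not taken from HAMi's source, and `RegisteredDevice` and `parse_node_register` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegisteredDevice:
    uuid: str
    count: int    # split limit, i.e. how many shares the GPU is divided into
    devmem: int   # registered device memory (24576 for the RTX 3090 in the log)
    devcore: int
    type: str
    numa: int
    health: bool


def parse_node_register(annotation: str) -> List[RegisteredDevice]:
    """Parse a hami.io/node-nvidia-register value (field order assumed from the log)."""
    devices = []
    # Each device entry ends with ':'; the empty piece after the last ':' is skipped.
    for entry in annotation.split(":"):
        if not entry.strip():
            continue
        fields = entry.split(",")
        if len(fields) != 7:
            raise ValueError(f"expected 7 fields, got {len(fields)}: {entry!r}")
        uuid, count, devmem, devcore, dev_type, numa, health = fields
        devices.append(RegisteredDevice(
            uuid=uuid,
            count=int(count),
            devmem=int(devmem),
            devcore=int(devcore),
            type=dev_type,
            numa=int(numa),
            health=(health == "true"),
        ))
    return devices


if __name__ == "__main__":
    value = ("GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,"
             "NVIDIA-NVIDIA GeForce RTX 3090,0,true:")
    for dev in parse_node_register(value):
        print(dev.uuid, dev.count)
```

If `count` still reads 15 here after the restart, the registration side is fine and the problem lies elsewhere.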
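Here are the node's annotations:
The failure looks like this: after installing HAMi I successfully split the GPU into 15 vGPUs and then started 10 pods, each using one vGPU. After I restarted the cluster, all 10 of those pods went to Failed. When I checked the node's GPU count with the command: sudo kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" it had gone back to 1.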
Does this mean nvidia-k8s-device-plugin must not be installed before installing HAMi?
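@stevechan1993 Yes. By default, HAMi's dp and nvidia-k8s-device-plugin use the same resource type, nvidia.com/gpu. That makes migrating from nvidia-k8s-device-plugin fairly smooth, but if both dps are present at the same time they can overwrite each other.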
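One way to spot this kind of overwrite is to compare what the kubelet reports as allocatable with what HAMi's dp registered in the node annotation. The sketch below assumes that a healthy node advertises the sum of the `count` fields as `nvidia.com/gpu`, which is what this thread implies but is not checked against HAMi's source. `vgpu_mismatch` is a hypothetical helper, and its input is the node object as returned by `kubectl get node <name> -o json`.

```python
import json


def vgpu_mismatch(node: dict, resource: str = "nvidia.com/gpu"):
    """Return (allocatable, expected) for one node object.

    'expected' sums the second field (count) of every entry in the
    hami.io/node-nvidia-register annotation; that field position is an
    assumption based on the log lines in this thread.
    """
    annotations = node.get("metadata", {}).get("annotations", {})
    register = annotations.get("hami.io/node-nvidia-register", "")
    expected = sum(
        int(entry.split(",")[1])
        for entry in register.split(":")
        if entry.strip()
    )
    allocatable = int(node.get("status", {}).get("allocatable", {}).get(resource, "0"))
    return allocatable, expected


if __name__ == "__main__":
    # Hand-written node object mirroring the symptom reported in this thread.
    node = json.loads("""
    {
      "metadata": {"annotations": {"hami.io/node-nvidia-register":
        "GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:"}},
      "status": {"allocatable": {"nvidia.com/gpu": "1"}}
    }
    """)
    allocatable, expected = vgpu_mismatch(node)
    if allocatable != expected:
        print(f"allocatable={allocatable}, registered count={expected}: "
              "another device plugin may be advertising the same resource")
```

A mismatch like this only shows that something else is writing `nvidia.com/gpu`. Confirming that it is nvidia-k8s-device-plugin still means checking which DaemonSets are running on the node.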
Thank you very much. I have verified it and the problem is gone.
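What happened: I deployed HAMi following the official documentation and split a single 3090 GPU into 10 shares. After deployment everything tested fine, and multiple tasks could use the GPU for computation at the same time. However, after the k8s cluster was restarted, the GPU configuration went back to the number of physical GPUs. At that point both HAMi pods were in the Running state: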
What you expected to happen:
How to reproduce it (as minimally and precisely as possible): Deploy HAMi, then restart the entire k8s cluster.
Anything else we need to know?:
Output of nvidia-smi -a on your host:
==============NVSMI LOG==============
Timestamp : Wed Nov 6 11:28:02 2024
Driver Version : 555.42.06
CUDA Version : 12.5
Attached GPUs : 1
GPU 00000000:03:00.0
    Product Name : NVIDIA GeForce RTX 3090
    Product Brand : GeForce
    Product Architecture : Ampere
    Display Mode : Enabled
    Display Active : Enabled
    Persistence Mode : Disabled
    Addressing Mode : None
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282
    Minor Number : 0
    VBIOS Version : 94.02.42.80.B8
    MultiGPU Board : No
    Board ID : 0x300
    Board Part Number : N/A
    GPU Part Number : 2204-300-A1
    FRU Part Number : N/A
    Module ID : 1
    Inforom Version
        Image Version : G001.0000.03.03
        OEM Object : 2.0
        ECC Object : N/A
        Power Management Object : N/A
    Inforom BBX Object Flush
        Latest Timestamp : N/A
        Latest Duration : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU C2C Mode : N/A
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
        vGPU Heterogeneous Mode : N/A
    GPU Reset Status
        Reset Required : No
        Drain and Reset Recommended : N/A
    GSP Firmware Version : 555.42.06
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x03
        Device : 0x00
        Domain : 0x0000
        Base Classcode : 0x3
        Sub Classcode : 0x0
        Device Id : 0x220410DE
        Bus Id : 00000000:03:00.0
        Sub System Id : 0x145410DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 1
                Device Current : 1
                Device Max : 4
                Host Max : 3
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 1 KB/s
        Rx Throughput : 0 KB/s
        Atomic Caps Inbound : N/A
        Atomic Caps Outbound : N/A
    Fan Speed : 0 %
    Performance State : P8
    Clocks Event Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    Sparse Operation Mode : N/A
    FB Memory Usage
        Total : 24576 MiB
        Reserved : 416 MiB
        Used : 977 MiB
        Free : 23185 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 12 MiB
        Free : 244 MiB
    Conf Compute Protected Memory Usage
        Total : 0 MiB
        Used : 0 MiB
        Free : 0 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 2 %
        Encoder : 0 %
        Decoder : 0 %
        JPEG : 0 %
        OFA : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    ECC Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable Parity : N/A
            SRAM Uncorrectable SEC-DED : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable Parity : N/A
            SRAM Uncorrectable SEC-DED : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
            SRAM Threshold Exceeded : N/A
        Aggregate Uncorrectable SRAM Sources
            SRAM L2 : N/A
            SRAM SM : N/A
            SRAM Microcontroller : N/A
            SRAM PCIE : N/A
            SRAM Other : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Remapped Rows : N/A
    Temperature
        GPU Current Temp : 35 C
        GPU T.Limit Temp : N/A
        GPU Shutdown Temp : 98 C
        GPU Slowdown Temp : 95 C
        GPU Max Operating Temp : 93 C
        GPU Target Temperature : 83 C
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    GPU Power Readings
        Power Draw : 20.40 W
        Current Power Limit : 420.00 W
        Requested Power Limit : 420.00 W
        Default Power Limit : 420.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 450.00 W
    GPU Memory Power Readings
        Power Draw : N/A
    Module Power Readings
        Power Draw : N/A
        Current Power Limit : N/A
        Requested Power Limit : N/A
        Default Power Limit : N/A
        Min Power Limit : N/A
        Max Power Limit : N/A
    Clocks
        Graphics : 0 MHz
        SM : 0 MHz
        Memory : 405 MHz
        Video : 555 MHz
    Applications Clocks
        Graphics : N/A
        Memory : N/A
    Default Applications Clocks
        Graphics : N/A
        Memory : N/A
    Deferred Clocks
        Memory : N/A
    Max Clocks
        Graphics : 2160 MHz
        SM : 2160 MHz
        Memory : 9751 MHz
        Video : 1950 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Voltage
        Graphics : 0.000 mV
    Fabric
        State : N/A
        Status : N/A
        CliqueId : N/A
        ClusterUUID : N/A
        Health
            Bandwidth : N/A
    Processes
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 1461
            Type : G
            Name : /usr/lib/xorg/Xorg
            Used GPU Memory : 214 MiB
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 1622
            Type : C+G
            Name : /usr/libexec/gnome-remote-desktop-daemon
            Used GPU Memory : 258 MiB
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 1692
            Type : G
            Name : /usr/bin/gnome-shell
            Used GPU Memory : 76 MiB
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 2034
            Type : G
            Name : /usr/local/sunlogin/bin/sunloginclient --cmd=autorun
            Used GPU Memory : 13 MiB
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 2308
            Type : G
            Name : /usr/local/sunlogin/bin/sunloginclient --type=zygote --no-sandbox --lang=en-US --locales-dir-path=/usr/local/sunlogin/res --log-file=/usr/local/sunlogin/bin/debug.log --resources-dir-path=/usr/local/sunlogin/res --user-agent=SLRC/15.2.0.62802 (Linux,x64,Person,loginver=10,appname=sunloginRemoteClient) Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36
            Used GPU Memory : 4 MiB
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 2799
            Type : G
            Name : /proc/self/exe
            Used GPU Memory : 4 MiB
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 2961
            Type : C
            Name : python3
            Used GPU Memory : 358 MiB
    Capabilities
        EGM : disabled
/etc/docker/daemon.json
The hami-device-plugin container logs
"name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} }, "ResourceName": "nvidia.com/gpu", "DebugMode": null }
I1106 07:11:17.270016 28547 main.go:272] Retrieving plugins.
I1106 07:11:17.270786 28547 factory.go:107] Detected NVML platform: found NVML library
I1106 07:11:17.270821 28547 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I1106 07:11:17.302135 28547 server.go:185] Starting GRPC server for 'nvidia.com/gpu'
I1106 07:11:17.303025 28547 server.go:133] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1106 07:11:17.304961 28547 server.go:141] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I1106 07:11:17.304996 28547 register.go:187] Starting WatchAndRegister
I1106 07:11:17.319023 28547 register.go:132] MemoryScaling= 1 registeredmem= 24576
I1106 07:11:17.405555 28547 register.go:160] nvml registered device id=1, memory=24576, type=NVIDIA GeForce RTX 3090, numa=0
I1106 07:11:17.405687 28547 register.go:167] "start working on the devices" devices=[{"id":"GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282","count":15,"devmem":24576,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 3090","health":true}]
I1106 07:11:17.453840 28547 util.go:163] Encoded node Devices: GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:
I1106 07:11:17.453876 28547 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2024-11-06 07:11:17.453859779 +0000 UTC m=+0.198469450 hami.io/node-nvidia-register:GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:]
I1106 07:11:17.478966 28547 register.go:197] Successfully registered annotation. Next check in 30s seconds...
I1106 07:11:47.480078 28547 register.go:132] MemoryScaling= 1 registeredmem= 24576
I1106 07:11:47.548145 28547 register.go:160] nvml registered device id=1, memory=24576, type=NVIDIA GeForce RTX 3090, numa=0
I1106 07:11:47.548181 28547 register.go:167] "start working on the devices" devices=[{"id":"GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282","count":15,"devmem":24576,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 3090","health":true}]
I1106 07:11:47.556014 28547 util.go:163] Encoded node Devices: GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:
I1106 07:11:47.556046 28547 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2024-11-06 07:11:47.556026421 +0000 UTC m=+30.300636097 hami.io/node-nvidia-register:GPU-19ec7951-8ef1-7ae2-6b27-6e38506c7282,15,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:]
I1106 07:11:47.576159 28547 register.go:197] Successfully registered annotation. Next check in 30s seconds...
I1106 07:12:17.576726 28547 register.go:132] MemoryScaling= 1 registeredmem= 24576
sudo journalctl -r -u kubelet
dmesg
Environment:
docker version
Containerd Version: 1.6.28
uname -a