Open devenami opened 1 month ago
What happened:
The Device_utilization_desc_of_container value is 0 when nvidia.com/gpucores is not set.
Only one GPU utilization value is reported when nvidia.com/gpu="2" is set.
What you expected to happen:
The metrics data should be reported correctly, with a utilization value for each allocated GPU.
How to reproduce it (as minimally and precisely as possible):
First, create a pod using the resource config below.
Delete the pod while it is running.
Second, update the pod resources as below.
Create the pod again.
Check the metric values.
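To make the last step easier to check, here is a minimal sketch that groups Device_utilization_desc_of_container samples by pod from metrics text in the Prometheus text exposition format. The label name podname, the helper name utilization_by_pod, and the sample text are assumptions made for illustration; they are not taken from this report or from the exporter's documentation, so adjust them to what your metrics endpoint actually returns.

```python
import re
from collections import defaultdict

# One sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} 12.5
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

METRIC = "Device_utilization_desc_of_container"

# Hypothetical metrics text, made up for this example (not real exporter output).
SAMPLE = """
# TYPE Device_utilization_desc_of_container gauge
Device_utilization_desc_of_container{deviceuuid="GPU-aaaa",podname="demo"} 0
Device_utilization_desc_of_container{deviceuuid="GPU-bbbb",podname="demo"} 35
Device_memory_desc_of_container{deviceuuid="GPU-aaaa",podname="demo"} 1024
"""


def utilization_by_pod(metrics_text, pod_label="podname"):
    """Return {pod: [utilization values]}; pod_label is an assumed label name."""
    result = defaultdict(list)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != METRIC:
            continue  # skip other metrics and malformed lines
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        result[labels.get(pod_label, "")].append(float(match.group("value")))
    return dict(result)


if __name__ == "__main__":
    print(utilization_by_pod(SAMPLE))
```

With nvidia.com/gpu="2", one would expect two samples per pod. A single sample, or samples that stay at 0 while the workload is busy, would match the behavior described in this report.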
Anything else we need to know?:
nvidia-smi -a
root@vllm-travel-3:/workspace# nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Sat Oct 12 19:09:47 2024
Driver Version : 555.42.06
[HAMI-core Msg(1801:140562883144576:libvgpu.c:836)]: Initializing.....
CUDA Version : 12.5
Attached GPUs : 2
GPU 00000000:9B:00.0
    Product Name : NVIDIA L40S
    Product Brand : NVIDIA
    Product Architecture : Ada Lovelace
    Display Mode : Enabled
    Display Active : Disabled
    Persistence Mode : Disabled
    Addressing Mode : None
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 1323723028559
    GPU UUID : GPU-bb534283-8934-4c8a-4023-d744817bfbf2
    Minor Number : 4
    VBIOS Version : 95.02.66.00.02
    MultiGPU Board : No
    Board ID : 0x9b00
    Board Part Number : 900-2G133-0080-000
    GPU Part Number : 26B9-896-A1
    FRU Part Number : N/A
    Module ID : 1
    Inforom Version
        Image Version : G133.0242.00.03
        OEM Object : 2.1
        ECC Object : 6.16
        Power Management Object : N/A
    Inforom BBX Object Flush
        Latest Timestamp : N/A
        Latest Duration : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU C2C Mode : N/A
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
        vGPU Heterogeneous Mode : N/A
    GPU Reset Status
        Reset Required : No
        Drain and Reset Recommended : N/A
    GSP Firmware Version : 555.42.06
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x9B
        Device : 0x00
        Domain : 0x0000
        Base Classcode : 0x3
        Sub Classcode : 0x2
        Device Id : 0x26B910DE
        Bus Id : 00000000:9B:00.0
        Sub System Id : 0x185110DE
        GPU Link Info
            PCIe Generation
                Max : 4
                Current : 4
                Device Current : 4
                Device Max : 4
                Host Max : 4
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 2 KB/s
        Rx Throughput : 0 KB/s
        Atomic Caps Inbound : N/A
        Atomic Caps Outbound : N/A
    Fan Speed : N/A
    Performance State : P0
    Clocks Event Reasons
        Idle : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    Sparse Operation Mode : N/A
    FB Memory Usage
        Total : 46068 MiB
        Reserved : 574 MiB
        Used : 40554 MiB
        Free : 4265 MiB
    BAR1 Memory Usage
        Total : 65536 MiB
        Used : 4 MiB
        Free : 65532 MiB
    Conf Compute Protected Memory Usage
        Total : 0 MiB
        Used : 0 MiB
        Free : 0 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
        JPEG : 0 %
        OFA : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    ECC Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            SRAM Correctable : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
        Aggregate
            SRAM Correctable : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
            SRAM Threshold Exceeded : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2 : 0
            SRAM SM : 0
            SRAM Microcontroller : 0
            SRAM PCIE : 0
            SRAM Other : 0
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Remapped Rows
        Correctable Error : 0
        Uncorrectable Error : 0
        Pending : No
        Remapping Failure Occurred : No
        Bank Remap Availability Histogram
            Max : 192 bank(s)
            High : 0 bank(s)
            Partial : 0 bank(s)
            Low : 0 bank(s)
            None : 0 bank(s)
    Temperature
        GPU Current Temp : 39 C
        GPU T.Limit Temp : 49 C
        GPU Shutdown T.Limit Temp : -5 C
        GPU Slowdown T.Limit Temp : -2 C
        GPU Max Operating T.Limit Temp : 0 C
        GPU Target Temperature : N/A
        Memory Current Temp : N/A
        Memory Max Operating T.Limit Temp : N/A
    GPU Power Readings
        Power Draw : 81.14 W
        Current Power Limit : 350.00 W
        Requested Power Limit : 350.00 W
        Default Power Limit : 350.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 350.00 W
    GPU Memory Power Readings
        Power Draw : N/A
    Module Power Readings
        Power Draw : N/A
        Current Power Limit : N/A
        Requested Power Limit : N/A
        Default Power Limit : N/A
        Min Power Limit : N/A
        Max Power Limit : N/A
    Clocks
        Graphics : 2520 MHz
        SM : 2520 MHz
        Memory : 9000 MHz
        Video : 1965 MHz
    Applications Clocks
        Graphics : 2520 MHz
        Memory : 9001 MHz
    Default Applications Clocks
        Graphics : 2520 MHz
        Memory : 9001 MHz
    Deferred Clocks
        Memory : N/A
    Max Clocks
        Graphics : 2520 MHz
        SM : 2520 MHz
        Memory : 9001 MHz
        Video : 1965 MHz
    Max Customer Boost Clocks
        Graphics : 2520 MHz
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Voltage
        Graphics : 970.000 mV
    Fabric
        State : N/A
        Status : N/A
        CliqueId : N/A
        ClusterUUID : N/A
        Health
            Bandwidth : N/A
    Processes
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 1646741
            Type : C
            Name :
            Used GPU Memory : 41212 MiB
    Capabilities
        EGM : disabled
GPU 00000000:9D:00.0
    Product Name : NVIDIA L40S
    Product Brand : NVIDIA
    Product Architecture : Ada Lovelace
    Display Mode : Enabled
    Display Active : Disabled
    Persistence Mode : Disabled
    Addressing Mode : None
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 1323723028611
    GPU UUID : GPU-3c79048a-fbe0-a52f-0eaf-84739de8f317
    Minor Number : 6
    VBIOS Version : 95.02.66.00.02
    MultiGPU Board : No
    Board ID : 0x9d00
    Board Part Number : 900-2G133-0080-000
    GPU Part Number : 26B9-896-A1
    FRU Part Number : N/A
    Module ID : 1
    Inforom Version
        Image Version : G133.0242.00.03
        OEM Object : 2.1
        ECC Object : 6.16
        Power Management Object : N/A
    Inforom BBX Object Flush
        Latest Timestamp : N/A
        Latest Duration : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU C2C Mode : N/A
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
        vGPU Heterogeneous Mode : N/A
    GPU Reset Status
        Reset Required : No
        Drain and Reset Recommended : N/A
    GSP Firmware Version : 555.42.06
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x9D
        Device : 0x00
        Domain : 0x0000
        Base Classcode : 0x3
        Sub Classcode : 0x2
        Device Id : 0x26B910DE
        Bus Id : 00000000:9D:00.0
        Sub System Id : 0x185110DE
        GPU Link Info
            PCIe Generation
                Max : 4
                Current : 4
                Device Current : 4
                Device Max : 4
                Host Max : 4
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
        Atomic Caps Inbound : N/A
        Atomic Caps Outbound : N/A
    Fan Speed : N/A
    Performance State : P0
    Clocks Event Reasons
        Idle : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    Sparse Operation Mode : N/A
    FB Memory Usage
        Total : 46068 MiB
        Reserved : 574 MiB
        Used : 40508 MiB
        Free : 4311 MiB
    BAR1 Memory Usage
        Total : 65536 MiB
        Used : 4 MiB
        Free : 65532 MiB
    Conf Compute Protected Memory Usage
        Total : 0 MiB
        Used : 0 MiB
        Free : 0 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
        JPEG : 0 %
        OFA : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    ECC Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            SRAM Correctable : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
        Aggregate
            SRAM Correctable : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
            SRAM Threshold Exceeded : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2 : 0
            SRAM SM : 0
            SRAM Microcontroller : 0
            SRAM PCIE : 0
            SRAM Other : 0
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Remapped Rows
        Correctable Error : 0
        Uncorrectable Error : 0
        Pending : No
        Remapping Failure Occurred : No
        Bank Remap Availability Histogram
            Max : 192 bank(s)
            High : 0 bank(s)
            Partial : 0 bank(s)
            Low : 0 bank(s)
            None : 0 bank(s)
    Temperature
        GPU Current Temp : 40 C
        GPU T.Limit Temp : 48 C
        GPU Shutdown T.Limit Temp : -5 C
        GPU Slowdown T.Limit Temp : -2 C
        GPU Max Operating T.Limit Temp : 0 C
        GPU Target Temperature : N/A
        Memory Current Temp : N/A
        Memory Max Operating T.Limit Temp : N/A
    GPU Power Readings
        Power Draw : 82.76 W
        Current Power Limit : 350.00 W
        Requested Power Limit : 350.00 W
        Default Power Limit : 350.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 350.00 W
    GPU Memory Power Readings
        Power Draw : N/A
    Module Power Readings
        Power Draw : N/A
        Current Power Limit : N/A
        Requested Power Limit : N/A
        Default Power Limit : N/A
        Min Power Limit : N/A
        Max Power Limit : N/A
    Clocks
        Graphics : 2520 MHz
        SM : 2520 MHz
        Memory : 9000 MHz
        Video : 1965 MHz
    Applications Clocks
        Graphics : 2520 MHz
        Memory : 9001 MHz
    Default Applications Clocks
        Graphics : 2520 MHz
        Memory : 9001 MHz
    Deferred Clocks
        Memory : N/A
    Max Clocks
        Graphics : 2520 MHz
        SM : 2520 MHz
        Memory : 9001 MHz
        Video : 1965 MHz
    Max Customer Boost Clocks
        Graphics : 2520 MHz
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Voltage
        Graphics : 1005.000 mV
    Fabric
        State : N/A
        Status : N/A
        CliqueId : N/A
        ClusterUUID : N/A
        Health
            Bandwidth : N/A
    Processes
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 1648336
            Type : C
            Name :
            Used GPU Memory : 41166 MiB
    Capabilities
        EGM : disabled
[HAMI-core Msg(1801:140562883144576:multiprocess_memory_limit.c:468)]: Calling exit handler 1801
/etc/docker/daemon.json
sudo journalctl -r -u kubelet
dmesg
Environment:
docker version
uname -a