Open guleng opened 3 years ago
I don't think I understand the issue you are reporting. Are you expecting a different status for the dcgm-exporter pod?
@glowkey The service statistics using GPU card are wrong. I have multiple jobs occupying GPU cards, but they release the GPU card when the task is completed, but the exporter calculates that they are occupying the GPU card. metrics name is (DCGM_FI_DEV_FB_FREE)
@guleng nvidia-smi -q
will also display the FB memory used/free. You can use that to verify the information displayed by the exporter. If they are different please include examples and outputs.
When I checked the usage of Gou in k8s, I saw that only four cards were used What I found in dcgm exporter is that five are used
deepblue@gpu-178:~$ nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Wed Sep 29 10:16:18 2021
Driver Version : 460.32.03
CUDA Version : 11.2
Attached GPUs : 8
GPU 00000000:3D:00.0
Product Name : Tesla P100-PCIE-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321619055326
GPU UUID : GPU-8db102ce-42a0-03d6-f16b-a9c7f3c415f7
Minor Number : 0
VBIOS Version : 86.00.4D.00.01
MultiGPU Board : No
Board ID : 0x3d00
GPU Part Number : 900-2H400-0000-000
Inforom Version
Image Version : H400.0201.00.08
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x3D
Device : 0x00
Domain : 0x0000
Device Id : 0x15F810DE
Bus Id : 00000000:3D:00.0
Sub System Id : 0x118F10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16280 MiB
Used : 0 MiB
Free : 16280 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending Page Blacklist : No
Remapped Rows : N/A
Temperature
GPU Current Temp : 24 C
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 25.58 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 405 MHz
SM : 405 MHz
Memory : 715 MHz
Video : 835 MHz
Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Max Clocks
Graphics : 1328 MHz
SM : 1328 MHz
Memory : 715 MHz
Video : 1328 MHz
Max Customer Boost Clocks
Graphics : 1328 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
GPU 00000000:3E:00.0
Product Name : Tesla P100-PCIE-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321619055580
GPU UUID : GPU-aac9cf08-dd4e-9f7b-7fa8-8e1b8ecaa566
Minor Number : 1
VBIOS Version : 86.00.4D.00.01
MultiGPU Board : No
Board ID : 0x3e00
GPU Part Number : 900-2H400-0000-000
Inforom Version
Image Version : H400.0201.00.08
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x3E
Device : 0x00
Domain : 0x0000
Device Id : 0x15F810DE
Bus Id : 00000000:3E:00.0
Sub System Id : 0x118F10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16280 MiB
Used : 0 MiB
Free : 16280 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending Page Blacklist : No
Remapped Rows : N/A
Temperature
GPU Current Temp : 23 C
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 25.82 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 405 MHz
SM : 405 MHz
Memory : 715 MHz
Video : 835 MHz
Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Max Clocks
Graphics : 1328 MHz
SM : 1328 MHz
Memory : 715 MHz
Video : 1328 MHz
Max Customer Boost Clocks
Graphics : 1328 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
GPU 00000000:40:00.0
Product Name : Tesla P100-PCIE-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321619054568
GPU UUID : GPU-5d40d611-0236-e9ba-0abc-dd5a69a90596
Minor Number : 2
VBIOS Version : 86.00.4D.00.01
MultiGPU Board : No
Board ID : 0x4000
GPU Part Number : 900-2H400-0000-000
Inforom Version
Image Version : H400.0201.00.08
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x40
Device : 0x00
Domain : 0x0000
Device Id : 0x15F810DE
Bus Id : 00000000:40:00.0
Sub System Id : 0x118F10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16280 MiB
Used : 0 MiB
Free : 16280 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending Page Blacklist : No
Remapped Rows : N/A
Temperature
GPU Current Temp : 24 C
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 27.05 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 405 MHz
SM : 405 MHz
Memory : 715 MHz
Video : 835 MHz
Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Max Clocks
Graphics : 1328 MHz
SM : 1328 MHz
Memory : 715 MHz
Video : 1328 MHz
Max Customer Boost Clocks
Graphics : 1328 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
GPU 00000000:41:00.0
Product Name : Tesla P100-PCIE-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324518129938
GPU UUID : GPU-e838377e-3604-e4ac-839d-362d51dc641d
Minor Number : 3
VBIOS Version : 86.00.4D.00.01
MultiGPU Board : No
Board ID : 0x4100
GPU Part Number : 900-2H400-0000-000
Inforom Version
Image Version : H400.0201.00.08
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x41
Device : 0x00
Domain : 0x0000
Device Id : 0x15F810DE
Bus Id : 00000000:41:00.0
Sub System Id : 0x118F10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16280 MiB
Used : 1055 MiB
Free : 15225 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending Page Blacklist : No
Remapped Rows : N/A
Temperature
GPU Current Temp : 28 C
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 31.17 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 1189 MHz
SM : 1189 MHz
Memory : 715 MHz
Video : 1075 MHz
Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Max Clocks
Graphics : 1328 MHz
SM : 1328 MHz
Memory : 715 MHz
Video : 1328 MHz
Max Customer Boost Clocks
Graphics : 1328 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 180997
Type : C
Name : python
Used GPU Memory : 1053 MiB
GPU 00000000:B1:00.0
Product Name : Tesla P100-PCIE-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321619055598
GPU UUID : GPU-947d6215-69d2-5f3c-17a8-23e255c3d02f
Minor Number : 4
VBIOS Version : 86.00.4D.00.01
MultiGPU Board : No
Board ID : 0xb100
GPU Part Number : 900-2H400-0000-000
Inforom Version
Image Version : H400.0201.00.08
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xB1
Device : 0x00
Domain : 0x0000
Device Id : 0x15F810DE
Bus Id : 00000000:B1:00.0
Sub System Id : 0x118F10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16280 MiB
Used : 0 MiB
Free : 16280 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending Page Blacklist : No
Remapped Rows : N/A
Temperature
GPU Current Temp : 24 C
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 24.61 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 405 MHz
SM : 405 MHz
Memory : 715 MHz
Video : 835 MHz
Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Max Clocks
Graphics : 1328 MHz
SM : 1328 MHz
Memory : 715 MHz
Video : 1328 MHz
Max Customer Boost Clocks
Graphics : 1328 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
GPU 00000000:B2:00.0
Product Name : Tesla P100-PCIE-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321619055222
GPU UUID : GPU-608e0214-1c9c-e52d-7559-a18e30803830
Minor Number : 5
VBIOS Version : 86.00.4D.00.01
MultiGPU Board : No
Board ID : 0xb200
GPU Part Number : 900-2H400-0000-000
Inforom Version
Image Version : H400.0201.00.08
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xB2
Device : 0x00
Domain : 0x0000
Device Id : 0x15F810DE
Bus Id : 00000000:B2:00.0
Sub System Id : 0x118F10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16280 MiB
Used : 0 MiB
Free : 16280 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending Page Blacklist : No
Remapped Rows : N/A
Temperature
GPU Current Temp : 24 C
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 26.56 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 405 MHz
SM : 405 MHz
Memory : 715 MHz
Video : 835 MHz
Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Max Clocks
Graphics : 1328 MHz
SM : 1328 MHz
Memory : 715 MHz
Video : 1328 MHz
Max Customer Boost Clocks
Graphics : 1328 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
GPU 00000000:B4:00.0
Product Name : Tesla P100-PCIE-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321619054891
GPU UUID : GPU-358cc925-7017-36cc-5a44-25b679ff6453
Minor Number : 6
VBIOS Version : 86.00.4D.00.01
MultiGPU Board : No
Board ID : 0xb400
GPU Part Number : 900-2H400-0000-000
Inforom Version
Image Version : H400.0201.00.08
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xB4
Device : 0x00
Domain : 0x0000
Device Id : 0x15F810DE
Bus Id : 00000000:B4:00.0
Sub System Id : 0x118F10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16280 MiB
Used : 0 MiB
Free : 16280 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending Page Blacklist : No
Remapped Rows : N/A
Temperature
GPU Current Temp : 24 C
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 25.33 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 405 MHz
SM : 405 MHz
Memory : 715 MHz
Video : 835 MHz
Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Max Clocks
Graphics : 1328 MHz
SM : 1328 MHz
Memory : 715 MHz
Video : 1328 MHz
Max Customer Boost Clocks
Graphics : 1328 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
GPU 00000000:B5:00.0
Product Name : Tesla P100-PCIE-16GB
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0321619054768
GPU UUID : GPU-17375115-1b53-f390-4496-3f2ab536acb4
Minor Number : 7
VBIOS Version : 86.00.4D.00.01
MultiGPU Board : No
Board ID : 0xb500
GPU Part Number : 900-2H400-0000-000
Inforom Version
Image Version : H400.0201.00.08
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xB5
Device : 0x00
Domain : 0x0000
Device Id : 0x15F810DE
Bus Id : 00000000:B5:00.0
Sub System Id : 0x118F10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16280 MiB
Used : 0 MiB
Free : 16280 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : N/A
L2 Cache : 0
Texture Memory : 0
Texture Shared : 0
CBU : N/A
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending Page Blacklist : No
Remapped Rows : N/A
Temperature
GPU Current Temp : 27 C
GPU Shutdown Temp : 85 C
GPU Slowdown Temp : 82 C
GPU Max Operating Temp : N/A
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 26.55 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 405 MHz
SM : 405 MHz
Memory : 715 MHz
Video : 835 MHz
Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Default Applications Clocks
Graphics : 1189 MHz
Memory : 715 MHz
Max Clocks
Graphics : 1328 MHz
SM : 1328 MHz
Memory : 715 MHz
Video : 1328 MHz
Max Customer Boost Clocks
Graphics : 1328 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
The pod is in the completed state and cannot be counted as GPU card
This should be caused by the lack of judgment when the exporter obtains from /var/lib/kubelet/pod-resources
@guleng,
It's not clear what do you want to change here.
The resource will be associated with a pod until the pod is terminated.
See some K8s sources for reference:
WBR, Nik
The meaning is very simple. The task service starts to apply for GPU Card Association, and then ends to complete the task. This means that it is time to release the GPU card. Viewing in the k8s cluster and the nvidia-smi command show that the GPU card is in free state, but the metrics given by the dcgm-exporter are the completed job and the GPU card or in associated state.
My command to deploy dcgm exporter is:kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml
I suspect that the reason for this problem is that the metrics to obtain the information related to the pod and GPU card is not determined when it is obtained from here?
Hi @guleng,
The dcgm-exporter acquires the information related to pod resources on every metrics request via POD API - that very /var/lib/kubelet/pod-resources socket. We do not cache or store that information anywhere but the metrics' labels.
K8s does not terminate a pod instantly once a container is completed. There might be another task that k8s could assign to the same pod in a short time, so k8s keeps the pod for a while. And resources are allocated for pods, not for containers. Until k8s decides to terminate a pod, the resources will remain allocated and assigned to that pod. I referenced the exact locations where k8s makes decisions on resources deallocation in a comment above.
There is nothing dcgm-exporter can do about this - we have to trust the information k8s returns us.
To emphasize this again: a completed job does not mean the resource (a GPU allocated to a pod) is freed. Only when the pod is terminated, the resource is deallocated.
WBR, Nik
@nikkon-dev Hey buddy, I followed the code link you provided earlier to trace the device release process in Kubelet, and it seems there are no issues.
However, I've encountered the same problem: the metric update is a few minutes slower than the resource release seen through kubectl. I want to know where this delay of several minutes is occurring. In short, when a pod is completed, the GPU resources are immediately visible as released through kubectl, but it takes more than 2 minutes to see the metric update through dcgm-exporter. If you know the reason, please let me know. Thanks a lot.
I have a pod in status Completed, and I use a GPU card ‘kubectl describe node gpu-178‘ View and from exporte dissimilarity,Obviously, dcgm exporter has included the cards of the completed pod