Open marceloamaral opened 7 months ago
@nikkon-dev, whenever you have time, could you please take a look at this issue?
Could you provide the output of nvidia-smi and nvidia-smi -q?
nvidia-smi
Wed Feb 14 08:49:09 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:        Off |                   On |
| N/A   33C    P0              85W / 400W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  3    1   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    2   0   1  |              49MiB / 40192MiB  | 56      0 |  4   0    2    0    0 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
nvidia-smi -q -i 3
==============NVSMI LOG==============
Timestamp : Wed Feb 14 08:47:31 2024
Driver Version : 535.104.05
CUDA Version : 12.2
Attached GPUs : 8
GPU 00000000:
Product Name : NVIDIA A100-SXM4-80GB
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : Enabled
Pending : Enabled
MIG Device
Index : 0
GPU Instance ID : 1
Compute Instance ID : 0
Device Attributes
Shared
Multiprocessor count : 42
Copy Engine count : 3
Encoder count : 0
Decoder count : 2
OFA count : 0
JPG count : 0
ECC Errors
Volatile
SRAM Uncorrectable : 0
FB Memory Usage
Total : 40192 MiB
Reserved : 0 MiB
Used : 37 MiB
Free : 40154 MiB
BAR1 Memory
Total : 65535 MiB
Used : 0 MiB
Free : 65535 MiB
MIG Device
Index : 1
GPU Instance ID : 2
Compute Instance ID : 0
Device Attributes
Shared
Multiprocessor count : 56
Copy Engine count : 4
Encoder count : 0
Decoder count : 2
OFA count : 0
JPG count : 0
ECC Errors
Volatile
SRAM Uncorrectable : 0
FB Memory Usage
Total : 40192 MiB
Reserved : 0 MiB
Used : 49 MiB
Free : 40142 MiB
BAR1 Memory
Total : 65535 MiB
Used : 0 MiB
Free : 65535 MiB
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : XXXXXX
GPU UUID : GPU-072b19bd-4ccf-63cf-6435-cf462aff833d
Minor Number : 5
VBIOS Version : 92.00.45.00.05
MultiGPU Board : No
Board ID : 0xa04
Board Part Number : XXXX
GPU Part Number : XXXX
FRU Part Number : N/A
Module ID : 8
Inforom Version
Image Version : G506.0210.00.03
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.104.05
GPU Virtualization Mode
Virtualization Mode : Pass-Through
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x0A
Device : 0x04
Domain : 0x0000
Device Id : 0x20B210DE
Bus Id : 00000000:0A:04.0
Sub System Id : 0x146310DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Device Current : 4
Device Max : 4
Host Max : N/A
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : Insufficient Permissions
Reserved : Insufficient Permissions
Used : Insufficient Permissions
Free : Insufficient Permissions
BAR1 Memory Usage
Total : Insufficient Permissions
Used : Insufficient Permissions
Free : Insufficient Permissions
Conf Compute Protected Memory Usage
Total : Insufficient Permissions
Used : Insufficient Permissions
Free : Insufficient Permissions
Compute Mode : Default
Utilization
Gpu : N/A
Memory : N/A
Encoder : N/A
Decoder : N/A
JPEG : N/A
OFA : N/A
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 33 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 47 C
Memory Max Operating Temp : 95 C
GPU Power Readings
Power Draw : 85.09 W
Current Power Limit : 400.00 W
Requested Power Limit : 400.00 W
Default Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1593 MHz
Video : 1275 MHz
Applications Clocks
Graphics : 1155 MHz
Memory : 1593 MHz
Default Applications Clocks
Graphics : 1155 MHz
Memory : 1593 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1593 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 893.750 mV
Fabric
State : N/A
Status : N/A
Processes : None
@nikkon-dev ^^
Tensor Core utilization is also higher than 1.0
dcgmi dmon -e 1002,1003,1004,155 -g 18
#Entity   SMACT   SMOCC   TENSO   POWER
ID                                W
GPU-I 21  1.309   0.164   1.255   400.336
GPU-I 22  0.977   0.122   0.933   400.336
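A quick way to spot out-of-range readings like the ones in this output is to parse the dcgmi dmon rows and flag any ratio metric above 1.0. The sketch below is only illustrative: parse_dmon and out_of_range are hypothetical helper names, the sample text is copied from the output in this thread, and it assumes that SMACT, SMOCC and TENSO are all meant to be ratios in the range [0, 1].

```python
# Sample copied from the dcgmi dmon output in this thread.
SAMPLE = """\
#Entity SMACT SMOCC TENSO POWER
ID W
GPU-I 21 1.309 0.164 1.255 400.336
GPU-I 22 0.977 0.122 0.933 400.336
"""

# ASSUMPTION: these columns are ratios and should never exceed 1.0.
RATIO_COLUMNS = ("SMACT", "SMOCC", "TENSO")


def parse_dmon(text):
    """Return {entity: {column: value}} from whitespace-separated dmon output."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header = lines[0][1:]  # drop the leading "#Entity" token
    rows = {}
    for tokens in lines[1:]:
        # Data rows look like: GPU-I <id> <value> <value> ...
        # The units line ("ID W") has too few tokens and is skipped.
        if len(tokens) != len(header) + 2:
            continue
        entity = f"{tokens[0]} {tokens[1]}"
        rows[entity] = dict(zip(header, map(float, tokens[2:])))
    return rows


def out_of_range(rows, limit=1.0):
    """List (entity, column, value) for every ratio column above the limit."""
    return [
        (entity, col, values[col])
        for entity, values in rows.items()
        for col in RATIO_COLUMNS
        if values[col] > limit
    ]


if __name__ == "__main__":
    for entity, col, value in out_of_range(parse_dmon(SAMPLE)):
        print(f"{entity}: {col} = {value} exceeds 1.0")
```

On the sample above, only entity GPU-I 21 is flagged, for both SMACT and TENSO; GPU-I 22 stays below 1.0 on every ratio column.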
@nikkon-dev any progress on that?
I am trying to understand the following metrics for MIG instances:
In my experiment I am running the following workload:
dcgmproftester12 --no-dcgm-validation -i 3 -t 1004 -d 120
GPU 3 is configured with two MIG instances.
To collect the metrics, I am running the following command:
What I don't understand is the metric DCGM_FI_PROF_SM_ACTIVE / SMACT / 1002, which is the ratio of cycles in which the SM is active to the total number of cycles, so a value of 1 means 100% utilization. However, for MIG, dcgmi dmon is showing ~1.3, which would be 130% utilization. So what does it mean? It could indicate that one MIG instance is using more SMs than it has been allocated, but as far as I understand, one MIG instance cannot access the SMs of another MIG instance.
So, is it a bug, or is there a different interpretation for this metric?
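One consistency check that may help narrow this down is to compare the ratio of the two reported SMACT values with the ratio of the SM counts of the two MIG instances. The sketch below uses only numbers posted in this thread (42 and 56 SMs from nvidia-smi -q, 1.309 and 0.977 from dcgmi dmon). The mapping of entity GPU-I 21 to the 56-SM instance is an assumption, since the thread does not state which entity ID belongs to which MIG device. This is a numeric observation, not a confirmed explanation of how DCGM normalizes the metric.

```python
# Values copied from the outputs in this thread.
SM_COUNT = {"MIG device 0": 42, "MIG device 1": 56}  # from nvidia-smi -q
SMACT = {"GPU-I 21": 1.309, "GPU-I 22": 0.977}       # from dcgmi dmon

# Ratio of the two instance sizes, and of the two reported SMACT values.
sm_ratio = SM_COUNT["MIG device 1"] / SM_COUNT["MIG device 0"]
smact_ratio = SMACT["GPU-I 21"] / SMACT["GPU-I 22"]

# ASSUMPTION (not confirmed in this thread): GPU-I 21 is the 56-SM instance.
# If its reading had been divided by 42 SMs instead of 56, rescaling by 42/56
# would bring it back into the expected [0, 1] range.
rescaled = SMACT["GPU-I 21"] * SM_COUNT["MIG device 0"] / SM_COUNT["MIG device 1"]

print(f"SM count ratio (56/42)   : {sm_ratio:.3f}")
print(f"SMACT ratio (1.309/0.977): {smact_ratio:.3f}")
print(f"1.309 rescaled by 42/56  : {rescaled:.3f}")
```

The two ratios come out at roughly 1.333 and 1.340, and the rescaled value is about 0.982, close to the 0.977 reported for the other instance. That is compatible with both readings having been normalized against the same SM count, but confirming it would need the entity-to-instance mapping and input from the DCGM maintainers.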