Open blackjack2015 opened 2 years ago
@blackjack2015,
The DCP family of metrics (1001-1015) is not supported on RTX GPUs. The profiling module is not loaded if no supported GPU is detected.
WBR, Nik
Dear Nik,
Thanks for the prompt reply. Is support for those RTX GPUs possible or planned? We are trying to use DCGM for performance modeling research and would appreciate your help.
Best regards, Qiang Wang
By the way, I have also tried DCGM profiling on a GTX 1650 SUPER and observed the same error:
Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded
Do the GTX cards support profiling with DCGM?
Best regards, Qiang Wang
The DCP metrics are only supported on Datacenter-grade and Quadro GPUs. Neither RTX nor GTX GPUs are supported. There are no plans to support them: a hardware limitation prevents us from providing low-latency profiling on RTX and GTX GPUs.
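For scripts that wrap dcgmi, the failure quoted in this thread can be recognized from its result code rather than by matching the whole message. The following is a minimal sketch, assuming the error line always has the shape "Result: <code>: <message>" as in the output above; the names `parse_dcgmi_error`, `is_module_not_loaded`, and `MODULE_NOT_LOADED` are hypothetical helpers, not part of DCGM.

```python
import re
from typing import NamedTuple, Optional


class DcgmiError(NamedTuple):
    code: int
    message: str


# Matches lines such as:
#   Error setting watches. Result: -33: This request is serviced by ...
_RESULT_RE = re.compile(r"Result:\s*(-?\d+):\s*(.*)")

# Value taken from the error text quoted in this thread (assumption:
# the same code is reported whenever the profiling module is missing).
MODULE_NOT_LOADED = -33


def parse_dcgmi_error(line: str) -> Optional[DcgmiError]:
    """Return the numeric result code and message, or None if absent."""
    match = _RESULT_RE.search(line)
    if match is None:
        return None
    return DcgmiError(int(match.group(1)), match.group(2).strip())


def is_module_not_loaded(line: str) -> bool:
    """True when the line carries the 'module not loaded' result code."""
    err = parse_dcgmi_error(line)
    return err is not None and err.code == MODULE_NOT_LOADED
```

A wrapper could call `is_module_not_loaded` on each line of dcgmi output and skip the DCP fields (1001-1015) on hosts where it returns True, instead of retrying.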
@nikkon-dev How about the NVIDIA RTX A4000? The NVIDIA RTX series is the new generation of Quadro GPUs, but regrettably DCGM does not seem to be compatible with it. For further information, kindly check the description on NVIDIA's web page: https://www.nvidia.com/en-us/design-visualization/quadro/
@lilohuang,
Could you share the nvidia-smi -q output?
@nikkon-dev FYR. Thanks!
lilo@bokeh:~$ dcgmi dmon -e 1002
#Entity SMACT
ID
Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded
lilo@bokeh:~$ nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Sat Jul 29 16:56:53 2023
Driver Version : 535.54.03
CUDA Version : 12.2
Attached GPUs : 1
GPU 00000000:07:00.0
Product Name : NVIDIA RTX A4000
Product Brand : NVIDIA RTX
Product Architecture : Ampere
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1563221025362
GPU UUID : GPU-622d54a3-f5d7-cb2d-6d95-51f9ba06809e
Minor Number : 0
VBIOS Version : 94.04.57.00.0A
MultiGPU Board : No
Board ID : 0x700
Board Part Number : 900-5G190-2700-003
GPU Part Number : 24B0-875-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G190.0510.00.02
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x07
Device : 0x00
Domain : 0x0000
Device Id : 0x24B010DE
Bus Id : 00000000:07:00.0
Sub System Id : 0x14AD17AA
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Device Current : 1
Device Max : 4
Host Max : 3
Link Width
Max : 16x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 2
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 41 %
Performance State : P8
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 15352 MiB
Reserved : 261 MiB
Used : 1 MiB
Free : 15089 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 1 MiB
Free : 255 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 128 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 35 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 103 C
GPU Slowdown Temp : 100 C
GPU Max Operating Temp : 98 C
GPU Target Temperature : 90 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 5.07 W
Current Power Limit : 140.00 W
Requested Power Limit : 140.00 W
Default Power Limit : 140.00 W
Min Power Limit : 100.00 W
Max Power Limit : 140.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1560 MHz
Memory : 7001 MHz
Default Applications Clocks
Graphics : 1560 MHz
Memory : 7001 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 7001 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 681.250 mV
Fabric
State : N/A
Status : N/A
Processes : None
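When comparing reports like the one above, the fields that identify the card (Product Name, Product Brand, Product Architecture) are the ones that matter for the support question. The following is a minimal sketch that extracts them from `nvidia-smi -q` text, assuming the "Key : Value" layout shown in the log; `parse_smi_fields` is a hypothetical helper, and it only reports what the driver prints, not whether DCGM profiling supports the card.

```python
from typing import Dict, Iterable

DEFAULT_FIELDS = ("Product Name", "Product Brand", "Product Architecture")


def parse_smi_fields(text: str,
                     wanted: Iterable[str] = DEFAULT_FIELDS) -> Dict[str, str]:
    """Collect the first occurrence of each wanted 'Key : Value' field."""
    wanted = tuple(wanted)
    found: Dict[str, str] = {}
    for raw in text.splitlines():
        # Split on the first colon only, so values that contain colons
        # (timestamps, bus IDs) are kept whole.
        key, sep, value = raw.partition(":")
        key = key.strip()
        if sep and key in wanted and key not in found:
            found[key] = value.strip()
    return found
```

With several GPUs attached, this sketch keeps only the first GPU's values; a per-GPU version would need to split the text on the "GPU <bus id>" header lines first.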
Hi,
I have tried to monitor some fields of my GPUs (GeForce RTX 3090). The configuration is as follows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57       Driver Version: 515.57       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:18:00.0 Off |                  N/A |
| 30%   42C    P8    26W / 350W |     17MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:3B:00.0 Off |                  N/A |
| 31%   42C    P8    23W / 350W |      7MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:86:00.0 Off |                  N/A |
| 31%   42C    P8    23W / 350W |      7MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:AF:00.0 Off |                  N/A |
| 31%   43C    P8    38W / 350W |      7MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
When I ran the following command:
dcgmi dmon -d 100 -e 1011 --host 127.0.0.1:39999
the system reported "Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded".
Any suggestion? Thank you very much and looking forward to your reply!
Best regards, Qiang Wang