NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded #50

Open blackjack2015 opened 2 years ago

blackjack2015 commented 2 years ago

Hi,

I am trying to monitor some fields on my GPUs (RTX 3090). The configuration is as follows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57       Driver Version: 515.57       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:18:00.0 Off |                  N/A |
| 30%   42C    P8    26W / 350W |     17MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:3B:00.0 Off |                  N/A |
| 31%   42C    P8    23W / 350W |      7MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:86:00.0 Off |                  N/A |
| 31%   42C    P8    23W / 350W |      7MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:AF:00.0 Off |                  N/A |
| 31%   43C    P8    38W / 350W |      7MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

I then ran the following command:

dcgmi dmon -d 100 -e 1011 --host 127.0.0.1:39999

The system reported "Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded".

Any suggestions? Thank you very much; I look forward to your reply!

Best regards, Qiang Wang
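
For anyone scripting around dcgmi, the error line reported above can be split into its numeric result code and message. This is only a minimal sketch: the regular expression and the helper name `parse_dcgmi_error` are illustrative assumptions based on the single line quoted in this issue, not a documented dcgmi output format.

```python
import re

# Pattern inferred from the one error line quoted in this issue (assumption):
#   "Error setting watches. Result: -33: <message>"
ERROR_RE = re.compile(r"Result:\s*(-?\d+):\s*(.*)")


def parse_dcgmi_error(line):
    """Return (result_code, message) for a dcgmi error line, or None if no match."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


if __name__ == "__main__":
    sample = (
        "Error setting watches. Result: -33: This request is serviced by "
        "a module of DCGM that is not currently loaded"
    )
    print(parse_dcgmi_error(sample))
```

A wrapper script could use the returned code to tell "module not loaded" apart from other failures instead of matching on the full message text.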

nikkon-dev commented 2 years ago

@blackjack2015,

The DCP family of metrics (1001-1015) is not supported on RTX GPUs. The profiling module is not loaded when no supported GPUs are detected.

WBR, Nik
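
As a rough illustration of the reply above, a monitoring script could separate the requested field IDs into those in the DCP range (which need the profiling module) and everything else before calling `dcgmi dmon`. The 1001-1015 range is taken from the reply and may differ between DCGM versions; the helper names and the non-DCP IDs in the example are hypothetical and used only for illustration.

```python
# DCP field ID range as quoted in the reply above (1001-1015 inclusive).
# Assumption: other DCGM versions may define a different range.
DCP_FIELD_RANGE = range(1001, 1016)


def is_dcp_field(field_id):
    """True if the field ID falls in the DCP (profiling) range."""
    return field_id in DCP_FIELD_RANGE


def split_fields(field_ids):
    """Split field IDs into (dcp_fields, other_fields), preserving order."""
    dcp = [f for f in field_ids if is_dcp_field(f)]
    other = [f for f in field_ids if not is_dcp_field(f)]
    return dcp, other


if __name__ == "__main__":
    # 150 and 203 are arbitrary non-DCP IDs chosen for the example.
    dcp, other = split_fields([1011, 1002, 150, 203])
    print("need the profiling module:", dcp)
    print("other fields:", other)
```

On a GPU without profiling support, only the second list would be worth passing to `dcgmi dmon -e`.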

blackjack2015 commented 2 years ago

Dear Nik,

Thanks for the prompt reply. Is it possible or planned to add support for those RTX GPUs? We are trying to leverage DCGM to conduct some performance modeling research and hope to have your help.

Best regards, Qiang Wang

blackjack2015 commented 2 years ago

By the way, I have also tried DCGM profiling on a GTX 1650 SUPER and observed the same error:

Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded

Do the GTX cards support profiling with DCGM?

Best regards, Qiang Wang

nikkon-dev commented 2 years ago

The DCP metrics are only supported on datacenter-grade and Quadro GPUs. Neither RTX nor GTX GPUs are supported. There are no plans to support them, because a hardware limitation prevents us from providing low-latency profiling on RTX and GTX GPUs.

lilohuang commented 1 year ago

@nikkon-dev How about the NVIDIA RTX A4000? The NVIDIA RTX series is the new generation of Quadro GPUs, but DCGM does not seem to be compatible with it. For details, see NVIDIA's web page: https://www.nvidia.com/en-us/design-visualization/quadro/

nikkon-dev commented 1 year ago

@lilohuang,

Could you share the nvidia-smi -q output?

lilohuang commented 1 year ago

@nikkon-dev FYR. Thanks!

lilo@bokeh:~$ dcgmi dmon -e 1002
#Entity   SMACT        
ID                     
Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded

lilo@bokeh:~$ nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Sat Jul 29 16:56:53 2023
Driver Version                            : 535.54.03
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:07:00.0
    Product Name                          : NVIDIA RTX A4000
    Product Brand                         : NVIDIA RTX
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1563221025362
    GPU UUID                              : GPU-622d54a3-f5d7-cb2d-6d95-51f9ba06809e
    Minor Number                          : 0
    VBIOS Version                         : 94.04.57.00.0A
    MultiGPU Board                        : No
    Board ID                              : 0x700
    Board Part Number                     : 900-5G190-2700-003
    GPU Part Number                       : 24B0-875-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G190.0510.00.02
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x07
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x24B010DE
        Bus Id                            : 00000000:07:00.0
        Sub System Id                     : 0x14AD17AA
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : 3
            Link Width
                Max                       : 16x
                Current                   : 4x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 2
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : 41 %
    Performance State                     : P8
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 15352 MiB
        Reserved                          : 261 MiB
        Used                              : 1 MiB
        Free                              : 15089 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 1 MiB
        Free                              : 255 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 128 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 35 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 103 C
        GPU Slowdown Temp                 : 100 C
        GPU Max Operating Temp            : 98 C
        GPU Target Temperature            : 90 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : 5.07 W
        Current Power Limit               : 140.00 W
        Requested Power Limit             : 140.00 W
        Default Power Limit               : 140.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 140.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1560 MHz
        Memory                            : 7001 MHz
    Default Applications Clocks
        Graphics                          : 1560 MHz
        Memory                            : 7001 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 7001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 681.250 mV
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes                             : None
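
When comparing reports like the one above, it can help to pull the identifying fields (product name, brand, architecture) out of the `nvidia-smi -q` text. The following is a minimal sketch that assumes the "key : value" layout shown in this log; `parse_smi_fields` is a hypothetical helper, it keeps only the first occurrence of each key (nested keys such as "Current" repeat), and it says nothing about whether DCGM profiling is supported on the GPU.

```python
import re

# Matches lines of the form "    Key Name      : value" as seen in the log above.
# Lines without whitespace around the colon (e.g. "GPU 00000000:07:00.0")
# are deliberately not matched.
FIELD_RE = re.compile(r"^\s*(.+?)\s+:\s+(.*\S)\s*$")


def parse_smi_fields(text):
    """Return a dict of the first occurrence of each 'key : value' line."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            # Keep the first value only; section-level keys repeat further down.
            fields.setdefault(match.group(1), match.group(2))
    return fields


if __name__ == "__main__":
    # Excerpt copied from the nvidia-smi -q output in this thread.
    sample = """\
Timestamp                                 : Sat Jul 29 16:56:53 2023
Driver Version                            : 535.54.03
GPU 00000000:07:00.0
    Product Name                          : NVIDIA RTX A4000
    Product Brand                         : NVIDIA RTX
    Product Architecture                  : Ampere
"""
    info = parse_smi_fields(sample)
    for key in ("Product Name", "Product Brand", "Product Architecture"):
        print(f"{key}: {info.get(key)}")
```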