NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Couldn't load a definition for ShutdownPlugin in plugin libSoftware.so #120

Open nguoido opened 8 months ago

nguoido commented 8 months ago

When I run dcgmi diag -r 4, I get this issue. However, libDiagnostic.so, libSoftware.so, and the other plugins are present under /usr/share/nvidia-validation-suite/plugins/cuda12 and /usr/share/nvidia-validation-suite/plugins/cuda11. Can you help me with this issue?

2023-10-26 12:30:41.636 ERROR [4190:4190] Could not read package diag config. Please ensure the datacanter-gpu-manager-config package is installed [/workspaces/dcgm-rel_dcgm_3_2-postmerge@5/nvvs/src/ConfigFileParser_v2.cpp:218] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]
2023-10-26 12:30:41.636 ERROR [4190:4190] Exception: bad file: /usr/share/nvidia-validation-suite/diag-skus.yaml [/workspaces/dcgm-rel_dcgm_3_2-postmerge@5/nvvs/src/ConfigFileParser_v2.cpp:220] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]

2023-10-26 12:30:42.825 ERROR [4190:4190] Couldn't load a definition for ShutdownPlugin in plugin libDiagnostic.so: ./libDiagnostic.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_2-postmerge@5/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]

2023-10-26 12:30:42.826 ERROR [4190:4190] Couldn't load a definition for ShutdownPlugin in plugin libSoftware.so: ./libSoftware.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_2-postmerge@5/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]

nikkon-dev commented 8 months ago

@nguoido,

If you enable the debug logs, you should see the following message after each of those errors: "Plugin does not have a ShutdownPlugin function. This is not an error." Those are not actual errors; we will improve the module loading in the future so that new APIs do not cause such false errors.
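To make that distinction easier to see in a long log, here is a minimal Python sketch that keeps only the ERROR lines that are not ShutdownPlugin noise. It is not part of DCGM. The regular expression is inferred from the log lines pasted in this thread, and `real_errors` and `BENIGN_PREFIX` are hypothetical names chosen for illustration.

```python
import re

# Layout inferred from the logs in this thread:
# "<date> <time> <LEVEL> [pid:tid] <message> [<source>:<line>] [<function>]"
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>[\d:.]+)\s+"
    r"(?P<level>[A-Z]+)\s+\[\d+:\d+\]\s+(?P<message>.*)$"
)

# Assumption: this prefix identifies the harmless "optional symbol missing" case.
BENIGN_PREFIX = "Couldn't load a definition for ShutdownPlugin"


def real_errors(lines):
    """Return the messages of ERROR lines that are not ShutdownPlugin noise."""
    found = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None or match.group("level") != "ERROR":
            continue  # not a log line in the expected layout, or not an ERROR
        if match.group("message").startswith(BENIGN_PREFIX):
            continue  # the false error described above
        found.append(match.group("message"))
    return found
```

Passing the lines of a saved log file to `real_errors` should leave only the entries worth investigating, such as the config file errors in the original report.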

jiaxinonly commented 7 months ago
2023-12-07 16:05:54.702 DEBUG [84075:84075] No MIG devices are configured on GPU 0 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:54.748 DEBUG [84075:84075] No MIG devices are configured on GPU 1 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:54.835 DEBUG [84075:84075] No MIG devices are configured on GPU 2 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:54.923 DEBUG [84075:84075] No MIG devices are configured on GPU 3 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:55.012 DEBUG [84075:84075] No MIG devices are configured on GPU 4 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:55.099 DEBUG [84075:84075] No MIG devices are configured on GPU 5 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:55.188 DEBUG [84075:84075] No MIG devices are configured on GPU 6 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:55.275 DEBUG [84075:84075] No MIG devices are configured on GPU 7 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:55.279 DEBUG [84075:84075] The following Cuda version will be used for plugins: 12.2 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/TestFramework.cpp:217] [TestFramework::GetPluginDirExtension]
2023-12-07 16:05:55.279 DEBUG [84075:84075] Searching /usr/share/nvidia-validation-suite/plugins for plugins. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/TestFramework.cpp:187] [TestFramework::GetPluginBaseDir]
2023-12-07 16:05:55.280 DEBUG [84075:84075] Successfully loaded dlib libpluginCommon.so [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/TestFramework.cpp:270] [TestFramework::LoadLibrary]
2023-12-07 16:05:55.306 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libPcie.so: ./libPcie.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:55.306 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:55.306 WARN  [84075:84075] Tried to add parameter matrix_dim => 1024, but it already exists [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/TestParameters.cpp:215] [TestParameters::AddDouble]
2023-12-07 16:05:57.747 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libMemtest.so: ./libMemtest.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.748 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.748 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libEud.so: ./libEud.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.748 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.748 DEBUG [84075:84075] Successfully loaded dlib libpluginCommon.so [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/TestFramework.cpp:270] [TestFramework::LoadLibrary]
2023-12-07 16:05:57.748 DEBUG [84075:84075] Skipping library libcupti.so because it matches libcupti.so in the skip list. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/TestFramework.cpp:251] [TestFramework::LoadLibrary]
2023-12-07 16:05:57.748 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libTargetedPower.so: ./libTargetedPower.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.748 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.854 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libContextCreate.so: ./libContextCreate.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.854 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.855 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libPulseTest.so: ./libPulseTest.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.855 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.855 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libSoftware.so: ./libSoftware.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.855 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.856 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libMemoryBandwidth.so: ./libMemoryBandwidth.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.856 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.856 DEBUG [84075:84075] Skipping library libcurand.so because it matches libcurand.so in the skip list. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/TestFramework.cpp:251] [TestFramework::LoadLibrary]
2023-12-07 16:05:57.857 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libMemory.so: ./libMemory.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.857 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.857 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libDiagnostic.so: ./libDiagnostic.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.857 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.858 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libTargetedStress.so: ./libTargetedStress.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.858 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.894 ERROR [84075:84075] Got runtime_error: Invalid Parameter String: test 'targeed_power' does not match any loaded tests. Check logs for plugin failures.. Error code:  [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/NvvsMain.cpp:83] [OutputMainError]
2023-12-07 16:05:57.894 ERROR [84075:84075] Global error mask is: 0x00000000000000000000000000000000000000000000000000000000000000 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/NvvsMain.cpp:84] [OutputMainError]

I have the same problem: I can't run the power test with dcgmi diag -r targeted_power -p targeed_power.target_power=700.0. This command worked properly before, but after I restarted the machine it no longer works.

nikkon-dev commented 7 months ago

The ShutdownPlugin message is not an error. If you enable the debug logs, you'll see a message stating that the missing function is not an error.

The actual issue in your case is targeed_power - this is a typo in your arguments.

[84075:84075] Got runtime_error: Invalid Parameter String: test 'targeed_power' does not match any loaded tests.
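A typo like this can be caught before running the diagnostic. The following Python sketch compares the test name in front of each -p entry with the names given to -r. `mismatched_parameters` is a hypothetical helper, not part of dcgmi. It assumes that -r takes a comma-separated list of test names, that -p entries are separated by semicolons, and that the text before the first dot of an entry is the test name. Numeric run levels such as -r 4 are not checked.

```python
def mismatched_parameters(tests, parameters):
    """Return the -p entries whose test prefix is not among the -r tests.

    tests:      value passed to -r, e.g. "targeted_power"
    parameters: value passed to -p, e.g. "targeted_power.target_power=700.0"
    """
    if tests.strip().isdigit():
        # A numeric run level names no tests, so there is nothing to compare.
        return []
    requested = {name.strip() for name in tests.split(",") if name.strip()}
    bad = []
    for entry in parameters.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # Assumption: everything before the first dot is the test name.
        test_name = entry.split(".", 1)[0]
        if test_name not in requested:
            bad.append(entry)
    return bad
```

Called with the arguments from the failing command, `mismatched_parameters("targeted_power", "targeed_power.target_power=700.0")` returns the misspelled entry, while the correctly spelled parameter yields an empty list.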

jiaxinonly commented 7 months ago

Damn it, it ran normally as soon as I woke up. I'm sure I didn't type anything wrong before: I copied the command from the official website many times, and I also tried the command that had worked for me previously, but neither worked.

Finally, thank you very much for your reply. It helps me a lot. I wish you a happy life.

jiaxinonly commented 7 months ago

[Two screenshots]

It appears again after restarting. I feel like some service has not been started. I am not very familiar with this thing.

nikkon-dev commented 7 months ago

I see that the plugin fails to initialize due to an error returned from the cudaDeviceGetByPCIBusId function. Do nvidia-smi and nvidia-smi -q work on the system? This looks like a driver installation or hardware problem.
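For anyone who wants to script that check, for example to run it after every reboot, here is a rough Python sketch. `command_ok` and `driver_checks` are hypothetical helper names. The sketch only tests whether the two commands exist and exit with status 0; a zero exit status does not by itself prove that cudaDeviceGetByPCIBusId will succeed.

```python
import subprocess


def command_ok(argv, timeout=60):
    """Return True if the command can be started and exits with status 0."""
    try:
        completed = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (FileNotFoundError, PermissionError, subprocess.TimeoutExpired):
        # Missing binary, no permission to execute it, or a hang.
        return False
    return completed.returncode == 0


def driver_checks():
    """Run the two commands mentioned above and report which ones succeed."""
    commands = (["nvidia-smi"], ["nvidia-smi", "-q"])
    return {" ".join(cmd): command_ok(cmd) for cmd in commands}
```

Calling `driver_checks()` returns a dictionary keyed by command line, with a False value pointing at the command to look into first.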

jiaxinonly commented 7 months ago
root@h800:~# nvidia-smi 
Fri Dec  8 02:54:48 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H800                    On  | 00000000:1B:00.0 Off |                    0 |
| N/A   25C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H800                    On  | 00000000:1C:00.0 Off |                    0 |
| N/A   28C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H800                    On  | 00000000:41:00.0 Off |                    0 |
| N/A   27C    P0              68W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H800                    On  | 00000000:44:00.0 Off |                    0 |
| N/A   24C    P0              68W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H800                    On  | 00000000:87:00.0 Off |                    0 |
| N/A   24C    P0              68W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H800                    On  | 00000000:88:00.0 Off |                    0 |
| N/A   27C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H800                    On  | 00000000:C1:00.0 Off |                    0 |
| N/A   26C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H800                    On  | 00000000:C4:00.0 Off |                    0 |
| N/A   26C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@h800:~# nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Fri Dec  8 02:55:32 2023
Driver Version                            : 535.129.03
CUDA Version                              : 12.2

Attached GPUs                             : 8
GPU 00000000:1B:00.0
    Product Name                          : NVIDIA H800
    Product Brand                         : NVIDIA
    Product Architecture                  : Hopper
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1323523010507
    GPU UUID                              : GPU-4cc5fc69-2f7d-c781-0a5d-5394cb7feecb
    Minor Number                          : 0
    VBIOS Version                         : 96.00.61.00.0B
    MultiGPU Board                        : No
    Board ID                              : 0x1b00
    Board Part Number                     : 692-2G520-0205-000
    GPU Part Number                       : 2324-865-A1
    FRU Part Number                       : N/A
    Module ID                             : 5
    Inforom Version
        Image Version                     : G520.0205.00.02
        OEM Object                        : 2.1
        ECC Object                        : 7.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 535.129.03
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : No
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x1B
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x232410DE
        Bus Id                            : 00000000:1B:00.0
        Sub System Id                     : 0x17A610DE
        GPU Link Info
            PCIe Generation
                Max                       : 5
                Current                   : 5
                Device Current            : 5
                Device Max                : 5
                Host Max                  : 5
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 621 KB/s
        Rx Throughput                     : 574 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64 
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 81559 MiB
        Reserved                          : 551 MiB
        Used                              : 4 MiB
        Free                              : 81003 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 2560 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 25 C
        GPU T.Limit Temp                  : 61 C
        GPU Shutdown T.Limit Temp         : -5 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 32 C
        Memory Max Operating T.Limit Temp : 0 C
    GPU Power Readings
        Power Draw                        : 68.76 W
        Current Power Limit               : 700.00 W
        Requested Power Limit             : 700.00 W
        Default Power Limit               : 700.00 W
        Min Power Limit                   : 200.00 W
        Max Power Limit                   : 700.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 345 MHz
        SM                                : 345 MHz
        Memory                            : 2619 MHz
        Video                             : 765 MHz
    Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Default Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1980 MHz
        SM                                : 1980 MHz
        Memory                            : 2619 MHz
        Video                             : 1545 MHz
    Max Customer Boost Clocks
        Graphics                          : 1980 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 690.000 mV
    Fabric
        State                             : In Progress
        Status                            : N/A
    Processes                             : None

GPU 00000000:1C:00.0
    Product Name                          : NVIDIA H800
    Product Brand                         : NVIDIA
    Product Architecture                  : Hopper
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1322523024800
    GPU UUID                              : GPU-d0b8b0fd-cf1a-de3c-c8d7-8134075b696c
    Minor Number                          : 1
    VBIOS Version                         : 96.00.61.00.0B
    MultiGPU Board                        : No
    Board ID                              : 0x1c00
    Board Part Number                     : 692-2G520-0205-000
    GPU Part Number                       : 2324-865-A1
    FRU Part Number                       : N/A
    Module ID                             : 7
    Inforom Version
        Image Version                     : G520.0205.00.02
        OEM Object                        : 2.1
        ECC Object                        : 7.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 535.129.03
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : No
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x1C
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x232410DE
        Bus Id                            : 00000000:1C:00.0
        Sub System Id                     : 0x17A610DE
        GPU Link Info
            PCIe Generation
                Max                       : 5
                Current                   : 5
                Device Current            : 5
                Device Max                : 5
                Host Max                  : 5
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 617 KB/s
        Rx Throughput                     : 1351 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64 
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 81559 MiB
        Reserved                          : 551 MiB
        Used                              : 4 MiB
        Free                              : 81003 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 2560 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 28 C
        GPU T.Limit Temp                  : 59 C
        GPU Shutdown T.Limit Temp         : -5 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 34 C
        Memory Max Operating T.Limit Temp : 0 C
    GPU Power Readings
        Power Draw                        : 69.59 W
        Current Power Limit               : 700.00 W
        Requested Power Limit             : 700.00 W
        Default Power Limit               : 700.00 W
        Min Power Limit                   : 200.00 W
        Max Power Limit                   : 700.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 345 MHz
        SM                                : 345 MHz
        Memory                            : 2619 MHz
        Video                             : 765 MHz
    Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Default Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1980 MHz
        SM                                : 1980 MHz
        Memory                            : 2619 MHz
        Video                             : 1545 MHz
    Max Customer Boost Clocks
        Graphics                          : 1980 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 685.000 mV
    Fabric
        State                             : In Progress
        Status                            : N/A
    Processes                             : None

GPU 00000000:41:00.0
    Product Name                          : NVIDIA H800
    Product Brand                         : NVIDIA
    Product Architecture                  : Hopper
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1323523011088
    GPU UUID                              : GPU-08039d9e-51d5-9fd5-64c8-4333685d9304
    Minor Number                          : 2
    VBIOS Version                         : 96.00.61.00.0B
    MultiGPU Board                        : No
    Board ID                              : 0x4100
    Board Part Number                     : 692-2G520-0205-000
    GPU Part Number                       : 2324-865-A1
    FRU Part Number                       : N/A
    Module ID                             : 6
    Inforom Version
        Image Version                     : G520.0205.00.02
        OEM Object                        : 2.1
        ECC Object                        : 7.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 535.129.03
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : No
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x41
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x232410DE
        Bus Id                            : 00000000:41:00.0
        Sub System Id                     : 0x17A610DE
        GPU Link Info
            PCIe Generation
                Max                       : 5
                Current                   : 5
                Device Current            : 5
                Device Max                : 5
                Host Max                  : 5
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 601 KB/s
        Rx Throughput                     : 574 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64 
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 81559 MiB
        Reserved                          : 551 MiB
        Used                              : 4 MiB
        Free                              : 81003 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 2560 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 27 C
        GPU T.Limit Temp                  : 60 C
        GPU Shutdown T.Limit Temp         : -5 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 35 C
        Memory Max Operating T.Limit Temp : 0 C
    GPU Power Readings
        Power Draw                        : 68.42 W
        Current Power Limit               : 700.00 W
        Requested Power Limit             : 700.00 W
        Default Power Limit               : 700.00 W
        Min Power Limit                   : 200.00 W
        Max Power Limit                   : 700.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 345 MHz
        SM                                : 345 MHz
        Memory                            : 2619 MHz
        Video                             : 765 MHz
    Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Default Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1980 MHz
        SM                                : 1980 MHz
        Memory                            : 2619 MHz
        Video                             : 1545 MHz
    Max Customer Boost Clocks
        Graphics                          : 1980 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 670.000 mV
    Fabric
        State                             : In Progress
        Status                            : N/A
    Processes                             : None

GPU 00000000:44:00.0
    Product Name                          : NVIDIA H800
    Product Brand                         : NVIDIA
    Product Architecture                  : Hopper
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1322623050690
    GPU UUID                              : GPU-ba3b0dcc-278f-0e5c-a270-9abc7026266e
    Minor Number                          : 3
    VBIOS Version                         : 96.00.61.00.0B
    MultiGPU Board                        : No
    Board ID                              : 0x4400
    Board Part Number                     : 692-2G520-0205-000
    GPU Part Number                       : 2324-865-A1
    FRU Part Number                       : N/A
    Module ID                             : 8
    Inforom Version
        Image Version                     : G520.0205.00.02
        OEM Object                        : 2.1
        ECC Object                        : 7.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 535.129.03
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : No
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x44
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x232410DE
        Bus Id                            : 00000000:44:00.0
        Sub System Id                     : 0x17A610DE
        GPU Link Info
            PCIe Generation
                Max                       : 5
                Current                   : 5
                Device Current            : 5
                Device Max                : 5
                Host Max                  : 5
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 2835 KB/s
        Rx Throughput                     : 582 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64 
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 81559 MiB
        Reserved                          : 551 MiB
        Used                              : 4 MiB
        Free                              : 81003 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 2560 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 24 C
        GPU T.Limit Temp                  : 62 C
        GPU Shutdown T.Limit Temp         : -5 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 33 C
        Memory Max Operating T.Limit Temp : 0 C
    GPU Power Readings
        Power Draw                        : 68.58 W
        Current Power Limit               : 700.00 W
        Requested Power Limit             : 700.00 W
        Default Power Limit               : 700.00 W
        Min Power Limit                   : 200.00 W
        Max Power Limit                   : 700.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 345 MHz
        SM                                : 345 MHz
        Memory                            : 2619 MHz
        Video                             : 765 MHz
    Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Default Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1980 MHz
        SM                                : 1980 MHz
        Memory                            : 2619 MHz
        Video                             : 1545 MHz
    Max Customer Boost Clocks
        Graphics                          : 1980 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 690.000 mV
    Fabric
        State                             : In Progress
        Status                            : N/A
    Processes                             : None

GPU 00000000:87:00.0
    Product Name                          : NVIDIA H800
    Product Brand                         : NVIDIA
    Product Architecture                  : Hopper
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1323523010840
    GPU UUID                              : GPU-527e8fbd-3cce-2409-d888-64fac6fcdabe
    Minor Number                          : 4
    VBIOS Version                         : 96.00.61.00.0B
    MultiGPU Board                        : No
    Board ID                              : 0x8700
    Board Part Number                     : 692-2G520-0205-000
    GPU Part Number                       : 2324-865-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G520.0205.00.02
        OEM Object                        : 2.1
        ECC Object                        : 7.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 535.129.03
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : No
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x87
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x232410DE
        Bus Id                            : 00000000:87:00.0
        Sub System Id                     : 0x17A610DE
        GPU Link Info
            PCIe Generation
                Max                       : 5
                Current                   : 5
                Device Current            : 5
                Device Max                : 5
                Host Max                  : 5
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 597 KB/s
        Rx Throughput                     : 535 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64 
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 81559 MiB
        Reserved                          : 551 MiB
        Used                              : 4 MiB
        Free                              : 81003 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 2560 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 25 C
        GPU T.Limit Temp                  : 62 C
        GPU Shutdown T.Limit Temp         : -5 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 33 C
        Memory Max Operating T.Limit Temp : 0 C
    GPU Power Readings
        Power Draw                        : 68.98 W
        Current Power Limit               : 700.00 W
        Requested Power Limit             : 700.00 W
        Default Power Limit               : 700.00 W
        Min Power Limit                   : 200.00 W
        Max Power Limit                   : 700.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 345 MHz
        SM                                : 345 MHz
        Memory                            : 2619 MHz
        Video                             : 765 MHz
    Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Default Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1980 MHz
        SM                                : 1980 MHz
        Memory                            : 2619 MHz
        Video                             : 1545 MHz
    Max Customer Boost Clocks
        Graphics                          : 1980 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 675.000 mV
    Fabric
        State                             : In Progress
        Status                            : N/A
    Processes                             : None

GPU 00000000:88:00.0
    Product Name                          : NVIDIA H800
    Product Brand                         : NVIDIA
    Product Architecture                  : Hopper
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1322623040489
    GPU UUID                              : GPU-0f0efb42-03a1-99b1-560b-382b2929c01d
    Minor Number                          : 5
    VBIOS Version                         : 96.00.61.00.0B
    MultiGPU Board                        : No
    Board ID                              : 0x8800
    Board Part Number                     : 692-2G520-0205-000
    GPU Part Number                       : 2324-865-A1
    FRU Part Number                       : N/A
    Module ID                             : 3
    Inforom Version
        Image Version                     : G520.0205.00.02
        OEM Object                        : 2.1
        ECC Object                        : 7.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 535.129.03
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : No
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x88
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x232410DE
        Bus Id                            : 00000000:88:00.0
        Sub System Id                     : 0x17A610DE
        GPU Link Info
            PCIe Generation
                Max                       : 5
                Current                   : 5
                Device Current            : 5
                Device Max                : 5
                Host Max                  : 5
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 593 KB/s
        Rx Throughput                     : 523 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64 
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 81559 MiB
        Reserved                          : 551 MiB
        Used                              : 4 MiB
        Free                              : 81003 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 2560 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 27 C
        GPU T.Limit Temp                  : 60 C
        GPU Shutdown T.Limit Temp         : -5 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 34 C
        Memory Max Operating T.Limit Temp : 0 C
    GPU Power Readings
        Power Draw                        : 70.02 W
        Current Power Limit               : 700.00 W
        Requested Power Limit             : 700.00 W
        Default Power Limit               : 700.00 W
        Min Power Limit                   : 200.00 W
        Max Power Limit                   : 700.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 345 MHz
        SM                                : 345 MHz
        Memory                            : 2619 MHz
        Video                             : 765 MHz
    Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Default Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1980 MHz
        SM                                : 1980 MHz
        Memory                            : 2619 MHz
        Video                             : 1545 MHz
    Max Customer Boost Clocks
        Graphics                          : 1980 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 685.000 mV
    Fabric
        State                             : In Progress
        Status                            : N/A
    Processes                             : None

GPU 00000000:C1:00.0
    Product Name                          : NVIDIA H800
    Product Brand                         : NVIDIA
    Product Architecture                  : Hopper
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1323523010167
    GPU UUID                              : GPU-fde99785-24eb-af85-128f-75d1fa90f9f9
    Minor Number                          : 6
    VBIOS Version                         : 96.00.61.00.0B
    MultiGPU Board                        : No
    Board ID                              : 0xc100
    Board Part Number                     : 692-2G520-0205-000
    GPU Part Number                       : 2324-865-A1
    FRU Part Number                       : N/A
    Module ID                             : 2
    Inforom Version
        Image Version                     : G520.0205.00.02
        OEM Object                        : 2.1
        ECC Object                        : 7.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 535.129.03
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : No
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0xC1
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x232410DE
        Bus Id                            : 00000000:C1:00.0
        Sub System Id                     : 0x17A610DE
        GPU Link Info
            PCIe Generation
                Max                       : 5
                Current                   : 5
                Device Current            : 5
                Device Max                : 5
                Host Max                  : 5
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 589 KB/s
        Rx Throughput                     : 542 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64 
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 81559 MiB
        Reserved                          : 551 MiB
        Used                              : 4 MiB
        Free                              : 81003 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 2560 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 26 C
        GPU T.Limit Temp                  : 60 C
        GPU Shutdown T.Limit Temp         : -5 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 34 C
        Memory Max Operating T.Limit Temp : 0 C
    GPU Power Readings
        Power Draw                        : 69.59 W
        Current Power Limit               : 700.00 W
        Requested Power Limit             : 700.00 W
        Default Power Limit               : 700.00 W
        Min Power Limit                   : 200.00 W
        Max Power Limit                   : 700.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 345 MHz
        SM                                : 345 MHz
        Memory                            : 2619 MHz
        Video                             : 765 MHz
    Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Default Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1980 MHz
        SM                                : 1980 MHz
        Memory                            : 2619 MHz
        Video                             : 1545 MHz
    Max Customer Boost Clocks
        Graphics                          : 1980 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 685.000 mV
    Fabric
        State                             : In Progress
        Status                            : N/A
    Processes                             : None

GPU 00000000:C4:00.0
    Product Name                          : NVIDIA H800
    Product Brand                         : NVIDIA
    Product Architecture                  : Hopper
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1323523010216
    GPU UUID                              : GPU-a481af5c-b615-6090-8889-3bd3c1b85d14
    Minor Number                          : 7
    VBIOS Version                         : 96.00.61.00.0B
    MultiGPU Board                        : No
    Board ID                              : 0xc400
    Board Part Number                     : 692-2G520-0205-000
    GPU Part Number                       : 2324-865-A1
    FRU Part Number                       : N/A
    Module ID                             : 4
    Inforom Version
        Image Version                     : G520.0205.00.02
        OEM Object                        : 2.1
        ECC Object                        : 7.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 535.129.03
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : No
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0xC4
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x232410DE
        Bus Id                            : 00000000:C4:00.0
        Sub System Id                     : 0x17A610DE
        GPU Link Info
            PCIe Generation
                Max                       : 5
                Current                   : 5
                Device Current            : 5
                Device Max                : 5
                Host Max                  : 5
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 589 KB/s
        Rx Throughput                     : 578 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64 
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 81559 MiB
        Reserved                          : 551 MiB
        Used                              : 4 MiB
        Free                              : 81003 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 2560 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 26 C
        GPU T.Limit Temp                  : 60 C
        GPU Shutdown T.Limit Temp         : -5 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 35 C
        Memory Max Operating T.Limit Temp : 0 C
    GPU Power Readings
        Power Draw                        : 69.65 W
        Current Power Limit               : 700.00 W
        Requested Power Limit             : 700.00 W
        Default Power Limit               : 700.00 W
        Min Power Limit                   : 200.00 W
        Max Power Limit                   : 700.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 345 MHz
        SM                                : 345 MHz
        Memory                            : 2619 MHz
        Video                             : 765 MHz
    Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Default Applications Clocks
        Graphics                          : 1980 MHz
        Memory                            : 2619 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1980 MHz
        SM                                : 1980 MHz
        Memory                            : 2619 MHz
        Video                             : 1545 MHz
    Max Customer Boost Clocks
        Graphics                          : 1980 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 685.000 mV
    Fabric
        State                             : In Progress
        Status                            : N/A
    Processes                             : None
jiaxinonly commented 7 months ago

A magical thing happened: it can run again, and I barely did anything. I suspect it was caused by some services taking too long to start. The odd part is that my server is a brand-new H800 machine with no other services running, so it should not be a performance problem. Since the startup takes this long, I think there should be a sleep function.

nikkon-dev commented 7 months ago

Can you please check the dmesg logs and see if there is any information about the nvswitches? If there is, can you tell me how long it took to retrain them?

jiaxinonly commented 7 months ago

It takes about half an hour before it can be used normally. The server is currently in operation, so it is inconvenient to reproduce the problem.

jiaxinonly commented 7 months ago

I am doing some server cooling tests that need to run continuously. It seems that the dcgmi power test can only run for 8 hours.

dcgmi diag -r targeted_power -p targeted_power.test_duration=864000,targeted_power.target_power=700.0
Error: Unable to complete diagnostic for group 2147483647. Return: (-11) Timeout.
Error: Could not stop the launched diagnostic.

This is my monitoring data: image

nikkon-dev commented 6 months ago

@jiaxinonly,

Could you provide debug logs for nv-hostengine and nvvs for the timeout issue? You may need to restart nv-hostengine with -f host.debug.log --log-level debug and then run dcgmi diag -d DEBUG --debugLogFile diag.debug.log ....

It might be worth trying to run nvvs directly, which is often found in /usr/share/nvidia-validation-suite/nvvs, to see if the timeout occurs.

jiaxinonly commented 6 months ago

I will try my best to find time to test again and provide the log when the time comes.

nikkon-dev commented 6 months ago

@jiaxinonly,

The dcgmi diag has a hardcoded 8-hour timeout in the communication protocol.

Alternatively, you may use dcgmi diag --iterations N to restart the diagnostic sequence N times.

jiaxinonly commented 6 months ago

Okay, thank you for your help. I ran the diagnostic in a script loop and achieved the same effect. In addition, the logs from the nv-hostengine -f host.debug.log --log-level debug command and the dcgmi diag -r targeted_power -p targeted_power.target_power=300.0 -d DEBUG --debugLogFile diag.debug.log command are provided here.

diag.debug.log host.debug.log

jiaxinonly commented 6 months ago
root@h800:~# dmesg | grep nvswitch
[   19.988348] nvidia-nvswitch: Probing device 0000:03:00.0, Vendor Id = 0x10de, Device Id = 0x22a3, Class = 0x68000 
[   19.990588] nvidia-nvswitch 0000:03:00.0: enabling device (0140 -> 0142)
[   21.003607] nvidia-nvswitch0: using MSI
[   26.493658] nvidia-nvswitch: Probing device 0000:04:00.0, Vendor Id = 0x10de, Device Id = 0x22a3, Class = 0x68000 
[   26.494856] nvidia-nvswitch 0000:04:00.0: enabling device (0140 -> 0142)
[   27.511546] nvidia-nvswitch1: using MSI
[   32.988655] nvidia-nvswitch: Probing device 0000:05:00.0, Vendor Id = 0x10de, Device Id = 0x22a3, Class = 0x68000 
[   32.989173] nvidia-nvswitch 0000:05:00.0: enabling device (0140 -> 0142)
[   33.999224] nvidia-nvswitch2: using MSI
[   39.495645] nvidia-nvswitch: Probing device 0000:06:00.0, Vendor Id = 0x10de, Device Id = 0x22a3, Class = 0x68000 
[   39.495971] nvidia-nvswitch 0000:06:00.0: enabling device (0140 -> 0142)
[   40.500469] nvidia-nvswitch3: using MSI
[   79.711708] nvidia-nvswitch1: open (major=509)
[   79.870101] nvidia-nvswitch3: open (major=509)
[   79.995964] nvidia-nvswitch2: open (major=509)
[   80.155076] nvidia-nvswitch0: open (major=509)
[  340.295360] nvidia-nvswitch0: release (major=509)
[  340.296102] nvidia-nvswitch2: release (major=509)
[  340.296775] nvidia-nvswitch3: release (major=509)
[  340.297128] nvidia-nvswitch1: release (major=509)
[  341.018299] nvidia-nvswitch1: open (major=509)
[  341.176787] nvidia-nvswitch3: open (major=509)
[  341.302842] nvidia-nvswitch2: open (major=509)
[  341.461760] nvidia-nvswitch0: open (major=509)
[  848.297912] nvidia-nvswitch0: release (major=509)
[  848.298641] nvidia-nvswitch2: release (major=509)
[  848.299371] nvidia-nvswitch3: release (major=509)
[  848.300091] nvidia-nvswitch1: release (major=509)
[ 1032.657668] nvidia-nvswitch1: open (major=509)
[ 1032.815956] nvidia-nvswitch3: open (major=509)
[ 1032.941899] nvidia-nvswitch2: open (major=509)
[ 1033.100662] nvidia-nvswitch0: open (major=509)
[ 1171.953468] nvidia-nvswitch0: release (major=509)
[ 1171.954148] nvidia-nvswitch2: release (major=509)
[ 1171.954776] nvidia-nvswitch3: release (major=509)
[ 1171.955174] nvidia-nvswitch1: release (major=509)
[ 1172.731207] nvidia-nvswitch1: open (major=509)
[ 1172.878793] nvidia-nvswitch3: open (major=509)
[ 1172.996268] nvidia-nvswitch2: open (major=509)
[ 1173.142339] nvidia-nvswitch0: open (major=509)
[ 1287.135077] nvidia-nvswitch0: release (major=509)
[ 1287.135753] nvidia-nvswitch2: release (major=509)
[ 1287.136386] nvidia-nvswitch3: release (major=509)
[ 1287.136878] nvidia-nvswitch1: release (major=509)
[ 1287.826520] nvidia-nvswitch1: open (major=509)
[ 1287.972503] nvidia-nvswitch3: open (major=509)
[ 1288.088632] nvidia-nvswitch2: open (major=509)
[ 1288.234621] nvidia-nvswitch0: open (major=509)
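
To read timing out of a log like the one above, the open and release events can be paired per switch. The following is a minimal sketch, assuming the dmesg line format shown in this comment; the function name open_intervals and the regular expression are illustrative and not part of DCGM or the NVIDIA driver. It reports only how long each nvswitch device was held open, which is not necessarily the link retraining time asked about earlier.

```python
import re
from collections import defaultdict

# Matches lines such as "[   79.711708] nvidia-nvswitch1: open (major=509)".
# Probe, "enabling device" and "using MSI" lines are ignored on purpose.
LINE_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+nvidia-nvswitch(\d+): (open|release)\b"
)


def open_intervals(dmesg_text):
    """Pair open/release events per switch index.

    Returns {switch_index: [(open_ts, release_ts_or_None), ...]}, where
    timestamps are seconds since boot and None marks a device still open.
    """
    intervals = defaultdict(list)
    pending = {}  # switch index -> timestamp of the unmatched open event
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        ts = float(match.group(1))
        idx = int(match.group(2))
        event = match.group(3)
        if event == "open":
            pending[idx] = ts
        elif idx in pending:
            intervals[idx].append((pending.pop(idx), ts))
    # Anything left in pending was opened but not released in this log.
    for idx, ts in pending.items():
        intervals[idx].append((ts, None))
    return dict(intervals)


if __name__ == "__main__":
    import sys

    for idx, spans in sorted(open_intervals(sys.stdin.read()).items()):
        for start, end in spans:
            held = "still open" if end is None else f"{end - start:.1f} s"
            print(f"nvswitch{idx}: opened at {start:.1f} s, {held}")
```

If the sketch is saved under a hypothetical name such as nvswitch_intervals.py, it could be fed with dmesg | grep nvswitch | python3 nvswitch_intervals.py.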
jiaxinonly commented 6 months ago

After testing, the --iterations parameter does not allow continuous execution either, because the program still exits on the overall timeout.

root@h800:~# dcgmi diag -r targeted_power -p targeted_power.test_duration=864000,targeted_power.target_power=700.0 --iterations 8

Running iteration 1 of 8...
Error: Unable to complete diagnostic for group 2147483647. Return: (-11) Timeout.
Error: Could not stop the launched diagnostic.
Aborting the iterative runs of the diagnostic due to failure: Timeout
nikkon-dev commented 6 months ago

@jiaxinonly,

Please keep in mind that the parameter targeted_power.test_duration=864000 sets the duration of each test to ten days (864000 seconds), while the timeout is 8 hours (28800 seconds). The value should therefore not exceed 28800. If you prefer, you can use the default value for the test duration and instead set --iterations to a large number of your choice, for example --iterations=100500 (100500 is just a placeholder for an arbitrarily large number).
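
The arithmetic behind this advice can be sketched as follows. This is a minimal sketch, assuming the 8-hour limit equals 28800 seconds as stated above and that it applies to each iteration separately, which this thread does not confirm. The helper names plan_runs and build_command, and the 3600-second default per run, are hypothetical and chosen only for illustration. Only the dcgmi options already quoted in this thread are used.

```python
import math

# 8 hours in seconds: the limit quoted in this thread.
DIAG_TIMEOUT_SECONDS = 8 * 60 * 60  # 28800


def plan_runs(total_seconds, per_run_seconds=DIAG_TIMEOUT_SECONDS):
    """Split a desired total run time into runs that each fit the limit.

    Returns (per_run_seconds, iterations).
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 < per_run_seconds <= DIAG_TIMEOUT_SECONDS:
        raise ValueError("per_run_seconds must be in (0, 28800]")
    per_run = min(per_run_seconds, total_seconds)
    iterations = math.ceil(total_seconds / per_run)
    return per_run, iterations


def build_command(total_seconds, target_power=700.0, per_run_seconds=3600):
    """Build a dcgmi command line covering total_seconds.

    The 3600 s default is an arbitrary choice that leaves headroom
    below the limit; it is not a DCGM recommendation.
    """
    per_run, iterations = plan_runs(total_seconds, per_run_seconds)
    params = (
        f"targeted_power.test_duration={per_run},"
        f"targeted_power.target_power={target_power}"
    )
    return [
        "dcgmi", "diag", "-r", "targeted_power",
        "-p", params,
        "--iterations", str(iterations),
    ]


if __name__ == "__main__":
    # Ten days of load, as in the original request.
    print(" ".join(build_command(864000)))
```

With these assumptions, ten days at the full 28800 seconds per run works out to 30 iterations, and at 3600 seconds per run to 240 iterations. Each run also carries some setup overhead, so a per-run duration well below the limit is the safer choice.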

jiaxinonly commented 6 months ago

Yeah, I didn't notice that. Thanks again for your reply.