Open nguoido opened 8 months ago
@nguoido,
If you enable the debug logs, you should see the following message after those errors: "Plugin does not have a ShutdownPlugin function. This is not an error."
Those are not actual errors; we will improve the module loading in the future so that new APIs do not cause such false errors.
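Until the module loading is improved, one way to read such a log is to drop the benign ShutdownPlugin lookups and keep only the remaining ERROR lines. The following is a minimal sketch, not part of DCGM: the helper name `real_errors` is hypothetical, and the log format is assumed from the excerpts posted in this thread.

```python
import re

# Assumption: the benign lines look like the ones quoted in this thread,
# i.e. an ERROR entry that ends in "undefined symbol: ShutdownPlugin".
BENIGN_ERROR = re.compile(r"\bERROR\b.*undefined symbol: ShutdownPlugin")
ANY_ERROR = re.compile(r"\bERROR\b")


def real_errors(log_lines):
    """Return ERROR lines, skipping the benign ShutdownPlugin lookups."""
    return [
        line
        for line in log_lines
        if ANY_ERROR.search(line) and not BENIGN_ERROR.search(line)
    ]


# Three lines shortened from the log in this thread.
sample = [
    "2023-12-07 16:05:55.306 ERROR [84075:84075] Couldn't load a definition for "
    "ShutdownPlugin in plugin libPcie.so: ./libPcie.so: undefined symbol: ShutdownPlugin",
    "2023-12-07 16:05:55.306 DEBUG [84075:84075] Plugin does not have a "
    "ShutdownPlugin function. This is not an error.",
    "2023-12-07 16:05:57.894 ERROR [84075:84075] Got runtime_error: Invalid Parameter "
    "String: test 'targeed_power' does not match any loaded tests.",
]

for line in real_errors(sample):
    print(line)
```

With the sample above, only the runtime_error line is printed, which is the line worth investigating.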
2023-12-07 16:05:54.702 DEBUG [84075:84075] No MIG devices are configured on GPU 0 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:54.748 DEBUG [84075:84075] No MIG devices are configured on GPU 1 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:54.835 DEBUG [84075:84075] No MIG devices are configured on GPU 2 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:54.923 DEBUG [84075:84075] No MIG devices are configured on GPU 3 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:55.012 DEBUG [84075:84075] No MIG devices are configured on GPU 4 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:55.099 DEBUG [84075:84075] No MIG devices are configured on GPU 5 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:55.188 DEBUG [84075:84075] No MIG devices are configured on GPU 6 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:55.275 DEBUG [84075:84075] No MIG devices are configured on GPU 7 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/Gpu.cpp:183] [Gpu::IsMigModeDiagCompatible]
2023-12-07 16:05:55.279 DEBUG [84075:84075] The following Cuda version will be used for plugins: 12.2 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/TestFramework.cpp:217] [TestFramework::GetPluginDirExtension]
2023-12-07 16:05:55.279 DEBUG [84075:84075] Searching /usr/share/nvidia-validation-suite/plugins for plugins. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/TestFramework.cpp:187] [TestFramework::GetPluginBaseDir]
2023-12-07 16:05:55.280 DEBUG [84075:84075] Successfully loaded dlib libpluginCommon.so [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/TestFramework.cpp:270] [TestFramework::LoadLibrary]
2023-12-07 16:05:55.306 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libPcie.so: ./libPcie.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:55.306 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:55.306 WARN [84075:84075] Tried to add parameter matrix_dim => 1024, but it already exists [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/TestParameters.cpp:215] [TestParameters::AddDouble]
2023-12-07 16:05:57.747 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libMemtest.so: ./libMemtest.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.748 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.748 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libEud.so: ./libEud.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.748 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.748 DEBUG [84075:84075] Successfully loaded dlib libpluginCommon.so [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/TestFramework.cpp:270] [TestFramework::LoadLibrary]
2023-12-07 16:05:57.748 DEBUG [84075:84075] Skipping library libcupti.so because it matches libcupti.so in the skip list. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/TestFramework.cpp:251] [TestFramework::LoadLibrary]
2023-12-07 16:05:57.748 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libTargetedPower.so: ./libTargetedPower.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.748 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.854 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libContextCreate.so: ./libContextCreate.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.854 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.855 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libPulseTest.so: ./libPulseTest.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.855 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.855 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libSoftware.so: ./libSoftware.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.855 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.856 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libMemoryBandwidth.so: ./libMemoryBandwidth.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.856 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.856 DEBUG [84075:84075] Skipping library libcurand.so because it matches libcurand.so in the skip list. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/TestFramework.cpp:251] [TestFramework::LoadLibrary]
2023-12-07 16:05:57.857 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libMemory.so: ./libMemory.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.857 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.857 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libDiagnostic.so: ./libDiagnostic.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.857 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.858 ERROR [84075:84075] Couldn't load a definition for ShutdownPlugin in plugin libTargetedStress.so: ./libTargetedStress.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-12-07 16:05:57.858 DEBUG [84075:84075] Plugin does not have a ShutdownPlugin function. This is not an error. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/PluginLib.cpp:172] [PluginLib::LoadPlugin]
2023-12-07 16:05:57.894 ERROR [84075:84075] Got runtime_error: Invalid Parameter String: test 'targeed_power' does not match any loaded tests. Check logs for plugin failures.. Error code: [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/NvvsMain.cpp:83] [OutputMainError]
2023-12-07 16:05:57.894 ERROR [84075:84075] Global error mask is: 0x00000000000000000000000000000000000000000000000000000000000000 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/NvvsMain.cpp:84] [OutputMainError]
I have the same problem. I can't run a power test:
dcgmi diag -r targeted_power -p targeed_power.target_power=700.0
This command tested fine before, but after I restarted the machine it no longer works.
The ShutdownPlugin message is not an error: if you enable the debug logs, you'll see a message saying that the missing function is not an error.
The actual issue in your case is targeed_power, which is a typo in your arguments (the test is named targeted_power).
[84075:84075] Got runtime_error: Invalid Parameter String: test 'targeed_power' does not match any loaded tests.
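A typo like this can be caught before running the diagnostic by checking that every test name used in the -p value also appears in the -r value. The sketch below is only an illustration: `unknown_param_tests` is a hypothetical helper, not a DCGM function, and it assumes that -r holds comma-separated test names and that -p holds semicolon-separated items shaped like test.param=value.

```python
def unknown_param_tests(run_tests, params):
    """Return test-name prefixes used in -p that were not requested via -r.

    run_tests: value passed to -r (assumed comma-separated test names)
    params:    value passed to -p (assumed semicolon-separated test.param=value items)
    """
    requested = {name.strip() for name in run_tests.split(",") if name.strip()}
    unknown = []
    for item in params.split(";"):
        item = item.strip()
        if not item:
            continue
        # Everything before the first dot is treated as the test name.
        test_name = item.split(".", 1)[0]
        if test_name not in requested:
            unknown.append(test_name)
    return unknown


# The arguments from the command in this thread.
print(unknown_param_tests("targeted_power", "targeed_power.target_power=700.0"))
# prints ['targeed_power']
```

This check only makes sense when -r lists test names; if -r is given a numeric run level instead, every -p prefix would be reported.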
Damn it, it ran normally as soon as I woke up. I'm sure I didn't type anything wrong before: I copied the command from the official website many times and also tried the command that had worked for me earlier, but it didn't work.
Finally, thank you very much for your reply. It helps me a lot. I wish you a happy life.
It appears again after restarting. I feel like some service has not been started. I am not very familiar with this thing.
I see that the plugin fails to initialize due to the error returned from the cudaDeviceGetByPCIBusId function.
Do nvidia-smi and nvidia-smi -q work on the system? That looks like a driver installation or a hardware problem.
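If it is easier to script, the two checks can be run together. This is a rough sketch under the assumption that nvidia-smi is on PATH; `command_works` is a hypothetical helper, and an exit status of 0 only shows that the tool ran, not that every GPU is healthy.

```python
import subprocess


def command_works(argv, timeout=60):
    """Return True if the command can be started and exits with status 0."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # OSError covers a missing or non-executable binary.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    for argv in (["nvidia-smi"], ["nvidia-smi", "-q"]):
        status = "ok" if command_works(argv) else "FAILED"
        print(" ".join(argv), "->", status)
```

Posting the full output of both commands is still more useful than a pass/fail result, since the details are what help narrow down a driver or hardware problem.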
root@h800:~# nvidia-smi
Fri Dec 8 02:54:48 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H800 On | 00000000:1B:00.0 Off | 0 |
| N/A 25C P0 69W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H800 On | 00000000:1C:00.0 Off | 0 |
| N/A 28C P0 69W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H800 On | 00000000:41:00.0 Off | 0 |
| N/A 27C P0 68W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H800 On | 00000000:44:00.0 Off | 0 |
| N/A 24C P0 68W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA H800 On | 00000000:87:00.0 Off | 0 |
| N/A 24C P0 68W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA H800 On | 00000000:88:00.0 Off | 0 |
| N/A 27C P0 70W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA H800 On | 00000000:C1:00.0 Off | 0 |
| N/A 26C P0 69W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA H800 On | 00000000:C4:00.0 Off | 0 |
| N/A 26C P0 69W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
root@h800:~# nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Fri Dec 8 02:55:32 2023
Driver Version : 535.129.03
CUDA Version : 12.2
Attached GPUs : 8
GPU 00000000:1B:00.0
Product Name : NVIDIA H800
Product Brand : NVIDIA
Product Architecture : Hopper
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323523010507
GPU UUID : GPU-4cc5fc69-2f7d-c781-0a5d-5394cb7feecb
Minor Number : 0
VBIOS Version : 96.00.61.00.0B
MultiGPU Board : No
Board ID : 0x1b00
Board Part Number : 692-2G520-0205-000
GPU Part Number : 2324-865-A1
FRU Part Number : N/A
Module ID : 5
Inforom Version
Image Version : G520.0205.00.02
OEM Object : 2.1
ECC Object : 7.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x1B
Device : 0x00
Domain : 0x0000
Device Id : 0x232410DE
Bus Id : 00000000:1B:00.0
Sub System Id : 0x17A610DE
GPU Link Info
PCIe Generation
Max : 5
Current : 5
Device Current : 5
Device Max : 5
Host Max : 5
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 621 KB/s
Rx Throughput : 574 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 81559 MiB
Reserved : 551 MiB
Used : 4 MiB
Free : 81003 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 1 MiB
Free : 131071 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 2560 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 25 C
GPU T.Limit Temp : 61 C
GPU Shutdown T.Limit Temp : -5 C
GPU Slowdown T.Limit Temp : -2 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : 32 C
Memory Max Operating T.Limit Temp : 0 C
GPU Power Readings
Power Draw : 68.76 W
Current Power Limit : 700.00 W
Requested Power Limit : 700.00 W
Default Power Limit : 700.00 W
Min Power Limit : 200.00 W
Max Power Limit : 700.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 345 MHz
SM : 345 MHz
Memory : 2619 MHz
Video : 765 MHz
Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Default Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1980 MHz
SM : 1980 MHz
Memory : 2619 MHz
Video : 1545 MHz
Max Customer Boost Clocks
Graphics : 1980 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 690.000 mV
Fabric
State : In Progress
Status : N/A
Processes : None
GPU 00000000:1C:00.0
Product Name : NVIDIA H800
Product Brand : NVIDIA
Product Architecture : Hopper
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322523024800
GPU UUID : GPU-d0b8b0fd-cf1a-de3c-c8d7-8134075b696c
Minor Number : 1
VBIOS Version : 96.00.61.00.0B
MultiGPU Board : No
Board ID : 0x1c00
Board Part Number : 692-2G520-0205-000
GPU Part Number : 2324-865-A1
FRU Part Number : N/A
Module ID : 7
Inforom Version
Image Version : G520.0205.00.02
OEM Object : 2.1
ECC Object : 7.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x1C
Device : 0x00
Domain : 0x0000
Device Id : 0x232410DE
Bus Id : 00000000:1C:00.0
Sub System Id : 0x17A610DE
GPU Link Info
PCIe Generation
Max : 5
Current : 5
Device Current : 5
Device Max : 5
Host Max : 5
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 617 KB/s
Rx Throughput : 1351 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 81559 MiB
Reserved : 551 MiB
Used : 4 MiB
Free : 81003 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 1 MiB
Free : 131071 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 2560 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 28 C
GPU T.Limit Temp : 59 C
GPU Shutdown T.Limit Temp : -5 C
GPU Slowdown T.Limit Temp : -2 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : 34 C
Memory Max Operating T.Limit Temp : 0 C
GPU Power Readings
Power Draw : 69.59 W
Current Power Limit : 700.00 W
Requested Power Limit : 700.00 W
Default Power Limit : 700.00 W
Min Power Limit : 200.00 W
Max Power Limit : 700.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 345 MHz
SM : 345 MHz
Memory : 2619 MHz
Video : 765 MHz
Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Default Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1980 MHz
SM : 1980 MHz
Memory : 2619 MHz
Video : 1545 MHz
Max Customer Boost Clocks
Graphics : 1980 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 685.000 mV
Fabric
State : In Progress
Status : N/A
Processes : None
GPU 00000000:41:00.0
Product Name : NVIDIA H800
Product Brand : NVIDIA
Product Architecture : Hopper
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323523011088
GPU UUID : GPU-08039d9e-51d5-9fd5-64c8-4333685d9304
Minor Number : 2
VBIOS Version : 96.00.61.00.0B
MultiGPU Board : No
Board ID : 0x4100
Board Part Number : 692-2G520-0205-000
GPU Part Number : 2324-865-A1
FRU Part Number : N/A
Module ID : 6
Inforom Version
Image Version : G520.0205.00.02
OEM Object : 2.1
ECC Object : 7.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x41
Device : 0x00
Domain : 0x0000
Device Id : 0x232410DE
Bus Id : 00000000:41:00.0
Sub System Id : 0x17A610DE
GPU Link Info
PCIe Generation
Max : 5
Current : 5
Device Current : 5
Device Max : 5
Host Max : 5
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 601 KB/s
Rx Throughput : 574 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 81559 MiB
Reserved : 551 MiB
Used : 4 MiB
Free : 81003 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 1 MiB
Free : 131071 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 2560 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 27 C
GPU T.Limit Temp : 60 C
GPU Shutdown T.Limit Temp : -5 C
GPU Slowdown T.Limit Temp : -2 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : 35 C
Memory Max Operating T.Limit Temp : 0 C
GPU Power Readings
Power Draw : 68.42 W
Current Power Limit : 700.00 W
Requested Power Limit : 700.00 W
Default Power Limit : 700.00 W
Min Power Limit : 200.00 W
Max Power Limit : 700.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 345 MHz
SM : 345 MHz
Memory : 2619 MHz
Video : 765 MHz
Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Default Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1980 MHz
SM : 1980 MHz
Memory : 2619 MHz
Video : 1545 MHz
Max Customer Boost Clocks
Graphics : 1980 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 670.000 mV
Fabric
State : In Progress
Status : N/A
Processes : None
GPU 00000000:44:00.0
Product Name : NVIDIA H800
Product Brand : NVIDIA
Product Architecture : Hopper
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322623050690
GPU UUID : GPU-ba3b0dcc-278f-0e5c-a270-9abc7026266e
Minor Number : 3
VBIOS Version : 96.00.61.00.0B
MultiGPU Board : No
Board ID : 0x4400
Board Part Number : 692-2G520-0205-000
GPU Part Number : 2324-865-A1
FRU Part Number : N/A
Module ID : 8
Inforom Version
Image Version : G520.0205.00.02
OEM Object : 2.1
ECC Object : 7.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x44
Device : 0x00
Domain : 0x0000
Device Id : 0x232410DE
Bus Id : 00000000:44:00.0
Sub System Id : 0x17A610DE
GPU Link Info
PCIe Generation
Max : 5
Current : 5
Device Current : 5
Device Max : 5
Host Max : 5
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 2835 KB/s
Rx Throughput : 582 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 81559 MiB
Reserved : 551 MiB
Used : 4 MiB
Free : 81003 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 1 MiB
Free : 131071 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 2560 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 24 C
GPU T.Limit Temp : 62 C
GPU Shutdown T.Limit Temp : -5 C
GPU Slowdown T.Limit Temp : -2 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : 33 C
Memory Max Operating T.Limit Temp : 0 C
GPU Power Readings
Power Draw : 68.58 W
Current Power Limit : 700.00 W
Requested Power Limit : 700.00 W
Default Power Limit : 700.00 W
Min Power Limit : 200.00 W
Max Power Limit : 700.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 345 MHz
SM : 345 MHz
Memory : 2619 MHz
Video : 765 MHz
Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Default Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1980 MHz
SM : 1980 MHz
Memory : 2619 MHz
Video : 1545 MHz
Max Customer Boost Clocks
Graphics : 1980 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 690.000 mV
Fabric
State : In Progress
Status : N/A
Processes : None
GPU 00000000:87:00.0
Product Name : NVIDIA H800
Product Brand : NVIDIA
Product Architecture : Hopper
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323523010840
GPU UUID : GPU-527e8fbd-3cce-2409-d888-64fac6fcdabe
Minor Number : 4
VBIOS Version : 96.00.61.00.0B
MultiGPU Board : No
Board ID : 0x8700
Board Part Number : 692-2G520-0205-000
GPU Part Number : 2324-865-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G520.0205.00.02
OEM Object : 2.1
ECC Object : 7.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x87
Device : 0x00
Domain : 0x0000
Device Id : 0x232410DE
Bus Id : 00000000:87:00.0
Sub System Id : 0x17A610DE
GPU Link Info
PCIe Generation
Max : 5
Current : 5
Device Current : 5
Device Max : 5
Host Max : 5
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 597 KB/s
Rx Throughput : 535 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 81559 MiB
Reserved : 551 MiB
Used : 4 MiB
Free : 81003 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 1 MiB
Free : 131071 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 2560 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 25 C
GPU T.Limit Temp : 62 C
GPU Shutdown T.Limit Temp : -5 C
GPU Slowdown T.Limit Temp : -2 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : 33 C
Memory Max Operating T.Limit Temp : 0 C
GPU Power Readings
Power Draw : 68.98 W
Current Power Limit : 700.00 W
Requested Power Limit : 700.00 W
Default Power Limit : 700.00 W
Min Power Limit : 200.00 W
Max Power Limit : 700.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 345 MHz
SM : 345 MHz
Memory : 2619 MHz
Video : 765 MHz
Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Default Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1980 MHz
SM : 1980 MHz
Memory : 2619 MHz
Video : 1545 MHz
Max Customer Boost Clocks
Graphics : 1980 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 675.000 mV
Fabric
State : In Progress
Status : N/A
Processes : None
GPU 00000000:88:00.0
Product Name : NVIDIA H800
Product Brand : NVIDIA
Product Architecture : Hopper
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1322623040489
GPU UUID : GPU-0f0efb42-03a1-99b1-560b-382b2929c01d
Minor Number : 5
VBIOS Version : 96.00.61.00.0B
MultiGPU Board : No
Board ID : 0x8800
Board Part Number : 692-2G520-0205-000
GPU Part Number : 2324-865-A1
FRU Part Number : N/A
Module ID : 3
Inforom Version
Image Version : G520.0205.00.02
OEM Object : 2.1
ECC Object : 7.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x88
Device : 0x00
Domain : 0x0000
Device Id : 0x232410DE
Bus Id : 00000000:88:00.0
Sub System Id : 0x17A610DE
GPU Link Info
PCIe Generation
Max : 5
Current : 5
Device Current : 5
Device Max : 5
Host Max : 5
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 593 KB/s
Rx Throughput : 523 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 81559 MiB
Reserved : 551 MiB
Used : 4 MiB
Free : 81003 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 1 MiB
Free : 131071 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 2560 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 27 C
GPU T.Limit Temp : 60 C
GPU Shutdown T.Limit Temp : -5 C
GPU Slowdown T.Limit Temp : -2 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : 34 C
Memory Max Operating T.Limit Temp : 0 C
GPU Power Readings
Power Draw : 70.02 W
Current Power Limit : 700.00 W
Requested Power Limit : 700.00 W
Default Power Limit : 700.00 W
Min Power Limit : 200.00 W
Max Power Limit : 700.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 345 MHz
SM : 345 MHz
Memory : 2619 MHz
Video : 765 MHz
Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Default Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1980 MHz
SM : 1980 MHz
Memory : 2619 MHz
Video : 1545 MHz
Max Customer Boost Clocks
Graphics : 1980 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 685.000 mV
Fabric
State : In Progress
Status : N/A
Processes : None
GPU 00000000:C1:00.0
Product Name : NVIDIA H800
Product Brand : NVIDIA
Product Architecture : Hopper
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323523010167
GPU UUID : GPU-fde99785-24eb-af85-128f-75d1fa90f9f9
Minor Number : 6
VBIOS Version : 96.00.61.00.0B
MultiGPU Board : No
Board ID : 0xc100
Board Part Number : 692-2G520-0205-000
GPU Part Number : 2324-865-A1
FRU Part Number : N/A
Module ID : 2
Inforom Version
Image Version : G520.0205.00.02
OEM Object : 2.1
ECC Object : 7.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xC1
Device : 0x00
Domain : 0x0000
Device Id : 0x232410DE
Bus Id : 00000000:C1:00.0
Sub System Id : 0x17A610DE
GPU Link Info
PCIe Generation
Max : 5
Current : 5
Device Current : 5
Device Max : 5
Host Max : 5
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 589 KB/s
Rx Throughput : 542 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 81559 MiB
Reserved : 551 MiB
Used : 4 MiB
Free : 81003 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 1 MiB
Free : 131071 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 2560 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 26 C
GPU T.Limit Temp : 60 C
GPU Shutdown T.Limit Temp : -5 C
GPU Slowdown T.Limit Temp : -2 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : 34 C
Memory Max Operating T.Limit Temp : 0 C
GPU Power Readings
Power Draw : 69.59 W
Current Power Limit : 700.00 W
Requested Power Limit : 700.00 W
Default Power Limit : 700.00 W
Min Power Limit : 200.00 W
Max Power Limit : 700.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 345 MHz
SM : 345 MHz
Memory : 2619 MHz
Video : 765 MHz
Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Default Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1980 MHz
SM : 1980 MHz
Memory : 2619 MHz
Video : 1545 MHz
Max Customer Boost Clocks
Graphics : 1980 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 685.000 mV
Fabric
State : In Progress
Status : N/A
Processes : None
GPU 00000000:C4:00.0
Product Name : NVIDIA H800
Product Brand : NVIDIA
Product Architecture : Hopper
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323523010216
GPU UUID : GPU-a481af5c-b615-6090-8889-3bd3c1b85d14
Minor Number : 7
VBIOS Version : 96.00.61.00.0B
MultiGPU Board : No
Board ID : 0xc400
Board Part Number : 692-2G520-0205-000
GPU Part Number : 2324-865-A1
FRU Part Number : N/A
Module ID : 4
Inforom Version
Image Version : G520.0205.00.02
OEM Object : 2.1
ECC Object : 7.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.129.03
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : No
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xC4
Device : 0x00
Domain : 0x0000
Device Id : 0x232410DE
Bus Id : 00000000:C4:00.0
Sub System Id : 0x17A610DE
GPU Link Info
PCIe Generation
Max : 5
Current : 5
Device Current : 5
Device Max : 5
Host Max : 5
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 589 KB/s
Rx Throughput : 578 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 81559 MiB
Reserved : 551 MiB
Used : 4 MiB
Free : 81003 MiB
BAR1 Memory Usage
Total : 131072 MiB
Used : 1 MiB
Free : 131071 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 2560 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 26 C
GPU T.Limit Temp : 60 C
GPU Shutdown T.Limit Temp : -5 C
GPU Slowdown T.Limit Temp : -2 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : 35 C
Memory Max Operating T.Limit Temp : 0 C
GPU Power Readings
Power Draw : 69.65 W
Current Power Limit : 700.00 W
Requested Power Limit : 700.00 W
Default Power Limit : 700.00 W
Min Power Limit : 200.00 W
Max Power Limit : 700.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 345 MHz
SM : 345 MHz
Memory : 2619 MHz
Video : 765 MHz
Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Default Applications Clocks
Graphics : 1980 MHz
Memory : 2619 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1980 MHz
SM : 1980 MHz
Memory : 2619 MHz
Video : 1545 MHz
Max Customer Boost Clocks
Graphics : 1980 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 685.000 mV
Fabric
State : In Progress
Status : N/A
Processes : None
Strangely, it started working again, and I barely changed anything. I suspect it is caused by some services taking too long to start. However, this server is a brand new H800 machine with no other services running, so it should not be a performance problem. Startup simply takes too long; I think there should be a sleep (wait) before the diagnostic starts.
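Instead of a fixed sleep, one option is to wait until the GPUs stop reporting Fabric State "In Progress", as they do in the nvidia-smi -q output above. The sketch below covers only the parsing half: it takes the text of nvidia-smi -q and returns the Fabric State per GPU. Treating "Completed" as the ready value is an assumption on my part (the output above only shows "In Progress"), and the function names are made up for illustration.

```python
import re

# Matches section headers such as "GPU 00000000:C4:00.0" but not
# lines like "GPU Link Info" or "GPU UUID : GPU-...".
GPU_HEADER = re.compile(
    r"^GPU ([0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])$"
)


def fabric_states(report: str) -> dict[str, str]:
    """Map each GPU bus id to the 'State' value found right after its 'Fabric' line."""
    states: dict[str, str] = {}
    gpu = None
    after_fabric = False
    for raw in report.splitlines():
        line = raw.strip()
        header = GPU_HEADER.match(line)
        if header:
            gpu = header.group(1)
            after_fabric = False
        elif line == "Fabric":
            after_fabric = True
        elif after_fabric:
            key, sep, value = line.partition(":")
            if sep and key.strip() == "State" and gpu is not None:
                states[gpu] = value.strip()
            after_fabric = False
    return states


def gpus_not_ready(states: dict[str, str], ready_value: str = "Completed") -> list[str]:
    """Return the GPUs whose fabric state differs from ready_value (assumed value)."""
    return [gpu for gpu, state in states.items() if state != ready_value]
```

A wrapper script could call this on fresh nvidia-smi -q output every few seconds and start dcgmi diag only once gpus_not_ready returns an empty list.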
Can you please check the dmesg logs and see if there is any information about the nvswitches? If there is, can you tell me how long it took to retrain them?
It takes about half an hour before it can be used normally. The server is currently in use, so it is inconvenient to reproduce the problem.
I am doing some server cooling tests that need to run continuously. It seems that the dcgmi targeted power test can only run for 8 hours.
dcgmi diag -r targeted_power -p targeted_power.test_duration=864000,targeted_power.target_power=700.0
Error: Unable to complete diagnostic for group 2147483647. Return: (-11) Timeout.
Error: Could not stop the launched diagnostic.
This is my monitoring data:
@jiaxinonly,
Could you provide debug logs for nv-hostengine and nvvs for the timeout issue? You may need to rerun the nv-hostengine with the -f host.debug.log --log-level debug and run dcgmi diag -d DEBUG --debugLogFile diag.debug.log ...
It might be worth trying to run nvvs directly, which is often found in /usr/share/nvidia-validation-suite/nvvs, to see if the timeout occurs.
I will try my best to find time to test again and provide the log when the time comes.
@jiaxinonly,
The dcgmi diag has a hardcoded 8-hour timeout in the communication protocol.
Alternatively, you may use dcgmi diag --iterations N to restart the diagnostic sequence N times.
Okay, thank you for your help. I ran the command in a loop from a script and achieved the same effect.
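For anyone who wants to do the same, a loop of this kind could look roughly like the sketch below. It is not the script that was actually used. The helper name run_repeatedly is invented, and the test_duration of 3600 in the commented usage is just an example value below the 8-hour limit; the dcgmi arguments themselves are the ones already shown in this thread.

```python
import subprocess
from typing import Callable, Sequence


def run_repeatedly(
    cmd: Sequence[str],
    rounds: int,
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> int:
    """Run cmd up to `rounds` times, stopping at the first non-zero exit code.

    Returns the number of runs that exited with code 0.
    """
    completed = 0
    for _ in range(rounds):
        if run(cmd) != 0:
            break  # stop looping as soon as one run fails
        completed += 1
    return completed


# Example usage (commented out so that importing this file does not start a test):
# run_repeatedly(
#     ["dcgmi", "diag", "-r", "targeted_power",
#      "-p", "targeted_power.test_duration=3600,targeted_power.target_power=700.0"],
#     rounds=240,
# )
```

The run parameter exists only so the loop can be tried out with a stand-in function before pointing it at real hardware.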
In addition, here are the logs from the nv-hostengine -f host.debug.log --log-level debug command and the dcgmi diag -r targeted_power -p targeted_power.target_power=300.0 -d DEBUG --debugLogFile diag.debug.log command.
root@h800:~# dmesg | grep nvswitch
[ 19.988348] nvidia-nvswitch: Probing device 0000:03:00.0, Vendor Id = 0x10de, Device Id = 0x22a3, Class = 0x68000
[ 19.990588] nvidia-nvswitch 0000:03:00.0: enabling device (0140 -> 0142)
[ 21.003607] nvidia-nvswitch0: using MSI
[ 26.493658] nvidia-nvswitch: Probing device 0000:04:00.0, Vendor Id = 0x10de, Device Id = 0x22a3, Class = 0x68000
[ 26.494856] nvidia-nvswitch 0000:04:00.0: enabling device (0140 -> 0142)
[ 27.511546] nvidia-nvswitch1: using MSI
[ 32.988655] nvidia-nvswitch: Probing device 0000:05:00.0, Vendor Id = 0x10de, Device Id = 0x22a3, Class = 0x68000
[ 32.989173] nvidia-nvswitch 0000:05:00.0: enabling device (0140 -> 0142)
[ 33.999224] nvidia-nvswitch2: using MSI
[ 39.495645] nvidia-nvswitch: Probing device 0000:06:00.0, Vendor Id = 0x10de, Device Id = 0x22a3, Class = 0x68000
[ 39.495971] nvidia-nvswitch 0000:06:00.0: enabling device (0140 -> 0142)
[ 40.500469] nvidia-nvswitch3: using MSI
[ 79.711708] nvidia-nvswitch1: open (major=509)
[ 79.870101] nvidia-nvswitch3: open (major=509)
[ 79.995964] nvidia-nvswitch2: open (major=509)
[ 80.155076] nvidia-nvswitch0: open (major=509)
[ 340.295360] nvidia-nvswitch0: release (major=509)
[ 340.296102] nvidia-nvswitch2: release (major=509)
[ 340.296775] nvidia-nvswitch3: release (major=509)
[ 340.297128] nvidia-nvswitch1: release (major=509)
[ 341.018299] nvidia-nvswitch1: open (major=509)
[ 341.176787] nvidia-nvswitch3: open (major=509)
[ 341.302842] nvidia-nvswitch2: open (major=509)
[ 341.461760] nvidia-nvswitch0: open (major=509)
[ 848.297912] nvidia-nvswitch0: release (major=509)
[ 848.298641] nvidia-nvswitch2: release (major=509)
[ 848.299371] nvidia-nvswitch3: release (major=509)
[ 848.300091] nvidia-nvswitch1: release (major=509)
[ 1032.657668] nvidia-nvswitch1: open (major=509)
[ 1032.815956] nvidia-nvswitch3: open (major=509)
[ 1032.941899] nvidia-nvswitch2: open (major=509)
[ 1033.100662] nvidia-nvswitch0: open (major=509)
[ 1171.953468] nvidia-nvswitch0: release (major=509)
[ 1171.954148] nvidia-nvswitch2: release (major=509)
[ 1171.954776] nvidia-nvswitch3: release (major=509)
[ 1171.955174] nvidia-nvswitch1: release (major=509)
[ 1172.731207] nvidia-nvswitch1: open (major=509)
[ 1172.878793] nvidia-nvswitch3: open (major=509)
[ 1172.996268] nvidia-nvswitch2: open (major=509)
[ 1173.142339] nvidia-nvswitch0: open (major=509)
[ 1287.135077] nvidia-nvswitch0: release (major=509)
[ 1287.135753] nvidia-nvswitch2: release (major=509)
[ 1287.136386] nvidia-nvswitch3: release (major=509)
[ 1287.136878] nvidia-nvswitch1: release (major=509)
[ 1287.826520] nvidia-nvswitch1: open (major=509)
[ 1287.972503] nvidia-nvswitch3: open (major=509)
[ 1288.088632] nvidia-nvswitch2: open (major=509)
[ 1288.234621] nvidia-nvswitch0: open (major=509)
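The excerpt above contains only probe, open and release events, so it does not directly say how long retraining took. As a rough proxy, the time each switch device stayed open can be computed from the timestamps. This is a minimal sketch that assumes the line format shown above; the function name is made up.

```python
import re
from collections import defaultdict

# Example line: "[ 79.711708] nvidia-nvswitch1: open (major=509)"
EVENT = re.compile(r"\[\s*(\d+\.\d+)\]\s+nvidia-nvswitch(\d+):\s+(open|release)\b")


def open_durations(dmesg_text: str) -> dict[int, list[float]]:
    """Return, per switch index, the seconds between each open and the next release."""
    last_open: dict[int, float] = {}
    durations: dict[int, list[float]] = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = EVENT.search(line)
        if not match:
            continue  # probe, "enabling device" and "using MSI" lines are skipped
        timestamp = float(match.group(1))
        switch = int(match.group(2))
        if match.group(3) == "open":
            last_open[switch] = timestamp
        elif switch in last_open:
            durations[switch].append(timestamp - last_open.pop(switch))
    return dict(durations)
```

An open without a later release (such as the last four lines above) is ignored, since that session had not ended when the log was captured.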
@jiaxinonly,
The dcgmi diag has a hardcoded 8-hour timeout in the communication protocol.
Alternatively, you may use dcgmi diag --iterations N to restart the diagnostic sequence N times.
After testing, the --iterations parameter does not allow continuous execution either, because the program exits when it hits the overall timeout.
root@h800:~# dcgmi diag -r targeted_power -p targeted_power.test_duration=864000,targeted_power.target_power=700.0 --iterations 8
Running iteration 1 of 8...
Error: Unable to complete diagnostic for group 2147483647. Return: (-11) Timeout.
Error: Could not stop the launched diagnostic.
Aborting the iterative runs of the diagnostic due to failure: Timeout
@jiaxinonly,
Please keep in mind that the parameter targeted_power.test_duration=864000 sets the duration of each test to ten days, while the diagnostic times out after 8 hours. This value should therefore not exceed 28800 (8 hours in seconds). If you prefer, you can use the default value for the test duration and instead set --iterations to a large number of your choice, for example --iterations=100500 (100500 is just a placeholder for any arbitrary big number).
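To make the arithmetic concrete: ten days is 864000 seconds, so with a per-test duration at or below the 28800-second limit, the number of iterations is the total time divided by the per-test duration, rounded up. The sketch below builds the argument list from that. It assumes the timeout applies to each iteration separately, as the reply above suggests; the function names and the 3600-second example are my own choices, while the flags are the ones used earlier in this thread.

```python
import math

MAX_TEST_DURATION_S = 8 * 60 * 60  # 28800, the 8-hour limit mentioned above


def plan_iterations(total_s: int, per_run_s: int) -> int:
    """Number of iterations needed to cover total_s with runs of per_run_s seconds."""
    if not 0 < per_run_s <= MAX_TEST_DURATION_S:
        raise ValueError(f"per_run_s must be between 1 and {MAX_TEST_DURATION_S}")
    return math.ceil(total_s / per_run_s)


def diag_args(total_s: int, per_run_s: int, target_power_w: float) -> list[str]:
    """Build a dcgmi argument list for the targeted_power test (illustrative helper)."""
    iterations = plan_iterations(total_s, per_run_s)
    params = (
        f"targeted_power.test_duration={per_run_s},"
        f"targeted_power.target_power={target_power_w}"
    )
    return ["dcgmi", "diag", "-r", "targeted_power", "-p", params,
            "--iterations", str(iterations)]
```

For example, diag_args(864000, 3600, 700.0) asks for 240 one-hour iterations. Staying well below 28800 per run leaves some margin for setup time, though how much margin is actually needed is not stated in this thread.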
Yeah, I didn't notice that. Thanks again for your reply
When I run dcgmi diag -r 4, I get the errors below. However, I can find libDiagnostic.so, libSoftware.so, ... in /usr/share/nvidia-validation-suite/plugins/cuda12 and /usr/share/nvidia-validation-suite/plugins/cuda11. Can you help me with this issue?
2023-10-26 12:30:41.636 ERROR [4190:4190] Could not read package diag config. Please ensure the datacanter-gpu-manager-config package is installed [/workspaces/dcgm-rel_dcgm_3_2-postmerge@5/nvvs/src/ConfigFileParser_v2.cpp:218] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]
2023-10-26 12:30:41.636 ERROR [4190:4190] Exception: bad file: /usr/share/nvidia-validation-suite/diag-skus.yaml [/workspaces/dcgm-rel_dcgm_3_2-postmerge@5/nvvs/src/ConfigFileParser_v2.cpp:220] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]
2023-10-26 12:30:42.825 ERROR [4190:4190] Couldn't load a definition for ShutdownPlugin in plugin libDiagnostic.so: ./libDiagnostic.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_2-postmerge@5/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]
2023-10-26 12:30:42.826 ERROR [4190:4190] Couldn't load a definition for ShutdownPlugin in plugin libSoftware.so: ./libSoftware.so: undefined symbol: ShutdownPlugin [/workspaces/dcgm-rel_dcgm_3_2-postmerge@5/nvvs/src/PluginLib.cpp:237] [PluginLib::LoadFunction]