NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

dcgm diag pcie test hangs indefinitely for H100 80GB HBM3 #136

Closed by disjustin 6 months ago

disjustin commented 7 months ago

I'm running into a problem on an NVIDIA H100 80GB HBM3 system where dcgmi diag hangs indefinitely. The debug log contains several errors, and I can't confirm whether they are related. I'm including two dcgmi diag log files, one from a run level 4 test and one from a pcie-only test:

dcgmi diag -r 4

nvvs.log

dcgmi diag -r pcie --debugLogFile /mnt/diag.log

diag.log
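Since the pcie test can hang indefinitely, it may help to run the diagnostic under a time limit when reproducing the problem. The following is a minimal Python sketch, not part of DCGM: the helper name run_with_timeout and the 600-second limit are illustrative assumptions, and it assumes dcgmi is on the PATH.

```python
import subprocess

# Same command as above, split into an argument list.
DIAG_CMD = ["dcgmi", "diag", "-r", "pcie", "--debugLogFile", "/mnt/diag.log"]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it ran longer than timeout_s."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child process before raising.
        # Helper processes started elsewhere may still need separate cleanup.
        return None


# Hypothetical usage (600 seconds is an arbitrary limit):
#   code = run_with_timeout(DIAG_CMD, 600)
#   print("possible hang" if code is None else f"exit code {code}")
```

A None result only says the command did not finish within the limit; it does not explain why.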

Start of the diag debug log file:

2023-11-29 16:27:36.609 ERROR [74919:74919] Could not read package diag config. Please ensure the datacanter-gpu-manager-config package is installed [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/ConfigFileParser_v2.cpp:218] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]
2023-11-29 16:27:36.610 ERROR [74919:74919] Exception: bad file: /usr/share/nvidia-validation-suite/diag-skus.yaml [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/ConfigFileParser_v2.cpp:220] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]
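To sift through a long debug log, the lines can be split into fields and filtered by level. This is a rough Python sketch based only on the layout of the lines shown here; the field names are my own, and reading the bracketed number pair as process and thread IDs is an assumption.

```python
import re

# Layout inferred from the excerpt:
# timestamp LEVEL [id:id] message [source_file:line] [function]
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<pid>\d+):(?P<tid>\d+)\]\s+"  # assumed to be pid:tid
    r"(?P<message>.*?)\s+"
    r"\[(?P<source>[^\[\]]+):(?P<line>\d+)\]\s+"
    r"\[(?P<function>[^\[\]]+)\]$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


# One line copied from the excerpt above.
SAMPLE = (
    "2023-11-29 16:27:36.610 ERROR [74919:74919] Exception: bad file: "
    "/usr/share/nvidia-validation-suite/diag-skus.yaml "
    "[/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/ConfigFileParser_v2.cpp:220] "
    "[DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]"
)

record = parse_line(SAMPLE)
print(record["level"], record["message"])
```

Keeping only records whose level is ERROR makes it easier to see which messages come from config parsing and which come from the pcie plugin.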

121 - Not sure what to make of this issue, but here's a list of installed packages: rpm_qa.txt

End of the diag debug log file. Why does the program hang after printing these messages? Could this be related to NVLink vs. PCIe?

2023-11-29 16:28:46.895 DEBUG [75153:75153] [[pcie]] For now, binding to NUMA nodes only runs in pinned mode. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:976] [outputHostDeviceBandwidthMatrix]
2023-11-29 16:28:48.547 INFO  [75153:75153] [[pcie]] cudaDeviceDisablePeerAccess for device (0) returned error (705): peer access has not been enabled [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:167] [disableP2P]
2023-11-29 16:28:48.549 INFO  [75153:75153] [[pcie]] cudaDeviceDisablePeerAccess for device (1) returned error (705): peer access has not been enabled [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:167] [disableP2P]
2023-11-29 16:28:48.551 INFO  [75153:75153] [[pcie]] cudaDeviceDisablePeerAccess for device (2) returned error (705): peer access has not been enabled [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:167] [disableP2P]
2023-11-29 16:28:48.551 INFO  [75153:75153] [[pcie]] cudaDeviceDisablePeerAccess for device (3) returned error (705): peer access has not been enabled [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:167] [disableP2P]
2023-11-29 16:28:48.554 INFO  [75153:75153] [[pcie]] cudaDeviceDisablePeerAccess for device (4) returned error (705): peer access has not been enabled [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:167] [disableP2P]
2023-11-29 16:28:48.555 INFO  [75153:75153] [[pcie]] cudaDeviceDisablePeerAccess for device (5) returned error (705): peer access has not been enabled [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:167] [disableP2P]
2023-11-29 16:28:48.557 INFO  [75153:75153] [[pcie]] cudaDeviceDisablePeerAccess for device (6) returned error (705): peer access has not been enabled [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:167] [disableP2P]
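The lines above cover device indices 0 through 6 only, while the resolution further down the thread notes that this is an 8-GPU system. A small Python sketch for checking which indices appear in a log is shown below; the function name missing_devices and the expected count are assumptions supplied by the reader, not something DCGM reports.

```python
import re

# Matches the disableP2P messages shown in the excerpt above.
DEVICE_RE = re.compile(r"cudaDeviceDisablePeerAccess for device \((\d+)\)")


def missing_devices(log_text, expected_count):
    """Return device indices in range(expected_count) that never appear in log_text."""
    seen = {int(m.group(1)) for m in DEVICE_RE.finditer(log_text)}
    return sorted(set(range(expected_count)) - seen)


# Rebuild the relevant part of the excerpt: devices 0 through 6.
sample = "\n".join(
    f"INFO [[pcie]] cudaDeviceDisablePeerAccess for device ({i}) "
    "returned error (705): peer access has not been enabled"
    for i in range(7)
)

print(missing_devices(sample, 8))  # prints [7]
```

An absent index only shows that the log never reached that device. It does not identify which physical GPU is missing, so comparing against the device list reported by the driver is still needed.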

Thank you ahead of time!

disjustin commented 6 months ago

Issue resolved in #137 in release v3.3.1. This particular 8-GPU system was missing one GPU. From DCGM Release Notes: