I'm encountering a problem on an NVIDIA H100 80GB HBM3 system where the program hangs indefinitely. The debug log shows several errors, and I can't tell whether they are related to the hang. I'm including two DCGM diag log files, one from a run level 4 test and one from a PCIe-only test:
2023-11-29 16:27:36.609 ERROR [74919:74919] Could not read package diag config. Please ensure the datacanter-gpu-manager-config package is installed [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/ConfigFileParser_v2.cpp:218] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]
2023-11-29 16:27:36.610 ERROR [74919:74919] Exception: bad file: /usr/share/nvidia-validation-suite/diag-skus.yaml [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/src/ConfigFileParser_v2.cpp:220] [DcgmNs::Nvvs::ConfigFileParser_v2::ConfigFileParser_v2]
Not sure what to make of these errors, but here's a list of installed packages: rpm_qa.txt
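To narrow down the "bad file" exception, here is a minimal sketch of a check I could run on the affected host. It only reports whether the path named in the log exists, is readable, and is non-empty. I'm assuming the exception means the file is missing or unreadable, which I haven't confirmed; `describe_file` is a helper name I made up for this post, not part of DCGM.

```python
import os

# Path copied verbatim from the "bad file" exception in the log above.
DIAG_SKUS_PATH = "/usr/share/nvidia-validation-suite/diag-skus.yaml"


def describe_file(path):
    """Return basic facts about a file: existence, readability, size in bytes.

    Hypothetical helper for this post; it does not parse the YAML or call DCGM.
    """
    exists = os.path.isfile(path)
    return {
        "path": path,
        "exists": exists,
        # Only meaningful when the file exists; False otherwise.
        "readable": exists and os.access(path, os.R_OK),
        "size": os.path.getsize(path) if exists else 0,
    }


if __name__ == "__main__":
    print(describe_file(DIAG_SKUS_PATH))
```

If the file turns out to be missing, that would at least be consistent with the first error asking for the config package to be installed, but it would not by itself explain the hang.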
End of the diag debug log file. Why does the program hang after printing these messages? Could this be related to NVLink vs. PCIe?
2023-11-29 16:28:46.895 DEBUG [75153:75153] [[pcie]] For now, binding to NUMA nodes only runs in pinned mode. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:976] [outputHostDeviceBandwidthMatrix]
2023-11-29 16:28:48.547 INFO [75153:75153] [[pcie]] cudaDeviceDisablePeerAccess for device (0) returned error (705): peer access has not been enabled [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:167] [disableP2P]
2023-11-29 16:28:48.549 INFO [75153:75153] [[pcie]] cudaDeviceDisablePeerAccess for device (1) returned error (705): peer access has not been enabled [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:167] [disableP2P]
2023-11-29 16:28:48.551 INFO [75153:75153] [[pcie]] cudaDeviceDisablePeerAccess for device (2) returned error (705): peer access has not been enabled [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:167] [disableP2P]
2023-11-29 16:28:48.551 INFO [75153:75153] [[pcie]] cudaDeviceDisablePeerAccess for device (3) returned error (705): peer access has not been enabled [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:167] [disableP2P]
2023-11-29 16:28:48.554 INFO [75153:75153] [[pcie]] cudaDeviceDisablePeerAccess for device (4) returned error (705): peer access has not been enabled [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:167] [disableP2P]
2023-11-29 16:28:48.555 INFO [75153:75153] [[pcie]] cudaDeviceDisablePeerAccess for device (5) returned error (705): peer access has not been enabled [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:167] [disableP2P]
2023-11-29 16:28:48.557 INFO [75153:75153] [[pcie]] cudaDeviceDisablePeerAccess for device (6) returned error (705): peer access has not been enabled [/workspaces/dcgm-rel_dcgm_3_3-postmerge/nvvs/plugin_src/pcie/PcieMain.cpp:167] [disableP2P]
Attached log files: nvvs.log and diag.log. The two ERROR lines quoted first are from the start of the diag debug log.
Thank you ahead of time!