SomePersonSomeWhereInTheWorld opened this issue 3 months ago
@SomePersonSomeWhereInTheWorld,
Can you confirm if the system has NvSwitches and if the correct version of the libnvidia-nscq package is installed?
Well we load this as a module that Nvidia Bright Computing supplies. All I see is:
find /cm/local/apps/cuda-dcgm -name '*vswitch*'
/cm/local/apps/cuda-dcgm/3.1.3.1/lib64/libdcgmmodulenvswitch.so
/cm/local/apps/cuda-dcgm/3.1.3.1/lib64/libdcgmmodulenvswitch.so.3
/cm/local/apps/cuda-dcgm/3.1.3.1/lib64/libdcgmmodulenvswitch.so.3.1.3
And no sign of libnvidia-nscq.
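For reference, a quick sketch of how one might check whether the library is visible to the dynamic linker at all (this assumes a Linux box with ldconfig on the PATH; the library name is taken from the package discussed in this thread):

```shell
# Check whether the dynamic linker can see libnvidia-nscq anywhere.
# If lspci-style tooling differs on your distro, the idea is the same:
# look for the shared library in the linker cache.
if ldconfig -p 2>/dev/null | grep -qi libnvidia-nscq; then
  nscq_status="libnvidia-nscq is visible to the linker"
else
  nscq_status="libnvidia-nscq NOT found (install the package matching your driver)"
fi
echo "$nscq_status"
```

Either message makes the situation explicit instead of relying on a silent empty grep.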
@SomePersonSomeWhereInTheWorld,
libnvidia-nscq is not part of DCGM; it is a library required for NVSwitch / Fabric Manager to work correctly. You can find a suitable package, for example, here: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/
I'm asking because, on a system without NVSwitches, those log entries should not be written more than once. But if DCGM detects NVSwitches in the system, it tries to enumerate them again and again; without a proper nscq library (whose version must precisely match the installed driver), initialization fails, and the error log keeps growing.
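As a sketch, one way to check whether the box actually has NVSwitch hardware is to look at the PCI bus (this assumes pciutils/lspci is installed, and the grep pattern is an assumption about how the devices are labeled, not an official check):

```shell
# Look for NVSwitch devices on the PCI bus. On systems without them,
# DCGM should not be trying to enumerate switches at all.
if lspci 2>/dev/null | grep -qi nvswitch; then
  switch_status="NVSwitch devices detected on the PCI bus"
else
  switch_status="no NVSwitch devices detected on the PCI bus"
fi
echo "$switch_status"
```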
We're on RHEL, so I see this yum packaging page.
We are on NVIDIA-SMI 530.30.02, Driver Version 530.30.02, CUDA Version 12.1. I don't see a version 530.30.02 in the tar archives. Or is there a different driver version you are referring to?
@SomePersonSomeWhereInTheWorld
Could you try this? https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/libnvidia-nscq-530-530.30.02-1.x86_64.rpm
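For what it's worth, the file name appears to encode both the driver branch and the exact driver version. A hypothetical sketch of composing the name from the installed driver version (the pattern is inferred from the repo listing above, not from any official naming spec):

```shell
# Hypothetical: derive the expected nscq RPM name from the driver version.
# The "-1.x86_64.rpm" suffix and branch component are inferred from the
# repo listing, not an official naming convention.
driver_ver="530.30.02"
branch="${driver_ver%%.*}"   # major version, e.g. 530
echo "libnvidia-nscq-${branch}-${driver_ver}-1.x86_64.rpm"
# -> libnvidia-nscq-530-530.30.02-1.x86_64.rpm
```

The point being: the nscq package version must match the installed driver version exactly.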
Hi, 530.30.02 shipped bundled with CUDA, so the binary archive tarball is here: https://developer.download.nvidia.com/compute/cuda/redist/libnvidia_nscq/linux-x86_64/
OK the rpm worked! What's the proper configuration now that it's installed? Can you point me to some instructions ideally for RHEL?
@SomePersonSomeWhereInTheWorld,
The instructions are the same. You did not find the tarball initially because the compute/nvidia-driver location only has TRD drivers, and your installed driver is a developer driver that ships only with the CUDA SDK (so you need to get tarballs from compute/cuda instead). I will check whether the documentation should be updated to make this clearer.
Using:
The 'cm' stands for "Cluster Manager", as in Nvidia Bright Computing (now called Base Command). The /var/log/nv-hostengine.log is filling up with these entries every few seconds:
2024-04-02 12:54:06.828 ERROR [1264985:1264994] [[NvSwitch]] Not attached to NvSwitches. Aborting [/workspaces/dcgm-rel_dcgm_3_1-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:967] [DcgmNs::DcgmNvSwitchManager::ReadNvSwitchStatusAllSwitches]
In /etc/dcgm.env we have:
__DCGM_DBG_LVL=NONE
That seems to have quieted these logs:
ERROR [5450:5462] Got more than DCGM_MAX_CLOCKS supported clocks. [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:11130] [DcgmCacheManager::AppendDeviceSupportedClocks]
These are the same errors from this DCGM Exporter bug.