NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

log spam of [[NvSwitch]] Not attached to NvSwitches. Aborting in cuda-dcgm-3.1.3.1 via Bright Cluster, RHEL 8 #159

Open SomePersonSomeWhereInTheWorld opened 3 months ago

SomePersonSomeWhereInTheWorld commented 3 months ago

Using:

cuda-dcgm-libs-3.1.3.1-198_cm9.2.x86_64
cuda-dcgm-nvvs-3.1.3.1-198_cm9.2.x86_64
cuda-dcgm-3.1.3.1-198_cm9.2.x86_64

The 'cm' stands for "Cluster Manager" as in Nvidia Bright Computing (now called Base Command).

The /var/log/nv-hostengine.log is filling up with these entries every few seconds:

2024-04-02 12:54:06.828 ERROR [1264985:1264994] [[NvSwitch]] Not attached to NvSwitches. Aborting [/workspaces/dcgm-rel_dcgm_3_1-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:967] [DcgmNs::DcgmNvSwitchManager::ReadNvSwitchStatusAllSwitches]

In /etc/dcgm.env we have: __DCGM_DBG_LVL=NONE

That seems to have quieted these logs:

ERROR [5450:5462] Got more than DCGM_MAX_CLOCKS supported clocks. [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:11130] [DcgmCacheManager::AppendDeviceSupportedClocks]

These are the same errors from this DCGM Exporter bug.
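
For context, a minimal sketch of how one might gauge the spam rate and confirm the current debug setting; only the paths already mentioned above are used, and the commands are illustrative rather than part of the original report:

# How many times the NvSwitch error has been logged so far
grep -c "Not attached to NvSwitches" /var/log/nv-hostengine.log

# The debug level currently set for nv-hostengine
cat /etc/dcgm.env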

nikkon-dev commented 3 months ago

@SomePersonSomeWhereInTheWorld,

Can you confirm if the system has NvSwitches and if the correct version of the libnvidia-nscq package is installed?
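
A minimal sketch of how one might answer both questions on the node; it assumes dcgmi, lspci, and rpm are on the PATH, and the grep patterns are only illustrative:

# List the entities DCGM can see; on NVSwitch systems the switches are reported alongside the GPUs
dcgmi discovery -l

# NvSwitch devices typically appear as NVIDIA bridge devices on the PCI bus
lspci | grep -i nvidia | grep -i bridge

# Check whether any libnvidia-nscq package is installed
rpm -qa | grep -i nscq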

SomePersonSomeWhereInTheWorld commented 3 months ago

Well we load this as a module that Nvidia Bright Computing supplies. All I see is:

find /cm/local/apps/cuda-dcgm -name '*vswitch*'
/cm/local/apps/cuda-dcgm/3.1.3.1/lib64/libdcgmmodulenvswitch.so
/cm/local/apps/cuda-dcgm/3.1.3.1/lib64/libdcgmmodulenvswitch.so.3
/cm/local/apps/cuda-dcgm/3.1.3.1/lib64/libdcgmmodulenvswitch.so.3.1.3

And no sign of libnvidia-nscq.
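
For completeness, a couple of other places one could look for the library outside the Bright module tree (illustrative commands, not from the thread):

# Is any nscq library registered with the dynamic linker?
ldconfig -p | grep -i nscq

# Or search the usual system library paths directly
find /usr/lib64 /usr/lib -name 'libnvidia-nscq*' 2>/dev/null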

nikkon-dev commented 3 months ago

@SomePersonSomeWhereInTheWorld,

libnvidia-nscq is not part of DCGM; it's a library required for NVSwitch / Fabric Manager to work correctly. You can find a proper package, for example, here: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/

I'm asking because on a system without NvSwitches, those logs should not be written more than once. But if DCGM detects NvSwitches in the system, it tries to enumerate them again and again; without a proper nscq library (whose version must precisely match the installed driver), it fails to initialize, and the error log keeps growing.
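
Since the nscq version has to track the driver version, here is a sketch of how one might pick the right package; it assumes the NVIDIA CUDA yum/dnf repository is configured, and the package glob is illustrative:

# Installed driver version, e.g. 530.30.02
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# See which libnvidia-nscq builds the configured repos offer, then pick the one matching the driver branch
dnf list available 'libnvidia-nscq*'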

SomePersonSomeWhereInTheWorld commented 3 months ago

> libnvidia-nscq is not part of DCGM; it's a library required for NVSwitch / Fabric Manager to work correctly. You can find a proper package, for example, here: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/

We're on RHEL, so I see this yum packaging page.

> But if DCGM detects NvSwitches in the system, it tries to enumerate them again and again; without a proper nscq library (whose version must precisely match the installed driver), it fails to initialize, and the error log keeps growing.

We are on NVIDIA-SMI 530.30.02, Driver Version 530.30.02, CUDA Version 12.1. I don't see a version 530.30.02 in the tar archives. Or is there a different driver version you are referring to?

nikkon-dev commented 3 months ago

@SomePersonSomeWhereInTheWorld

Could you try this? https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/libnvidia-nscq-530-530.30.02-1.x86_64.rpm
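
A sketch of installing that RPM directly on a RHEL-family system; it assumes root (or sudo) access, and uses dnf's ability to install straight from a URL:

sudo dnf install \
  https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/libnvidia-nscq-530-530.30.02-1.x86_64.rpm

# Confirm the library is now visible to the dynamic linker
ldconfig -p | grep -i nscq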

kmittman commented 3 months ago

Hi, 530.30.02 shipped bundled with CUDA, so the binary archive tarball is here: https://developer.download.nvidia.com/compute/cuda/redist/libnvidia_nscq/linux-x86_64/
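
If the tarball route is preferred over the RPM, a rough sketch; the archive file name and its lib/ layout are assumptions, so check the listing at the URL above for the exact name:

# Hypothetical archive name; substitute the real one from the redist listing
tar -xf libnvidia_nscq-linux-x86_64-530.30.02-archive.tar.xz

# Copy the shared libraries (preserving symlinks) into a directory on the linker path, then refresh the cache
sudo cp -P libnvidia_nscq-linux-x86_64-530.30.02-archive/lib/libnvidia-nscq.so* /usr/lib64/
sudo ldconfig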

SomePersonSomeWhereInTheWorld commented 3 months ago

OK, the RPM worked! What's the proper configuration now that it's installed? Can you point me to some instructions, ideally for RHEL?

nikkon-dev commented 3 months ago

@SomePersonSomeWhereInTheWorld,

The instructions are the same. You did not find the tarball initially because the compute/nvidia-driver location only has TRD drivers, and your installed driver is a developer driver that ships only with the CUDA SDK (thus, you need to get the tarballs from compute/cuda instead). I will see if the documentation should be updated to be clearer about this.
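
To verify the fix after installing the matching nscq package, a minimal sketch; the systemd unit name is an assumption, and under Bright the nv-hostengine service may be managed differently:

# Restart the host engine so it picks up the new library (unit name assumed)
sudo systemctl restart nvidia-dcgm

# The NvSwitch error should no longer repeat
tail -f /var/log/nv-hostengine.log

# DCGM should now enumerate the switches
dcgmi discovery -l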