NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

dcgm nvlink metrics not available on dcgm 3.1.3 #119

Open luccabb opened 11 months ago

luccabb commented 11 months ago

the nvidia-dcgm doc says that metrics like DCGM_FI_PROF_NVLINK_L{id}_TX_BYTES should be available on dcgm 3.1

I'm getting the following error when trying to query them (from dcgmi 3.1.3):

$ dcgmi dmon -d 100 -e 1040
#Entity   NVL0T                       
ID                                    
Error setting watches. Result: -6: Feature not supported
$ dcgmi -v | grep Version
Version : 3.1.3
Version : 3.1.3

is this expected? am I missing intermediate steps to enable the metrics?

dbeer commented 11 months ago

@luccabb what is the output of nvidia-smi? What GPU generation are you using?

luccabb commented 11 months ago

@dbeer

What GPU generation are you using?

NVIDIA A100-SXM4-40GB

luccabb commented 11 months ago

what is the output of nvidia-smi?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0 
...

luccabb commented 6 months ago

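per https://github.com/NVIDIA/DCGM/issues/149#issuecomment-1922398817, these per-link NVLink metrics are only available on Hopper+ GPUs, so the "Feature not supported" error on the A100 (Ampere) is expected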

surfacing this on the dcgm docs would be helpful

cc: @dbeer @nikkon-dev
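
For anyone hitting the same error, here is a rough sketch of a guard that skips the per-link NVLink field on pre-Hopper GPUs. It assumes "Hopper+" maps to a CUDA compute capability major version of 9 or higher (A100 reports 8.0, H100 reports 9.0) and that the installed driver supports the compute_cap query field in nvidia-smi. The helper name supports_nvlink_link_fields is made up for illustration and is not part of DCGM.

```shell
#!/usr/bin/env bash
# Sketch only: decide whether to watch field 1040 (NVL0T in the dmon header above).
# Assumption: "Hopper+" means compute capability major >= 9.

# Hypothetical helper, not part of DCGM or nvidia-smi.
supports_nvlink_link_fields() {
    local compute_cap="$1"            # e.g. "8.0" (A100) or "9.0" (H100)
    local major="${compute_cap%%.*}"  # keep only the part before the first dot
    [ "$major" -ge 9 ] 2>/dev/null    # non-numeric input counts as unsupported
}

# Only touch the hardware when both tools are installed.
if command -v nvidia-smi >/dev/null 2>&1 && command -v dcgmi >/dev/null 2>&1; then
    # Compute capability of GPU 0; mixed-generation hosts would need a per-GPU loop.
    cap="$(nvidia-smi -i 0 --query-gpu=compute_cap --format=csv,noheader)"
    if supports_nvlink_link_fields "$cap"; then
        dcgmi dmon -d 100 -e 1040
    else
        echo "GPU 0 (compute capability ${cap:-unknown}) is pre-Hopper; skipping field 1040" >&2
    fi
fi
```

On the A100 from this issue the guard would take the else branch instead of failing with "Error setting watches".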