NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

nvlink metrics are not available on the gh200 gpu node #336

Open AnjirwalaAnuj opened 3 months ago

AnjirwalaAnuj commented 3 months ago

Ask your question

I am running dcgm-exporter in a Docker container on a GH200 GPU node. However, dcgm-exporter cannot discover the NvSwitch and NvLink devices, so it does not export any NvLink metrics. I'm using the nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04 image, which is the latest. Does this version of dcgm-exporter support NvLink metrics on GH200 GPU nodes? If so, is any extra configuration required to get NvLink metrics?

Below are the dcgm-exporter container logs:

sudo docker logs dcgm-exporter
2024/05/31 03:42:29 maxprocs: Leaving GOMAXPROCS=72: CPU quota undefined
time="2024-05-31T03:42:29Z" level=info msg="Starting dcgm-exporter"
time="2024-05-31T03:42:29Z" level=info msg="DCGM successfully initialized!"
time="2024-05-31T03:42:29Z" level=info msg="Collecting DCP Metrics"
time="2024-05-31T03:42:29Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/default-counters.csv'"
time="2024-05-31T03:42:29Z" level=info msg="Initializing system entities of type: GPU"
time="2024-05-31T03:42:29Z" level=info msg="Initializing system entities of type: NvSwitch"
time="2024-05-31T03:42:29Z" level=info msg="Not collecting NvSwitch metrics; no switches to monitor"
time="2024-05-31T03:42:29Z" level=info msg="Initializing system entities of type: NvLink"
time="2024-05-31T03:42:29Z" level=info msg="Not collecting NvLink metrics; no switches to monitor"
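
For anyone triaging similar output, the entity types that were skipped can be pulled out of the log mechanically. This is only a rough Python sketch and not part of dcgm-exporter: the function name `skipped_entity_messages` is made up for illustration, and it assumes the `msg="..."` layout and the "Not collecting" prefix seen in the log above.

```python
import re

# Matches the msg="..." field of a log line (assumed layout, taken from
# the log above). Escaped quotes inside the message are tolerated.
MSG_RE = re.compile(r'msg="((?:[^"\\]|\\.)*)"')


def skipped_entity_messages(log_text):
    """Return every msg value that starts with 'Not collecting'.

    Hypothetical helper for illustration; works whether the log lines are
    separated by newlines or run together on a single line.
    """
    return [m for m in MSG_RE.findall(log_text) if m.startswith("Not collecting")]


# Two lines copied from the log above.
sample = (
    'time="2024-05-31T03:42:29Z" level=info msg="Initializing system entities of type: NvLink"\n'
    'time="2024-05-31T03:42:29Z" level=info msg="Not collecting NvLink metrics; no switches to monitor"\n'
)
for message in skipped_entity_messages(sample):
    print(message)
```

Passing the full text of the container log to `skipped_entity_messages` returns one entry per entity type that the exporter decided not to monitor.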

glowkey commented 3 months ago

Note that nvswitches and nvlinks may not automatically be mounted inside the container. See https://github.com/NVIDIA/dcgm-exporter/issues/316#issuecomment-2087369233

AnjirwalaAnuj commented 3 months ago

Thank you for your reply. I tried mounting the nvswitch and nvlink devices into the dcgm-exporter container by following https://github.com/NVIDIA/dcgm-exporter/issues/169#issuecomment-1604771610. However, I don't see nvidia-nvswitch* and nvidia-nvlink device files under the /dev directory on GH200 nodes. I also tried running the dcgm-exporter binary (built from source), and it still couldn't discover the nvswitches and nvlinks.
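
To make the /dev check reproducible on other nodes, it can be scripted. The following is a minimal Python sketch, not something shipped with dcgm-exporter: the function name `find_nvlink_device_nodes` is hypothetical, and the glob patterns are simply the device file names mentioned above (with a trailing wildcard added to the nvlink name as an assumption).

```python
import glob
import os


def find_nvlink_device_nodes(dev_dir="/dev"):
    """Return sorted paths under dev_dir that look like NVSwitch/NVLink device files.

    Hypothetical helper for illustration. The patterns are assumptions
    based on the file names mentioned in this thread.
    """
    patterns = ("nvidia-nvswitch*", "nvidia-nvlink*")
    matches = []
    for pattern in patterns:
        matches.extend(glob.glob(os.path.join(dev_dir, pattern)))
    return sorted(matches)


if __name__ == "__main__":
    nodes = find_nvlink_device_nodes()
    if nodes:
        for path in nodes:
            print(path)
    else:
        print("no nvidia-nvswitch* or nvidia-nvlink* device files found under /dev")
```

Running it on the host and again inside the container shows whether the device files are missing on the node itself or only inside the container.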