NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs.
Apache License 2.0

No NVLINK activity on DGX-A100 320GB #149

Open itzsimpl opened 8 months ago

itzsimpl commented 8 months ago

We use dcgm-exporter 3.3.3-3.3.0, nv-hostengine & dcgmi 3.3.3, NVIDIA drivers 535.154.05, and DGX OS 6 on a DGX-A100 320GB. The CSV contains:

DCGM_FI_PROF_NVLINK_TX_BYTES,                    gauge, The rate of data not including protocol headers transmitted over NVLink (in B/s).
DCGM_FI_PROF_NVLINK_RX_BYTES,                    gauge, The rate of data not including protocol headers received over NVLink (in B/s).

However, the exporter always returns 0:

# curl -s localhost:9400/metrics | grep NVLINK
# HELP DCGM_FI_PROF_NVLINK_TX_BYTES The rate of data not including protocol headers transmitted over NVLink (in B/s).
# TYPE DCGM_FI_PROF_NVLINK_TX_BYTES gauge
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="0",UUID="GPU-715daa1d-db6f-9e69-ab48-190158bd5360",device="nvidia0",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="1",UUID="GPU-02348a17-a825-300c-0336-48e33d0dadb2",device="nvidia1",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="2",UUID="GPU-fbd9a227-e473-b993-215f-8f39b3574fd0",device="nvidia2",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="3",UUID="GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e",device="nvidia3",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="4",UUID="GPU-2a15688f-4b5f-999c-48dc-e9ec78b78531",device="nvidia4",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="5",UUID="GPU-995a8ef3-32b6-2e07-be4f-ac9d0371a7f1",device="nvidia5",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="6",UUID="GPU-88981248-fa05-f000-d761-05c8de30c8c6",device="nvidia6",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_TX_BYTES{gpu="7",UUID="GPU-f7bbbbcd-f23c-ad4f-f27b-043995ee3fb8",device="nvidia7",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
# HELP DCGM_FI_PROF_NVLINK_RX_BYTES The rate of data not including protocol headers received over NVLink (in B/s).
# TYPE DCGM_FI_PROF_NVLINK_RX_BYTES gauge
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="0",UUID="GPU-715daa1d-db6f-9e69-ab48-190158bd5360",device="nvidia0",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="1",UUID="GPU-02348a17-a825-300c-0336-48e33d0dadb2",device="nvidia1",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="2",UUID="GPU-fbd9a227-e473-b993-215f-8f39b3574fd0",device="nvidia2",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="3",UUID="GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e",device="nvidia3",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="4",UUID="GPU-2a15688f-4b5f-999c-48dc-e9ec78b78531",device="nvidia4",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="5",UUID="GPU-995a8ef3-32b6-2e07-be4f-ac9d0371a7f1",device="nvidia5",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="6",UUID="GPU-88981248-fa05-f000-d761-05c8de30c8c6",device="nvidia6",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0
DCGM_FI_PROF_NVLINK_RX_BYTES{gpu="7",UUID="GPU-f7bbbbcd-f23c-ad4f-f27b-043995ee3fb8",device="nvidia7",modelName="NVIDIA A100-SXM4-40GB",Hostname="axa"} 0

dcgmi reports zeros as well:

# dcgmi dmon -e 1011,1012
#Entity   NVLTX                       NVLRX                       
ID                                                                
GPU 7     0                           0                           
GPU 6     0                           0                           
GPU 5     0                           0                           
GPU 4     0                           0                           
GPU 3     0                           0                           
GPU 2     0                           0                           
GPU 1     0                           0                           
GPU 0     0                           0                           
GPU 7     0                           0                           
GPU 6     0                           0                           
GPU 5     0                           0                           
GPU 4     0                           0                           
GPU 3     0                           0                           
GPU 2     0                           0                           
GPU 1     0                           0                           
GPU 0     0                           0                           
GPU 7     0                           0                           
GPU 6     0                           0                           
GPU 5     0                           0                           
GPU 4     0                           0                           
GPU 3     0                           0                           
GPU 2     0                           0                           
GPU 1     0                           0                           
GPU 0     0                           0                           
GPU 7     0                           0                           
GPU 6     0                           0                           
GPU 5     0                           0                           
GPU 4     0                           0                           
GPU 3     0                           0                           
GPU 2     0                           0                           
GPU 1     0                           0                           
GPU 0     0                           0                           

At least as of 18.1.2024 the data used to be there. Since then, there have been a couple of updates to packages, drivers, ..., and DCGM.

itzsimpl commented 8 months ago

The system is running the latest DGX OS 6.1, with the latest firmware and all Ubuntu updates applied.

In nv-hostengine.log I see the following errors:

2024-02-01 14:12:01.205 ERROR [9596:9598] [[SysMon]] Couldn't open CPU Vendor info file file '/sys/devices/soc0/soc_id' for reading: '@^V���^?' [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/sysmon/DcgmSystemMonitor.cpp:261] [DcgmSystemMonitor::ReadCpuVendorAndModel]
2024-02-01 14:12:01.207 ERROR [9596:9598] [[SysMon]] A runtime exception occured when creating module. Ex: Incompatible hardware vendor for sysmon. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-01 14:12:01.207 ERROR [9596:9598] Failed to load module 9 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1831] [DcgmHostEngineHandler::LoadModule]
2024-02-01 14:26:13.984 ERROR [9694:9696] [[SysMon]] Couldn't open CPU Vendor info file file '/sys/devices/soc0/soc_id' for reading: '@F��H^?' [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/sysmon/DcgmSystemMonitor.cpp:261] [DcgmSystemMonitor::ReadCpuVendorAndModel]
2024-02-01 14:26:13.986 ERROR [9694:9696] [[SysMon]] A runtime exception occured when creating module. Ex: Incompatible hardware vendor for sysmon. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-01 14:26:13.986 ERROR [9694:9696] Failed to load module 9 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1831] [DcgmHostEngineHandler::LoadModule]
2024-02-01 14:26:27.640 ERROR [9694:14170] [[NvSwitch]] NSCQ field Id 0 passed error -5 for device 0x9c9a90 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:1102] [DcgmNs::DcgmNvSwitchManager::ReadLinkStatesAllSwitches]
2024-02-01 14:26:27.640 ERROR [9694:14170] [[NvSwitch]] NSCQ field Id 0 passed error -5 for device 0x9c9a90 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:1102] [DcgmNs::DcgmNvSwitchManager::ReadLinkStatesAllSwitches]
2024-02-01 14:26:27.641 ERROR [9694:14170] [[NvSwitch]] NSCQ field Id 0 passed error -5 for device 0x9c9a90 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:1102] [DcgmNs::DcgmNvSwitchManager::ReadLinkStatesAllSwitches]
...
2024-02-01 14:26:27.736 ERROR [9694:14170] [[NvSwitch]] NSCQ field Id 0 passed error -5 for device 0x9c9b60 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:1102] [DcgmNs::DcgmNvSwitchManager::ReadLinkStatesAllSwitches]
2024-02-01 15:06:56.596 ERROR [9694:9695] Received This request is serviced by a module of DCGM that is not currently loaded [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:227] [DcgmHostEngineHandler::GetAllEntitiesOfEntityGroup]
2024-02-01 15:15:17.731 ERROR [9694:14440] [[Profiling]] FieldId {1040} is not supported for GPU 0 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2709] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ConvertFieldIdsToMetricIds]
2024-02-01 15:15:17.731 ERROR [9694:14440] [[Profiling]] Unable to reconfigure LOP metric watches for GpuId {0} [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2740] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ChangeWatchSt>
2024-02-01 15:15:17.819 ERROR [9694:9696] DCGM_PROFILING_SR_WATCH_FIELDS failed with -6 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3710] [DcgmHostEngineHandler::WatchFieldGroup]
nikkon-dev commented 8 months ago

From the logs I see that the DCGM_FI_PROF_NVLINK_L0_TX_BYTES (1040) field was used instead of DCGM_FI_PROF_NVLINK_TX_BYTES (1011): [[Profiling]] FieldId {1040} is not supported for GPU 0

DCGM_FI_PROF_NVLINK_L0_TX_BYTES is only supported on Hopper and newer GPUs.
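
To double-check which profiling fields this GPU generation supports, and to sample the A100-compatible aggregate NVLink counters, something along these lines should work (exact dcgmi flags can vary slightly between DCGM versions):

# List the profiling metrics supported on this system;
# DCGM_FI_PROF_NVLINK_L0_TX_BYTES (1040) should not be listed for A100
dcgmi profile --list

# Sample the aggregate NVLink counters (1011 = TX, 1012 = RX) once per second, 10 samples
dcgmi dmon -e 1011,1012 -d 1000 -c 10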

itzsimpl commented 8 months ago

@nikkon-dev, I apologise, that was my bad; while preparing this issue, I ran (based on https://github.com/NVIDIA/DCGM/issues/119) the command

dcgmi dmon -d 100 -e 1040

and received the same output. With -e 1011 and/or -e 1012 I do receive data, but it is always 0, which it shouldn't be: I am running a dummy LLM training, and the same training on an 8x A100 80GB PCIe + NVLink system and on a DGX-H100, both with the same dcgm-exporter setup, shows NVLink heavily in use.
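
To rule out the workload itself, NVLink traffic can also be forced directly while watching the counters, e.g. with nccl-tests (assuming nccl-tests is built on the node; the binary path and message sizes below are only illustrative):

# Drive NVLink with an 8-GPU all-reduce from nccl-tests
./build/all_reduce_perf -b 512M -e 1G -f 2 -g 8

# In a second shell, watch the aggregate NVLink counters
dcgmi dmon -e 1011,1012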

This is what I see in the dcgm-exporter container logs:

# docker logs docker.dcgm-exporter.service
time="2024-02-02T07:56:31Z" level=info msg="Starting dcgm-exporter"
time="2024-02-02T07:56:31Z" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
time="2024-02-02T07:56:31Z" level=info msg="DCGM successfully initialized!"
time="2024-02-02T07:56:31Z" level=info msg="Collecting DCP Metrics"
time="2024-02-02T07:56:31Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
time="2024-02-02T07:56:31Z" level=info msg="Initializing system entities of type: GPU"
time="2024-02-02T07:56:33Z" level=info msg="Initializing system entities of type: NvSwitch"
time="2024-02-02T07:56:33Z" level=info msg="Not collecting switch metrics: No fields to watch for device type: 3"
time="2024-02-02T07:56:33Z" level=info msg="Initializing system entities of type: NvLink"
time="2024-02-02T07:56:33Z" level=info msg="Not collecting link metrics: No fields to watch for device type: 6"
time="2024-02-02T07:56:33Z" level=info msg="Initializing system entities of type: CPU"
time="2024-02-02T07:56:33Z" level=info msg="Not collecting cpu metrics: Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-02-02T07:56:33Z" level=info msg="Initializing system entities of type: CPU Core"
time="2024-02-02T07:56:33Z" level=info msg="Not collecting cpu core metrics: Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-02-02T07:56:33Z" level=info msg="Pipeline starting"
time="2024-02-02T07:56:33Z" level=info msg="Starting webserver"
level=info ts=2024-02-02T07:56:33.674Z caller=tls_config.go:313 msg="Listening on" address=[::]:9400
level=info ts=2024-02-02T07:56:33.674Z caller=tls_config.go:316 msg="TLS is disabled." http2=false address=[::]:9400

The command used to run the exporter is:

/usr/bin/docker run --rm --gpus all --net host --cap-add=SYS_ADMIN --cpus=0.5 --name docker.dcgm-exporter.service -p 9400:9400 -v "/opt/deepops/nvidia-dcgm-exporter/dcgm-custom-metrics.csv:/etc/dcgm-exporter/default-counters.csv" nvcr.io/nvidia/k8s/dcgm-exporter:3.3.3-3.3.0-ubuntu22.04 -r localhost:5555 -f /etc/dcgm-exporter/default-counters.csv

The file /etc/dcgm-exporter/default-counters.csv contains the DCGM_FI_PROF_NVLINK_TX_BYTES and DCGM_FI_PROF_NVLINK_RX_BYTES fields.

Let me know if you need me to collect more data.

nikkon-dev commented 8 months ago

I see you are running nv-hostengine on port 5555. Could you rerun it with the -f host.debug.log --log-level debug arguments and provide host.debug.log after dcgm-exporter starts reporting metrics, or after running the dcgmi dmon -e 1011 command? Could you also provide the topology output from nvidia-smi?
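
On a systemd-managed install this could be done roughly as follows (assuming nv-hostengine is started by the stock nvidia-dcgm unit; adjust the unit name if yours differs):

# Stop the systemd-managed hostengine
systemctl stop nvidia-dcgm

# Start it by hand with debug logging, then reproduce via dcgm-exporter or dcgmi dmon -e 1011
nv-hostengine -f host.debug.log --log-level debug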

itzsimpl commented 8 months ago

nvidia-smi topology

# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3               N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3               N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     16-31,144-159   1               N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     16-31,144-159   1               N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     112-127,240-255 7               N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     112-127,240-255 7               N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     80-95,208-223   5               N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     80-95,208-223   5               N/A
NIC0    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS
NIC3    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS
NIC5    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS
NIC7    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS
NIC8    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX
NIC9    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9

The host.debug.log is attached. I obtained it by stopping the dcgm-exporter service, stopping the nvidia-dcgm service, adding the arguments to nv-hostengine, restarting the nvidia-dcgm service, restarting the dcgm-exporter service, and then also running the command from the CLI.

host.debug.log

itzsimpl commented 8 months ago

Weird. After the second restart of both services I noticed that it started working again. So I did the following: modified the nv-hostengine service to include the log arguments, rebooted the system, and started the job (NVLink data from dcgm-exporter is all 0, and dcgmi run from the CLI also shows all 0). This is captured in boot_host.debug.log.zip.

Then I stopped and restarted both services again; NVLink data started being collected correctly, and both dcgm-exporter and dcgmi run from the CLI returned values other than 0. This is captured in restart_host.debug.log.zip.

What could be the cause (start-up order?), and how can it be resolved?
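
If start-up order does turn out to be the cause, one workaround I could try is a systemd drop-in that forces the exporter unit to start only after the hostengine, roughly like this (the docker.dcgm-exporter.service unit name is assumed from this setup):

# Add an ordering dependency to the exporter unit (opens an override file)
systemctl edit docker.dcgm-exporter.service
# In the override add:
#   [Unit]
#   After=nvidia-dcgm.service
#   Requires=nvidia-dcgm.service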