itzsimpl opened this issue 8 months ago
The system runs the latest DGX OS 6.1, with the latest firmware and all Ubuntu updates applied.
In nv-hostengine.log I see the following errors:
2024-02-01 14:12:01.205 ERROR [9596:9598] [[SysMon]] Couldn't open CPU Vendor info file file '/sys/devices/soc0/soc_id' for reading: '@^V���^?' [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/sysmon/DcgmSystemMonitor.cpp:261] [DcgmSystemMonitor::ReadCpuVendorAndModel]
2024-02-01 14:12:01.207 ERROR [9596:9598] [[SysMon]] A runtime exception occured when creating module. Ex: Incompatible hardware vendor for sysmon. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-01 14:12:01.207 ERROR [9596:9598] Failed to load module 9 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1831] [DcgmHostEngineHandler::LoadModule]
2024-02-01 14:26:13.984 ERROR [9694:9696] [[SysMon]] Couldn't open CPU Vendor info file file '/sys/devices/soc0/soc_id' for reading: '@F��H^?' [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/sysmon/DcgmSystemMonitor.cpp:261] [DcgmSystemMonitor::ReadCpuVendorAndModel]
2024-02-01 14:26:13.986 ERROR [9694:9696] [[SysMon]] A runtime exception occured when creating module. Ex: Incompatible hardware vendor for sysmon. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-01 14:26:13.986 ERROR [9694:9696] Failed to load module 9 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1831] [DcgmHostEngineHandler::LoadModule]
2024-02-01 14:26:27.640 ERROR [9694:14170] [[NvSwitch]] NSCQ field Id 0 passed error -5 for device 0x9c9a90 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:1102] [DcgmNs::DcgmNvSwitchManager::ReadLinkStatesAllSwitches]
2024-02-01 14:26:27.640 ERROR [9694:14170] [[NvSwitch]] NSCQ field Id 0 passed error -5 for device 0x9c9a90 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:1102] [DcgmNs::DcgmNvSwitchManager::ReadLinkStatesAllSwitches]
2024-02-01 14:26:27.641 ERROR [9694:14170] [[NvSwitch]] NSCQ field Id 0 passed error -5 for device 0x9c9a90 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:1102] [DcgmNs::DcgmNvSwitchManager::ReadLinkStatesAllSwitches]
...
2024-02-01 14:26:27.736 ERROR [9694:14170] [[NvSwitch]] NSCQ field Id 0 passed error -5 for device 0x9c9b60 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/nvswitch/DcgmNvSwitchManager.cpp:1102] [DcgmNs::DcgmNvSwitchManager::ReadLinkStatesAllSwitches]
2024-02-01 15:06:56.596 ERROR [9694:9695] Received This request is serviced by a module of DCGM that is not currently loaded [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:227] [DcgmHostEngineHandler::GetAllEntitiesOfEntityGroup]
2024-02-01 15:15:17.731 ERROR [9694:14440] [[Profiling]] FieldId {1040} is not supported for GPU 0 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2709] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ConvertFieldIdsToMetricIds]
2024-02-01 15:15:17.731 ERROR [9694:14440] [[Profiling]] Unable to reconfigure LOP metric watches for GpuId {0} [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:2740] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ChangeWatchSt>
2024-02-01 15:15:17.819 ERROR [9694:9696] DCGM_PROFILING_SR_WATCH_FIELDS failed with -6 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3710] [DcgmHostEngineHandler::WatchFieldGroup]
From the logs I see that the DCGM_FI_PROF_NVLINK_L0_TX_BYTES (1040) field was used instead of DCGM_FI_PROF_NVLINK_TX_BYTES (1011):
[[Profiling]] FieldId {1040} is not supported for GPU 0
DCGM_FI_PROF_NVLINK_L0_TX_BYTES is only supported on Hopper and newer GPUs.
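The situation from this thread can be summarized in a small sketch. The field IDs and names below are the ones quoted above; the per-A100 support flags simply encode what the logs show (1040 rejected on an A100, 1011/1012 accepted) and are not a DCGM API:

```python
# Hedged sketch: NVLink profiling fields mentioned in this thread, and
# whether they work on an A100 (Ampere) GPU according to the errors above.
# The booleans summarize the thread's findings, not an official DCGM table.
NVLINK_FIELDS = {
    1011: ("DCGM_FI_PROF_NVLINK_TX_BYTES", True),      # aggregate TX, works on A100
    1012: ("DCGM_FI_PROF_NVLINK_RX_BYTES", True),      # aggregate RX, works on A100
    1040: ("DCGM_FI_PROF_NVLINK_L0_TX_BYTES", False),  # per-link TX, Hopper+ only
}

def supported_on_a100(field_id: int) -> bool:
    """True if this thread's logs show the field working on an A100."""
    return NVLINK_FIELDS[field_id][1]

print(supported_on_a100(1011))  # True  -> use this on DGX-A100
print(supported_on_a100(1040))  # False -> "FieldId {1040} is not supported"
```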
@nikkon-dev, I apologise, that was my bad; while preparing this issue I ran (based on https://github.com/NVIDIA/DCGM/issues/119) the command
dcgmi dmon -d 100 -e 1040
and received the same output. With -e 1011
and/or -e 1012
I do receive data, but it is always 0. That should not be the case: I am running a dummy LLM training, and the same training on an 8x A100 80GB PCIe + NVLink system and on a DGX-H100, all with the same dcgm-exporter setup, shows NVLink heavily in use.
This is what I see in the dcgm-exporter container logs:
# docker logs docker.dcgm-exporter.service
time="2024-02-02T07:56:31Z" level=info msg="Starting dcgm-exporter"
time="2024-02-02T07:56:31Z" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
time="2024-02-02T07:56:31Z" level=info msg="DCGM successfully initialized!"
time="2024-02-02T07:56:31Z" level=info msg="Collecting DCP Metrics"
time="2024-02-02T07:56:31Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
time="2024-02-02T07:56:31Z" level=info msg="Initializing system entities of type: GPU"
time="2024-02-02T07:56:33Z" level=info msg="Initializing system entities of type: NvSwitch"
time="2024-02-02T07:56:33Z" level=info msg="Not collecting switch metrics: No fields to watch for device type: 3"
time="2024-02-02T07:56:33Z" level=info msg="Initializing system entities of type: NvLink"
time="2024-02-02T07:56:33Z" level=info msg="Not collecting link metrics: No fields to watch for device type: 6"
time="2024-02-02T07:56:33Z" level=info msg="Initializing system entities of type: CPU"
time="2024-02-02T07:56:33Z" level=info msg="Not collecting cpu metrics: Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-02-02T07:56:33Z" level=info msg="Initializing system entities of type: CPU Core"
time="2024-02-02T07:56:33Z" level=info msg="Not collecting cpu core metrics: Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-02-02T07:56:33Z" level=info msg="Pipeline starting"
time="2024-02-02T07:56:33Z" level=info msg="Starting webserver"
level=info ts=2024-02-02T07:56:33.674Z caller=tls_config.go:313 msg="Listening on" address=[::]:9400
level=info ts=2024-02-02T07:56:33.674Z caller=tls_config.go:316 msg="TLS is disabled." http2=false address=[::]:9400
The command used to run it is
/usr/bin/docker run --rm --gpus all --net host --cap-add=SYS_ADMIN --cpus=0.5 --name docker.dcgm-exporter.service -p 9400:9400 -v "/opt/deepops/nvidia-dcgm-exporter/dcgm-custom-metrics.csv:/etc/dcgm-exporter/default-counters.csv" nvcr.io/nvidia/k8s/dcgm-exporter:3.3.3-3.3.0-ubuntu22.04 -r localhost:5555 -f /etc/dcgm-exporter/default-counters.csv
The file /etc/dcgm-exporter/default-counters.csv contains the DCGM_FI_PROF_NVLINK_TX_BYTES and DCGM_FI_PROF_NVLINK_RX_BYTES fields.
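For reference, a minimal counters file carrying just those two fields would look something like this (dcgm-exporter's CSV format is `field name, Prometheus metric type, help string`; the help strings here are my own wording, not the stock file's):

```csv
# Format: DCGM field, Prometheus metric type, help string
DCGM_FI_PROF_NVLINK_TX_BYTES, counter, Total NVLink bytes transmitted.
DCGM_FI_PROF_NVLINK_RX_BYTES, counter, Total NVLink bytes received.
```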
Let me know if you need me to collect more data.
I see you are running nv-hostengine on port 5555. Could you rerun it with the -f host.debug.log --log-level debug
arguments and provide host.debug.log after dcgm-exporter starts reporting metrics, or after running the dcgmi dmon -e 1011
command?
Could you also provide the topology output from nvidia-smi (nvidia-smi topo -m)?
# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31,144-159 1 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31,144-159 1 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5 N/A
NIC0 PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS SYS SYS
NIC2 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS
NIC3 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS
NIC7 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS
NIC8 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX
NIC9 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
The host.debug.log is attached. I obtained it by stopping the dcgm-exporter service, stopping the nvidia-dcgm service, adding the arguments to nv-hostengine, restarting the nvidia-dcgm service, restarting the dcgm-exporter service, and then also running the command from the CLI.
Weird. After the second restart of both services I noticed that it started working again. So I did the following: modified the nv-hostengine service to include the log, rebooted the system, and started the job (NVLink data from dcgm-exporter is all 0, and dcgmi run from the CLI also shows all 0). This is in boot_host.debug.log.zip.
Then I stopped both services and restarted them again; NVLink data started being collected correctly, and both dcgm-exporter and dcgmi run from the CLI returned values other than 0. This is in restart_host.debug.log.zip.
What could be the cause (start-up order?) and how can it be resolved?
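One way to test the start-up-order theory is to make the exporter's unit wait for the DCGM host engine explicitly. The unit names below are hypothetical (adjust them to whatever the services are actually called on DGX OS); this is a sketch of a systemd drop-in, not a verified fix:

```ini
# Hypothetical drop-in, e.g.
# /etc/systemd/system/docker.dcgm-exporter.service.d/override.conf
# Delays the dockerized exporter until the DCGM host engine (and, on
# NVSwitch systems, the fabric manager) have been started first.
[Unit]
After=nvidia-dcgm.service nvidia-fabricmanager.service
Wants=nvidia-dcgm.service
```

If the zeros disappear with this ordering in place, that would point at the exporter (or nv-hostengine) coming up before the NVSwitch/fabric stack is ready at boot.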
We use dcgm-exporter 3.3.3-3.3.0, nv-hostengine & dcgmi 3.3.3, NVIDIA drivers 535.154.05, and DGX OS 6 on a DGX-A100 320GB. The csv contains the NVLink fields;
however, the exporter always returns 0,
and dcgmi does as well.
At least as of 18.1.2024 the data used to be there. Since then there have been a couple of updates to packages, drivers, ..., and DCGM.