Open jelmd opened 3 years ago
I too have run into this issue running the datacenter-gpu-manager-2.0.13-1.x86_64 RPM and dcgm-exporter built from commit 6860bc83e609eb0c4eba6b5eca7af6c02d50d3b3.
dcgm-exporter/nv-hostengine seems to be very poor software. Totally crap.
Hi @jelmd - thanks for reporting this issue. Can you please provide more information on what your system configuration is? Specifically:
dcgm-exporter you're using.
This will help us investigate and respond better to your issue. Thanks
@dualvtable As I'm running into this same issue:
Great thanks - and when do you guys see this issue? Is that at container startup or do we start emitting these messages after a while? And one last question - did you observe this in prior releases? I'm trying to determine if this is a regression that we somehow introduced and missed in our test plan.
Thanks again.
Looking at our logs, the messages appear every ~2 seconds, with many messages per iteration. We had to deploy logrotate to aggressively rotate /var/log/nv-hostengine.log because some systems had their local disks filled up by this one log file. I do not recall this being an issue prior to the 2.x release, but the exact exporter version at which it began is unclear to me. Looking at the yum RPM history for DCGM, it looks like when we upgraded to 2.1.1 of the exporter we also upgraded to datacenter-gpu-manager RPM 2.0.13. Prior to 2.0.13 it looks like we had 1.7.2; which exporter version we ran at that time is less clear because our initial deployment was a fork. The commit I authored and was running locally was ed27d32dd12000dc10f78e4aea632367eae54c17, merged in merge request 25 on GitLab, I believe.
Same thing here - produces trash all the time. Log size is ~ 4640 B/s, so ~ 383 MiB/d =8-(
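For anyone who wants to redo this estimate with their own numbers, here is a minimal sketch of the arithmetic. It assumes a constant write rate; the function name is made up for illustration.

```python
# Convert a sustained log write rate into daily growth.
BYTES_PER_SECOND = 4640          # rate quoted in the comment above
SECONDS_PER_DAY = 24 * 60 * 60   # 86400 seconds
MIB = 1024 ** 2                  # bytes per MiB


def daily_growth_mib(bytes_per_second: float) -> float:
    """Return log growth in MiB per day, assuming a constant write rate."""
    return bytes_per_second * SECONDS_PER_DAY / MIB


if __name__ == "__main__":
    print(f"{daily_growth_mib(BYTES_PER_SECOND):.1f} MiB/day")
```

At 4640 B/s this comes out a little above 382 MiB/day, which matches the rough ~383 MiB/d figure quoted above.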
Chiming in: while troubleshooting another issue, I noticed that my nv-hostengine.log is massive. It is 14.2 GiB and filled with:
2021-03-05 15:07:49.246 ERROR [1508:1552] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]
2021-03-05 15:07:49.246 ERROR [1508:2145] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:341] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-03-05 15:07:49.248 WARN [1508:1552] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:49.248 WARN [1508:1552] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:49.248 WARN [1508:1552] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:49.249 ERROR [1508:1552] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]
2021-03-05 15:07:49.249 ERROR [1508:2145] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:341] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-03-05 15:07:49.251 WARN [1508:1552] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:49.251 WARN [1508:1552] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:49.251 WARN [1508:1552] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:49.253 ERROR [1508:1552] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]
2021-03-05 15:07:49.253 ERROR [1508:2145] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:341] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-03-05 15:07:51.235 WARN [1508:2461] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:51.236 WARN [1508:2461] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:51.236 WARN [1508:2461] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
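To see which messages dominate a log like this, a small script can count lines per level, message, and reporting function. This is only a sketch: the line layout in the regular expression is inferred from the excerpt above, not taken from any DCGM documentation, and the names `LINE_RE` and `summarize` are hypothetical.

```python
import re
from collections import Counter

# Assumed layout, inferred from the excerpt above (not an official format):
#   <date> <time> <LEVEL> [<pid>:<tid>] <message> [<source file>:<line>] [<function>]
LINE_RE = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<level>[A-Z]+) \[(?P<pid>\d+):(?P<tid>\d+)\] "
    r"(?P<message>.*) \[(?P<source>[^\]]+):(?P<lineno>\d+)\] \[(?P<func>[^\]]+)\]$"
)

# Three lines copied from the excerpt above.
SAMPLE = [
    "2021-03-05 15:07:49.246 ERROR [1508:1552] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]",
    "2021-03-05 15:07:49.248 WARN [1508:1552] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]",
    "2021-03-05 15:07:49.248 WARN [1508:1552] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]",
]


def summarize(lines):
    """Count log lines by (level, message, function); unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match["level"], match["message"], match["func"])] += 1
    return counts


if __name__ == "__main__":
    for (level, message, func), n in summarize(SAMPLE).most_common():
        print(f"{n:6d}  {level:5s}  {message}  ({func})")
```

For a real file, pass an open file object instead of `SAMPLE`, for example `summarize(open("/var/log/nv-hostengine.log"))`; iterating line by line keeps memory use flat even for a multi-GiB log.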
Hi, could you also provide the set of metrics you are collecting using the dcgm-exporter?
The files we use for the exporter were copied from this repo at the time when the exporter was deployed:
Cluster with V100s having this issue:
[root@pitzer-rw02 ~]# cat /etc/dcgm-exporter/dcp-metrics-included.csv
# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message
# Clocks,,
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
# Power,,
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
# PCIE,,
DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
# Utilization (the sample period varies depending on the product),,
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
# Errors and violations,,
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
# ECC,,
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
# Retired pages,,
DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
# NVLink,,
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes
DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
# DCP metrics,,
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES, counter, The number of bytes of active pcie tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES, counter, The number of bytes of active pcie rx data including both header and payload.
Cluster with P100 having this issue:
[root@owens-rw01 ~]# cat /etc/dcgm-exporter/default-counters.csv
# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message
# Clocks,,
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
# Power,,
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
# PCIE,,
DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
# Utilization (the sample period varies depending on the product),,
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
# Errors and violations,,
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
# ECC,,
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
# Retired pages,,
DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
# NVLink,,
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes
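The counters files above follow the three-column layout described in their header comment (DCGM field, Prometheus metric type, help message, with `#` starting a comment). Below is a minimal sketch of reading such a file, for example to compare the field sets of two clusters. It is not how dcgm-exporter itself parses the file; `MetricSpec` and `parse_counters` are hypothetical names.

```python
from typing import Iterable, List, NamedTuple


class MetricSpec(NamedTuple):
    field: str        # DCGM field name, e.g. DCGM_FI_DEV_SM_CLOCK
    metric_type: str  # Prometheus metric type: gauge or counter
    help_text: str    # free-form help message


# A few lines in the same shape as the listings above.
SAMPLE = [
    "# Clocks,,",
    "DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).",
    "DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).",
]


def parse_counters(lines: Iterable[str]) -> List[MetricSpec]:
    """Parse counters-file lines, skipping blank lines and '#' comments."""
    specs = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first two commas only, so a help message may contain commas.
        parts = [part.strip() for part in line.split(",", 2)]
        if len(parts) != 3:
            raise ValueError(f"expected 3 comma-separated columns: {raw!r}")
        specs.append(MetricSpec(*parts))
    return specs


if __name__ == "__main__":
    for spec in parse_counters(SAMPLE):
        print(spec.field, spec.metric_type)
```

Comparing `{s.field for s in parse_counters(...)}` for the two listings would show that the V100 file additionally requests the `DCGM_FI_PROF_*` fields and `DCGM_FI_DEV_NVLINK_BANDWIDTH_L0`.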
FWIW: I replaced dcgm-exporter, including datacenter-gpu-manager, on 10 boxes with nvmex and got rid of ~500 GB of log files.
nvmex is a KISSed agent written in C and thus needs fewer resources: in my case ~6 vs. 24 MiB RSS, ~120 MB vs. 6 GB VSZ, CPU usage ~12+-5% vs. 45+-15%, and on average ~25 W less per box. We do not use MIG, so for now this part and the less useful/static data NVML provides are ignored/not fetched.
Just in case you wanna try it out on Ubuntu: download and install nvmex-10 or nvmex-11 as well as libprom. They probably work on other Linux distros as well if libmicrohttpd.so.12 is installed.
I have the same issue: /var/log/nv-hostengine.log contains a lot of warnings and errors, which has caused the log file to grow to 15 GB!
The GPU model is a Tesla T4.
2021-07-15 08:17:23.624 ERROR [1:24] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4395] [DcgmCacheManager::GetMultipleLatestSamples]
2021-07-15 08:17:23.625 ERROR [1:24] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4395] [DcgmCacheManager::GetMultipleLatestSamples]
2021-07-15 08:17:53.623 WARN [1:13] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4315] [DcgmCacheManager::GetLatestSample]
2021-07-15 08:17:53.623 ERROR [1:13] Error: unable to retrieve PCIe topology information: Feature not supported [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1616] [DcgmHostEngineHandler::ProcessGetTopologyIO]
2021-07-15 08:17:53.624 ERROR [1:13] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/core/DcgmModuleCore.cpp:82] [DcgmModuleCore::ProcessMessage]
2021-07-15 08:17:53.624 ERROR [1:20] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:356] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-07-15 08:17:53.624 ERROR [1:20] ReadNvSwitchStatusAllSwitches() returned No data is available [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:387] [DcgmNs::DcgmModuleNvSwitch::RunOnce]
2021-07-15 08:17:53.624 ERROR [1:22] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1816] [DcgmModuleProfiling::ProcessCoreMessage]
2021-07-15 08:17:53.624 ERROR [1:13] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4395] [DcgmCacheManager::GetMultipleLatestSamples]
2021-07-15 08:17:53.624 ERROR [1:13] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4395] [DcgmCacheManager::GetMultipleLatestSamples]
2021-07-15 08:18:23.623 WARN [1:96] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4315] [DcgmCacheManager::GetLatestSample]
2021-07-15 08:18:23.624 ERROR [1:96] Error: unable to retrieve PCIe topology information: Feature not supported [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1616] [DcgmHostEngineHandler::ProcessGetTopologyIO]
2021-07-15 08:18:23.624 ERROR [1:20] ReadNvSwitchStatusAllSwitches() returned No data is available [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:387] [DcgmNs::DcgmModuleNvSwitch::RunOnce]
2021-07-15 08:18:23.624 ERROR [1:96] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/core/DcgmModuleCore.cpp:82] [DcgmModuleCore::ProcessMessage]
2021-07-15 08:18:23.624 ERROR [1:20] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:356] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-07-15 08:18:23.625 ERROR [1:22] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1816] [DcgmModuleProfiling::ProcessCoreMessage]
2021-07-15 08:18:23.626 ERROR [1:96] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4395] [DcgmCacheManager::GetMultipleLatestSamples]
2021-07-15 08:18:23.626 ERROR [1:96] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4395] [DcgmCacheManager::GetMultipleLatestSamples]
Based on feedback from NVIDIA I set the following environment variable to silence the extra logging:
__DCGM_DBG_LVL=NONE
Now the only output I get in /var/log/nv-hostengine.log is 1 or 2 messages every 30 seconds.
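How the variable gets set depends on how nv-hostengine is started (service unit, container, wrapper script), which this thread does not specify. As one hedged illustration, a Python wrapper could pass it through the child process environment. The variable name and value come from the comment above; the `hostengine_env` and `launch` helpers are hypothetical.

```python
import os
import subprocess


def hostengine_env(base=None):
    """Return a copy of the environment with DCGM debug logging set to NONE."""
    env = dict(os.environ if base is None else base)
    env["__DCGM_DBG_LVL"] = "NONE"  # variable and value as reported in this thread
    return env


def launch(cmd):
    """Start `cmd` (the caller supplies the full command line) with the quiet environment."""
    return subprocess.Popen(cmd, env=hostengine_env())


if __name__ == "__main__":
    # Only prints the setting; call launch([...]) with your own command to use it.
    print("__DCGM_DBG_LVL =", hostengine_env()["__DCGM_DBG_LVL"])
```

The variable has to be present in the environment of the nv-hostengine process itself; exporting it in an unrelated shell has no effect on an already running daemon.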
When running dcgm-exporter as a service, it logs stuff like this on each scrape (in my case every 2s):
Are there any intentions to fix it? Or should dcgm-exporter rather be seen as a hot experimental piece of SW, which should not be used in production?