NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

too many warnings and errors #146

Open jelmd opened 3 years ago

jelmd commented 3 years ago

When running dcgm-exporter as a service, it logs stuff like this on each scrape (in my case every 2s):

2021-01-04 15:34:28.095 WARN  [12466:17166] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.095 WARN  [12466:17166] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.096 WARN  [12466:17166] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.130 ERROR [12466:17166] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]
2021-01-04 15:34:28.130 ERROR [12466:14315] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:341] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-01-04 15:34:28.131 ERROR [12466:14946] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/profiling/DcgmModuleProfiling.cpp:1694] [DcgmModuleProfiling::ProcessCoreMessage]
2021-01-04 15:34:28.142 WARN  [12466:17166] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.142 WARN  [12466:17166] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.142 WARN  [12466:17166] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.151 ERROR [12466:17166] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]
2021-01-04 15:34:28.151 ERROR [12466:14315] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:341] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-01-04 15:34:28.151 ERROR [12466:14946] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/profiling/DcgmModuleProfiling.cpp:1694] [DcgmModuleProfiling::ProcessCoreMessage]
2021-01-04 15:34:28.162 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.162 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.162 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.170 ERROR [12466:15949] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]
2021-01-04 15:34:28.170 ERROR [12466:14315] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:341] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-01-04 15:34:28.170 ERROR [12466:14946] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/profiling/DcgmModuleProfiling.cpp:1694] [DcgmModuleProfiling::ProcessCoreMessage]
2021-01-04 15:34:28.177 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.177 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.177 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.183 ERROR [12466:15949] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]
2021-01-04 15:34:28.183 ERROR [12466:14315] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:341] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-01-04 15:34:28.183 ERROR [12466:14946] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/profiling/DcgmModuleProfiling.cpp:1694] [DcgmModuleProfiling::ProcessCoreMessage]
2021-01-04 15:34:28.188 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.188 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.188 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.193 ERROR [12466:15949] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]
2021-01-04 15:34:28.193 ERROR [12466:14315] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:341] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-01-04 15:34:28.193 ERROR [12466:14946] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/profiling/DcgmModuleProfiling.cpp:1694] [DcgmModuleProfiling::ProcessCoreMessage]
2021-01-04 15:34:28.197 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.197 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.197 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.212 ERROR [12466:15949] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]
2021-01-04 15:34:28.212 ERROR [12466:14315] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:341] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-01-04 15:34:28.212 ERROR [12466:14946] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/profiling/DcgmModuleProfiling.cpp:1694] [DcgmModuleProfiling::ProcessCoreMessage]
2021-01-04 15:34:28.232 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.232 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.232 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.272 ERROR [12466:15949] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]
2021-01-04 15:34:28.272 ERROR [12466:14315] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:341] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-01-04 15:34:28.272 ERROR [12466:14946] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/profiling/DcgmModuleProfiling.cpp:1694] [DcgmModuleProfiling::ProcessCoreMessage]
2021-01-04 15:34:28.284 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.284 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.284 WARN  [12466:15949] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4254] [DcgmCacheManager::GetLatestSample]
2021-01-04 15:34:28.303 ERROR [12466:15949] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]
2021-01-04 15:34:28.303 ERROR [12466:14315] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:341] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-01-04 15:34:28.304 ERROR [12466:14946] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/profiling/DcgmModuleProfiling.cpp:1694] [DcgmModuleProfiling::ProcessCoreMessage]

Are there any intentions to fix it? Or should dcgm-exporter rather be seen as a hot experimental piece of software that should not be used in production?

treydock commented 3 years ago

I too have run into this issue, running the datacenter-gpu-manager-2.0.13-1.x86_64 RPM and dcgm-exporter built from commit 6860bc83e609eb0c4eba6b5eca7af6c02d50d3b3.

jelmd commented 3 years ago

dcgm-exporter/nv-hostengine seems to be very poor software. Totally crap.

dualvtable commented 3 years ago

Hi @jelmd - thanks for reporting this issue. Can you please provide more information on what your system configuration is? Specifically:

  1. what is the underlying NVIDIA driver version?
  2. which tagged version of dcgm-exporter you're using
  3. what GPUs you're running on
  4. what host Linux distribution

This will help us investigate and respond better to your issue. Thanks

treydock commented 3 years ago

@dualvtable As I'm running into this same issue:

  1. 450.80.02
  2. Tag is 2.1.1
  3. In my case it's V100, and I think my P100 systems might have had the same issue; two separate clusters, but everything else is identical
  4. RHEL 7.7

jelmd commented 3 years ago

  1. 418.87.01
  2. 2.0.13 included in https://github.com/NVIDIA/gpu-monitoring-tools/archive/2.1.2.tar.gz
  3. 8 boxes with 8x GeForce RTX 2080 Ti, 1 box with 4x Tesla V100-SXM2-32GB
  4. Ubuntu 18.04

dualvtable commented 3 years ago

Great thanks - and when do you guys see this issue? Is that at container startup or do we start emitting these messages after a while? And one last question - did you observe this in prior releases? I'm trying to determine if this is a regression that we somehow introduced and missed in our test plan.

Thanks again.

treydock commented 3 years ago

Looking at our logs, the messages happen every ~2 seconds, with many messages per iteration. We had to deploy logrotate to aggressively rotate /var/log/nv-hostengine.log, as we had systems whose local disks were filling up from this one log file. I do not recall this being an issue prior to the 2.x release, but the exact exporter version where this began is unclear to me. Looking at the yum RPM history for DCGM, it looks like when we upgraded to 2.1.1 of the exporter we also upgraded to datacenter-gpu-manager RPM 2.0.13. Prior to 2.0.13 it looks like we had 1.7.2; the exporter version at that time is less clear because our initial deployment was a fork. The commit I authored and was running locally was ed27d32dd12000dc10f78e4aea632367eae54c17, merged in merge request 25 on GitLab, I believe.
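
For anyone needing the same stopgap, here is a minimal sketch of a logrotate rule written from a shell helper. The helper name `write_hostengine_logrotate`, the 100M size threshold, and the retention count are illustrative assumptions, not the configuration actually deployed above; `copytruncate` is chosen on the assumption that nv-hostengine keeps the log file open while running.

```shell
# Hypothetical helper: write a logrotate rule for the nv-hostengine log.
# $1 = destination file (assumed default: /etc/logrotate.d/nv-hostengine)
write_hostengine_logrotate() {
    dest="${1:-/etc/logrotate.d/nv-hostengine}"
    # Rotate once the file exceeds 100M, keep 2 compressed copies,
    # and truncate in place instead of moving the open file.
    cat > "$dest" <<'EOF'
/var/log/nv-hostengine.log {
    size 100M
    rotate 2
    compress
    missingok
    notifempty
    copytruncate
}
EOF
}
```

Run as root, the rule can be dry-run checked with `logrotate -d /etc/logrotate.d/nv-hostengine`. The `size` condition is only evaluated when logrotate itself runs, so at the log rates reported in this thread it may need to be scheduled more often than the usual daily run.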

jelmd commented 3 years ago

Same thing here - produces trash all the time. Log size is ~ 4640 B/s, so ~ 383 MiB/d =8-(
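
The per-day figure follows directly from the per-second rate. A small sketch of the conversion, where the function name `log_mib_per_day` is made up for illustration and integer division rounds down:

```shell
# Convert a log growth rate in bytes/second to whole MiB/day.
# 86400 = seconds per day; 1048576 = bytes per MiB.
log_mib_per_day() {
    echo $(( $1 * 86400 / 1048576 ))
}

log_mib_per_day 4640   # prints 382
```

That is in line with the rough ~383 MiB/d quoted above, i.e. well over 100 GiB per year per box if the log is never rotated.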

llewxam-kache commented 3 years ago

Chiming in, while troubleshooting another issue, I see that my nv-hostengine.log is massive! 14.2 GiB nv-hostengine.log filled with:

2021-03-05 15:07:49.246 ERROR [1508:1552] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]
2021-03-05 15:07:49.246 ERROR [1508:2145] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:341] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-03-05 15:07:49.248 WARN  [1508:1552] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:49.248 WARN  [1508:1552] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:49.248 WARN  [1508:1552] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:49.249 ERROR [1508:1552] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]
2021-03-05 15:07:49.249 ERROR [1508:2145] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:341] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-03-05 15:07:49.251 WARN  [1508:1552] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:49.251 WARN  [1508:1552] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:49.251 WARN  [1508:1552] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:49.253 ERROR [1508:1552] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/core/DcgmModuleCore.cpp:45] [DcgmModuleCore::ProcessMessage]
2021-03-05 15:07:49.253 ERROR [1508:2145] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_0-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:341] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-03-05 15:07:51.235 WARN  [1508:2461] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:51.236 WARN  [1508:2461] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
2021-03-05 15:07:51.236 WARN  [1508:2461] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_0-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3929] [DcgmCacheManager::GetLatestSample]
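
To see which messages dominate a log like this, a quick tally per level and reporting function can help. This is only a sketch: the helper name `summarize_hostengine_log` is hypothetical, and it assumes every line follows the layout shown in this thread.

```shell
# Count log lines per (level, reporting function).
# Assumes the line layout shown in this thread:
#   DATE TIME LEVEL [pid:tid] message [source-file:line] [function]
# so field 3 is the level and the last field is the bracketed function name.
# Reads the files given as arguments, or stdin if none are given.
summarize_hostengine_log() {
    awk 'NF >= 5 { print $3, $NF }' "$@" | sort | uniq -c | sort -rn
}
```

For example, `summarize_hostengine_log /var/log/nv-hostengine.log | head` lists the most frequent message sources first.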

nikkon-dev commented 3 years ago

Hi, could you also provide the set of metrics you are collecting using the dcgm-exporter?

treydock commented 3 years ago

The files we use for the exporter were copied from this repo at the time when the exporter was deployed:

Cluster with V100s having this issue:

[root@pitzer-rw02 ~]# cat /etc/dcgm-exporter/dcp-metrics-included.csv 
# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message

# Clocks,,
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power,,
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE,,
DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization (the sample period varies depending on the product),,
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

# Errors and violations,,
DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# ECC,,
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

# Retired pages,,
DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

# NVLink,,
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes
DCGM_FI_DEV_NVLINK_BANDWIDTH_L0,               counter, The number of bytes of active NVLink rx or tx data including both header and payload.

# DCP metrics,,
DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES,      counter, The number of bytes of active pcie tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES,      counter, The number of bytes of active pcie rx data including both header and payload.
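
A compact way to report the active metric set from a file like the one above is to print only the field names. This is a sketch with a made-up helper name, `list_dcgm_fields`; it relies only on the rule stated in the file header that lines starting with '#' are comments.

```shell
# List the DCGM field names enabled in a dcgm-exporter counters CSV.
# Per the file's own header, lines starting with '#' are comments;
# blank lines are skipped as well. Stray whitespace around the field
# name (e.g. "DCGM_FI_DEV_DEC_UTIL ,") is stripped.
list_dcgm_fields() {
    awk -F',' '!/^#/ && NF >= 2 { gsub(/[ \t]/, "", $1); print $1 }' "$@"
}
```

For example, `list_dcgm_fields /etc/dcgm-exporter/dcp-metrics-included.csv | grep '^DCGM_FI_PROF_'` shows just the profiling (DCP) fields.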

Cluster with P100 having this issue:

[root@owens-rw01 ~]# cat /etc/dcgm-exporter/default-counters.csv 
# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message

# Clocks,,
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power,,
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE,,
DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization (the sample period varies depending on the product),,
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

# Errors and violations,,
DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# ECC,,
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

# Retired pages,,
DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

# NVLink,,
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes

jelmd commented 3 years ago

FWIW: replaced dcgm-exporter incl. datacenter-gpu-manager on 10 boxes with nvmex and got rid of ~ 500 GB of log files.

nvmex is a KISSed agent written in C and thus needs fewer resources: in my case ~ 6 vs. 24 MiB RSS, ~ 120 MB vs. 6 GB VSZ, and CPU usage of ~ 12+-5% vs. 45+-15%, with on average ~ 25 W less per box. We do not use MIG, so for now that part, and the less useful/static data NVML provides, are ignored/not fetched.

Just in case you wanna try it out on Ubuntu: download and install nvmex-10 or nvmex-11 as well as libprom. They probably work on other Linux distros as well if libmicrohttpd.so.12 is installed.

IsQiao commented 3 years ago

I have the same issue: /var/log/nv-hostengine.log has many warnings and errors, which caused the log file to grow to 15 GB. The GPU model is Tesla T4.

2021-07-15 08:17:23.624 ERROR [1:24] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4395] [DcgmCacheManager::GetMultipleLatestSamples]
2021-07-15 08:17:23.625 ERROR [1:24] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4395] [DcgmCacheManager::GetMultipleLatestSamples]
2021-07-15 08:17:53.623 WARN  [1:13] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4315] [DcgmCacheManager::GetLatestSample]
2021-07-15 08:17:53.623 ERROR [1:13] Error: unable to retrieve PCIe topology information: Feature not supported [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1616] [DcgmHostEngineHandler::ProcessGetTopologyIO]
2021-07-15 08:17:53.624 ERROR [1:13] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/core/DcgmModuleCore.cpp:82] [DcgmModuleCore::ProcessMessage]
2021-07-15 08:17:53.624 ERROR [1:20] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:356] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-07-15 08:17:53.624 ERROR [1:20] ReadNvSwitchStatusAllSwitches() returned No data is available [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:387] [DcgmNs::DcgmModuleNvSwitch::RunOnce]
2021-07-15 08:17:53.624 ERROR [1:22] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1816] [DcgmModuleProfiling::ProcessCoreMessage]
2021-07-15 08:17:53.624 ERROR [1:13] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4395] [DcgmCacheManager::GetMultipleLatestSamples]
2021-07-15 08:17:53.624 ERROR [1:13] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4395] [DcgmCacheManager::GetMultipleLatestSamples]
2021-07-15 08:18:23.623 WARN  [1:96] Fixing entityGroupId for global field [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4315] [DcgmCacheManager::GetLatestSample]
2021-07-15 08:18:23.624 ERROR [1:96] Error: unable to retrieve PCIe topology information: Feature not supported [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1616] [DcgmHostEngineHandler::ProcessGetTopologyIO]
2021-07-15 08:18:23.624 ERROR [1:20] ReadNvSwitchStatusAllSwitches() returned No data is available [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:387] [DcgmNs::DcgmModuleNvSwitch::RunOnce]
2021-07-15 08:18:23.624 ERROR [1:96] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/core/DcgmModuleCore.cpp:82] [DcgmModuleCore::ProcessMessage]
2021-07-15 08:18:23.624 ERROR [1:20] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:356] [DcgmNs::DcgmModuleNvSwitch::ProcessCoreMessage]
2021-07-15 08:18:23.625 ERROR [1:22] Unknown subcommand: 3 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1816] [DcgmModuleProfiling::ProcessCoreMessage]
2021-07-15 08:18:23.626 ERROR [1:96] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4395] [DcgmCacheManager::GetMultipleLatestSamples]
2021-07-15 08:18:23.626 ERROR [1:96] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4395] [DcgmCacheManager::GetMultipleLatestSamples]

treydock commented 3 years ago

Based on feedback from NVIDIA I set the following environment variable to silence the extra logging:

__DCGM_DBG_LVL=NONE

Now the only logs I get in /var/log/nv-hostengine.log are 1 or 2 messages every 30 seconds.
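
How the variable gets set depends on how the process writing /var/log/nv-hostengine.log is started, which this thread does not spell out. As a sketch, assuming it runs under systemd, a drop-in could carry the setting. The helper name `write_dcgm_loglevel_dropin`, the drop-in file name, and the unit name you pass in are all assumptions to adapt to your own packaging.

```shell
# Hypothetical helper: create a systemd drop-in that sets __DCGM_DBG_LVL=NONE
# for the unit that starts the process writing /var/log/nv-hostengine.log.
# $1 = unit name (assumption: depends on how DCGM was packaged on your host)
# $2 = systemd config root (default: /etc/systemd/system)
write_dcgm_loglevel_dropin() {
    unit="$1"
    root="${2:-/etc/systemd/system}"
    mkdir -p "$root/$unit.d" || return 1
    printf '[Service]\nEnvironment=__DCGM_DBG_LVL=NONE\n' \
        > "$root/$unit.d/10-dcgm-log-level.conf"
}
```

After writing the drop-in as root, `systemctl daemon-reload` followed by a restart of that unit should pick up the variable. Keep in mind that silencing the debug output this way may also hide messages you would want to see when diagnosing a real problem.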