NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

dcgm-exporter missing many metrics after upgrade #143

Open huww98 opened 3 years ago

huww98 commented 3 years ago

I've updated our dcgm-exporter, deployed directly in Docker, to tag 2.0.13-2.1.2-ubuntu20.04, but many metrics are missing.

It only exports 18 metrics, compared with 34 in tag 1.7.2. Is this expected, or is it a bug?

This is the command we use:

docker run -d --gpus all -p 9400:9400 --name dcgm-exporter --restart unless-stopped nvidia/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04

The following metrics are missing. I do see them enabled in default-counters.csv though.

DCGM_FI_DEV_MEMORY_TEMP
DCGM_FI_DEV_PCIE_TX_THROUGHPUT
DCGM_FI_DEV_PCIE_RX_THROUGHPUT
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION
DCGM_FI_DEV_LOW_UTIL_VIOLATION
DCGM_FI_DEV_RELIABILITY_VIOLATION
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL
DCGM_FI_DEV_RETIRED_SBE
DCGM_FI_DEV_RETIRED_DBE
DCGM_FI_DEV_RETIRED_PENDING
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL
DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL
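
For anyone reproducing this, a quick way to compare the metric names exported by two image tags is to dump them from the /metrics endpoint and diff the lists. A minimal sketch, assuming the exporter listens on the default port 9400 as in the command above (the output file names are just placeholders):

curl -s localhost:9400/metrics | grep -oE '^DCGM_FI_[A-Z_0-9]+' | sort -u > metrics-2.1.2.txt
# repeat with the 1.7.2 container running, then compare:
diff metrics-1.7.2.txt metrics-2.1.2.txt
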
shatil commented 3 years ago

After upgrading from 2.0.0-rc12 to 2.1.2 (building from source using the tags in the Git repo), I'm missing these:

DCGM_FI_DEV_PCIE_RX_THROUGHPUT
DCGM_FI_DEV_PCIE_TX_THROUGHPUT

The rest appear to be there, but I haven't really compared the values to see if they end up in the same ballpark.

asaulys commented 3 years ago

Based on https://github.com/NVIDIA/gpu-monitoring-tools/compare/2.0.0-rc.12...master, there are some changes related to metrics: filtering zero values and masking others "based on constant". It's worth looking into these to see whether they're causing the missing metrics. IIRC there were a few that would never display real values.

jfolz commented 3 years ago

One of our machines "involuntarily" updated the dcgm-exporter Docker image, and we're now missing some metrics like DCGM_FI_DEV_GPU_UTIL, which is kind of crucial.

Here's the full list:

DCGM_FI_DEV_GPU_UTIL 
DCGM_FI_DEV_POWER_VIOLATION
DCGM_FI_DEV_THERMAL_VIOLATION   
DCGM_FI_DEV_SYNC_BOOST_VIOLATION        
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION   
DCGM_FI_DEV_LOW_UTIL_VIOLATION  
DCGM_FI_DEV_RELIABILITY_VIOLATION
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL   
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL   
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL   
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL   
DCGM_FI_DEV_RETIRED_SBE 
DCGM_FI_DEV_RETIRED_DBE 
DCGM_FI_DEV_RETIRED_PENDING     
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL   
DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL   
DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL 
DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL

We also gained these:

DCGM_FI_DEV_VGPU_LICENSE_STATUS
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS
DCGM_FI_DEV_ROW_REMAP_FAILURE
nikkon-dev commented 3 years ago

Thank you for using the dcgm-exporter project and reporting this issue. We are sad to hear your scenarios were negatively affected by our changes. Unfortunately, we deliberately changed the set of metrics enabled by default. I'd recommend providing your own .csv configuration file containing only the metrics you actually need and use. There were several reasons behind this change:

1. The previous set included ECC metrics that are very expensive to collect. We got multiple complaints about 100% CPU core utilization, mainly because of those ECC metrics. We also found that most users do not provide their own .csv configuration file with only the necessary metrics; instead, they filter out or simply ignore the metrics they are not interested in. That behavior leads to significant performance hits, since no metric collection is free and some metrics need significant GPU resources to collect.

2. Some metrics previously enabled by default are deprecated and should be replaced with new ones. For example, DCGM_FI_DEV_GPU_UTIL should be replaced with DCGM_FI_PROF_GR_ENGINE_ACTIVE, DCGM_FI_PROF_SM_ACTIVE, or DCGM_FI_PROF_SM_OCCUPANCY, depending on your needs; DCGM_FI_DEV_PCIE_{RX/TX}_THROUGHPUT may be replaced with DCGM_FI_PROF_PCIE_{RX/TX}_BYTES; and the very CPU-heavy ECC metrics may be replaced with DCGM_FI_DEV_XID_ERRORS.

3. The previous default set had almost all DCGM_FI_PROF_* metrics enabled, which created unnecessary load on GPUs: not all PROF metrics can be collected together in a single pass.

Considering all the above, we changed the default .csv configuration file and kept only a basic set of metrics that does not put unnecessary load on users' systems. We urge you to provide your own .csv configuration file with a carefully selected set of the metrics you need to monitor. We have not deleted the metrics themselves, so you can still collect the previous set if you choose to ignore my recommendations about the deprecated ones.
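
To follow that recommendation with the Docker deployment from earlier in this thread, a minimal sketch might look like the following. It assumes the -f (collectors file) flag of this dcgm-exporter version accepts a path to a counters file in the same format as default-counters.csv, and that the image's entrypoint forwards arguments to dcgm-exporter; the file name and the selection of metrics are only examples:

# custom-counters.csv format: DCGM field, Prometheus metric type, help text
cat > /etc/dcgm-exporter/custom-counters.csv <<'EOF'
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active.
DCGM_FI_PROF_PCIE_TX_BYTES, counter, PCIe bytes transmitted.
DCGM_FI_PROF_PCIE_RX_BYTES, counter, PCIe bytes received.
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
EOF

docker run -d --gpus all -p 9400:9400 --name dcgm-exporter --restart unless-stopped \
  -v /etc/dcgm-exporter/custom-counters.csv:/etc/dcgm-exporter/custom-counters.csv:ro \
  nvidia/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04 \
  -f /etc/dcgm-exporter/custom-counters.csv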

jfolz commented 3 years ago

@nikkon-dev thanks for the update. In the meantime we re-enabled DCGM_FI_DEV_GPU_UTIL via the collectors config, but I fully agree that it's a bad metric that doesn't reflect actual utilization (the GPU could be computing 1+1 over and over for all we know). Ideally I would like to transition to DCGM_FI_PROF_SM_OCCUPANCY, as long as that does not incur a performance hit. Some advice on the performance impact of individual metrics would be very welcome :)

Effectively, the issue we had was one of documentation. We deploy the dcgm-exporter Docker image as a systemd service as defined by deepops. It pulls the newest image whenever the service starts; that's the root problem in my book, and we're looking at options for how to fix that. From our point of view, metrics we needed just suddenly disappeared and we couldn't figure out on our own how to get them back. Looking through the commits, it was version 2.3.0 that disabled DCGM_FI_DEV_GPU_UTIL in the default config. The release notes only say "Enable 1.x backwards compatibility, refactor default watch fields". A changelog, or more complete release notes that actually reflect what changed, would have helped a lot.
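
One way to stop a unit like that from silently picking up a new build is to reference the image by digest instead of by tag. A sketch, not specific to deepops; the tag is just the one used earlier in this thread, and the digest placeholder has to be filled in from the inspect output:

# resolve the digest of the image that is currently known to work
docker inspect --format '{{index .RepoDigests 0}}' nvidia/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04

# then run (or configure the unit's ExecStart) against that digest, so a restart can never pull a different build
docker run -d --gpus all -p 9400:9400 --name dcgm-exporter --restart unless-stopped \
  nvidia/dcgm-exporter@sha256:<digest-from-above>
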

mattf commented 3 years ago

@nikkon-dev what recommendation do you have for people using https://grafana.com/grafana/dashboards/12239 ?

nikkon-dev commented 3 years ago

@nikkon-dev what recommendation do you have for people using https://grafana.com/grafana/dashboards/12239 ?

@mattf, thank you for pointing to that Grafana dashboard. I reached out to the author, and we will update the dashboard according to the current set of enabled-by-default metrics. Going forward, we want to investigate whether such dashboards could be autogenerated from the dcgm-exporter configuration.

anannaya commented 3 years ago

@nikkon-dev Do you have the updated dashboard yet?

nikkon-dev commented 3 years ago

@anannaya,

We updated the dashboard to reflect the current state of the default dcgm-exporter configuration. Keep in mind that this is not a robust solution, and any change in the set of enabled metrics may break the dashboard. We are considering a long-term solution, but for now I would recommend specifying the CSV config explicitly rather than relying on the default one we provide as an example; we may decide to change it at any moment in the future.

WBR, Nik
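
As a sanity check before switching dashboard panels over to the new defaults, it can help to confirm that the replacement metric is actually being scraped. A sketch, assuming a Prometheus server reachable on localhost:9090 and that DCGM_FI_PROF_GR_ENGINE_ACTIVE (a 0-1 ratio) is the stand-in for the old DCGM_FI_DEV_GPU_UTIL percentage:

# query the replacement metric, scaled to a percentage like the old GPU_UTIL panels
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=DCGM_FI_PROF_GR_ENGINE_ACTIVE * 100'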