DataDog / integrations-core

Core integrations of the Datadog Agent
BSD 3-Clause "New" or "Revised" License

dcgm-exporter README docker instructions contains incorrect commands and information #17370

Open mbacchi opened 6 months ago

mbacchi commented 6 months ago

The dcgm-exporter README.md has incorrect information about running dcgm-exporter in Docker. There are two major problems with these instructions that we would appreciate you fixing.

  1. In the Docker section, you say we should create a counters CSV file containing a specific set of fields. Unfortunately, using that counters file with the most recent version of the dcgm-exporter Docker image (3.3.5-3.4.1) causes a segmentation violation:

    time="2024-04-09T21:14:41Z" level=info msg="Initializing system entities of type: CPU"
    SIGSEGV: segmentation violation

    If I provide no counters.csv file to the docker command it works fine. (For example using no -v argument in the recommended command in your step 2 here.)

  2. Again in your recommended docker run command, you suggest using -e DCGM_EXPORTER_INTERVAL=3, which tells dcgm-exporter to read GPU metrics every 3 milliseconds. This is apparently too fast and causes high CPU usage, which I found out when I opened this issue in the dcgm-exporter repository. The default is -e DCGM_EXPORTER_INTERVAL=30000, which does not cause high CPU usage on the system.
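
To put the two interval values in perspective, here is a quick back-of-the-envelope sketch. It assumes the value is interpreted in milliseconds, as described above; `polls_per_second` is a hypothetical helper name used only for illustration and is not part of dcgm-exporter.

```python
def polls_per_second(interval_ms: int) -> float:
    """Convert a collection interval in milliseconds to polls per second."""
    return 1000 / interval_ms


# Compare the README's suggested value with the default mentioned above.
for interval_ms in (3, 30000):
    rate = polls_per_second(interval_ms)
    print(f"DCGM_EXPORTER_INTERVAL={interval_ms}: {rate:.3f} polls/s")
```

Under that assumption, an interval of 3 polls roughly 333 times per second, while 30000 polls once every 30 seconds, a difference of four orders of magnitude.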

Together, these two issues make dcgm-exporter unusable when following your suggested commands. Please fix this documentation.

steveny91 commented 3 weeks ago

@mbacchi Sorry for the delay. We recently addressed point 2 here: https://github.com/DataDog/integrations-core/pull/18658

In internal testing, we saw better stability in CPU usage with a higher exporter interval, as you mentioned. However, we wanted to try aligning the interval more closely with the Datadog Agent's scrape interval to help prevent stale data.

For point 1: I couldn't replicate this behavior with 3.3.8-3.6.0-ubuntu22.04 or 3.3.5-3.4.0-ubuntu22.04. Both images seem to spin up fine for me, and I can't get the segfault you encountered:

    time="2024-09-27T18:11:44Z" level=info msg="Initializing system entities of type: CPU"
    time="2024-09-27T18:11:44Z" level=info msg="Not collecting CPU metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
    time="2024-09-27T18:11:44Z" level=info msg="Initializing system entities of type: CPU Core"
    time="2024-09-27T18:11:44Z" level=info msg="Not collecting CPU Core metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"