Open mbacchi opened 6 months ago
@mbacchi Sorry for the delay. We recently addressed point 2 here: https://github.com/DataDog/integrations-core/pull/18658
In internal testing we saw more stable CPU usage with a higher exporter interval, as you mentioned. However, we wanted to align the interval more closely with the interval the Datadog Agent scrapes on, to help prevent stale data.
For point 1: I couldn't replicate this behavior with `3.3.8-3.6.0-ubuntu22.04` or `3.3.5-3.4.0-ubuntu22.04`. For me they seem to spin up fine, and I can't reproduce the segfault you encountered:
time="2024-09-27T18:11:44Z" level=info msg="Initializing system entities of type: CPU"
time="2024-09-27T18:11:44Z" level=info msg="Not collecting CPU metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-09-27T18:11:44Z" level=info msg="Initializing system entities of type: CPU Core"
time="2024-09-27T18:11:44Z" level=info msg="Not collecting CPU Core metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
The dcgm-exporter README.md has incorrect information about running dcgm-exporter in Docker. There are two major problems with these instructions, which we would appreciate you fixing.
1. In the Docker section, you indicate that we should create a counters CSV file with specific fields that you suggest should be used. Unfortunately, using that counters file with the most recent version of the dcgm-exporter Docker image (3.3.5-3.4.1) causes a segmentation violation. If I provide no counters.csv file to the docker command it works fine. (For example, using no `-v` argument in the recommended command in your step 2 here.)
2. Again in your recommended `docker run` command, you suggest using `-e DCGM_EXPORTER_INTERVAL=3`, which tells dcgm-exporter to read GPU metrics every 3 milliseconds. This is apparently too fast and causes high CPU usage, which I found out when I opened this issue in the dcgm-exporter repository. The default is `-e DCGM_EXPORTER_INTERVAL=30000`, which does not cause a high CPU usage problem on the system.

These two issues cause the dcgm-exporter to be unusable due to your suggested commands and usage. Please fix this documentation.
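To put the two interval values in perspective, here is a rough sketch of the arithmetic, assuming (as described above) that `DCGM_EXPORTER_INTERVAL` is expressed in milliseconds. The helper name `collections_per_minute` is made up for illustration and is not part of dcgm-exporter.

```python
def collections_per_minute(interval_ms: int) -> float:
    """Return how many metric collections happen per minute.

    interval_ms is the collection interval in milliseconds (assumed unit
    of DCGM_EXPORTER_INTERVAL, per the discussion in this issue).
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return 60_000 / interval_ms


# Compare the value suggested in the docs (3) with the default (30000).
for interval_ms in (3, 30_000):
    rate = collections_per_minute(interval_ms)
    print(f"DCGM_EXPORTER_INTERVAL={interval_ms}: {rate:,.0f} collections per minute")
```

With an interval of 3 the exporter would attempt 20,000 collections per minute, versus 2 per minute at the default of 30000, which is consistent with the high CPU usage reported here.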