aws-observability / helm-charts

The AWS Observability Helm Charts repository contains Helm charts that make it easy to set up the CloudWatch Agent and other collection agents, which collect telemetry data such as metrics, logs, and traces and send it to AWS monitoring services.
Apache License 2.0

[DO NOT MERGE] update dcgm image to the latest and fix dcgm pod crashing with OOM #66

Closed: movence closed this 4 months ago

movence commented 4 months ago

Description of changes:

I observed the memory consumption of the DCGM pods on a cluster with a mix of node sizes (g4dn.12xl, g5.12xl, p3-16xl, p3-8xl and p3-2xl). With the latest DCGM exporter image, memory usage appears to stabilize at about 230 MiB per pod.

kubectl top pods -n amazon-cloudwatch --sort-by memory | grep dcgm | head -10
dcgm-exporter-bhpnz                                               1m           230Mi
dcgm-exporter-gw7gc                                               1m           229Mi
dcgm-exporter-jfgzx                                               1m           229Mi
dcgm-exporter-vmv2x                                               1m           229Mi
dcgm-exporter-x792q                                               1m           228Mi
dcgm-exporter-5pfcn                                               1m           228Mi
dcgm-exporter-d6kv7                                               1m           228Mi
dcgm-exporter-7s28r                                               2m           228Mi
dcgm-exporter-6vn76                                               2m           228Mi
dcgm-exporter-b269q                                               4m           228Mi
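
The readings above could be turned into a candidate memory limit with a small script. The following is only a sketch: the function names and the 1.5x headroom factor are assumptions made for illustration, not values taken from this pull request or from the chart. It parses `kubectl top pods` style rows (name, CPU in millicores, memory in Mi), finds the peak, and scales it by the headroom factor.

```python
import math

# Sample rows copied from the `kubectl top pods` output in this description.
SAMPLE = """\
dcgm-exporter-bhpnz   1m   230Mi
dcgm-exporter-gw7gc   1m   229Mi
dcgm-exporter-jfgzx   1m   229Mi
dcgm-exporter-vmv2x   1m   229Mi
dcgm-exporter-x792q   1m   228Mi
dcgm-exporter-5pfcn   1m   228Mi
dcgm-exporter-d6kv7   1m   228Mi
dcgm-exporter-7s28r   2m   228Mi
dcgm-exporter-6vn76   2m   228Mi
dcgm-exporter-b269q   4m   228Mi
"""


def parse_top_output(text):
    """Return (pod_name, cpu_millicores, memory_mib) for each data row.

    Rows that do not look like '<name> <N>m <N>Mi' (for example a header
    line) are skipped rather than raising an error.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        name, cpu, mem = parts
        if not (cpu.endswith("m") and mem.endswith("Mi")):
            continue
        if not (cpu[:-1].isdigit() and mem[:-2].isdigit()):
            continue
        rows.append((name, int(cpu[:-1]), int(mem[:-2])))
    return rows


def suggest_memory_limit_mib(rows, headroom=1.5):
    """Peak observed memory times a headroom factor, rounded up to whole Mi.

    The default headroom of 1.5 is a hypothetical choice for this sketch.
    """
    if not rows:
        raise ValueError("no pod rows to analyse")
    peak = max(mem for _, _, mem in rows)
    return math.ceil(peak * headroom)


if __name__ == "__main__":
    pods = parse_top_output(SAMPLE)
    print("pods:", len(pods))
    print("suggested limit (Mi):", suggest_memory_limit_mib(pods))
```

For the ten rows listed here the peak is 230 Mi, so a 1.5x headroom gives 345 Mi. A real limit should be chosen from measurements taken under load and over a longer window than a single `kubectl top` snapshot.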

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.