The AWS Observability Helm Charts repository contains Helm charts that make it easy to set up the CloudWatch Agent and other collection agents, which collect telemetry such as metrics, logs, and traces and send it to AWS monitoring services.
Apache License 2.0
[DO NOT MERGE] update dcgm image to the latest and fix dcgm pod crashing with OOM #66
Description of changes:
Update the DCGM Exporter image to the latest version, 3.3.6-3.4.2-ubuntu22.04.
Increase the memory limit for the DCGM Exporter DaemonSet to 500MB to fix the OOM crash (exit code 137).
Observed across a mix of node sizes (g4dn.12xlarge, g5.12xlarge, p3.16xlarge, p3.8xlarge, and p3.2xlarge), DCGM pod memory utilization appears to stabilize at about 230MB with the latest DCGM Exporter image.
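For illustration only, the two changes described above might look roughly like the following Helm values override. This is a sketch, not the actual diff: the `dcgmExporter` key layout and the `nvcr.io/nvidia/k8s/dcgm-exporter` repository path are assumptions about how the chart is organized, and the `500Mi` unit is an assumed reading of the "500MB" stated in the description. Only the image tag is taken from the PR text.

```yaml
# Hypothetical values override; key names are assumptions, not confirmed by this PR.
dcgmExporter:
  image:
    repository: nvcr.io/nvidia/k8s/dcgm-exporter  # assumed registry path
    tag: 3.3.6-3.4.2-ubuntu22.04                  # version named in this PR
  resources:
    limits:
      # The PR text says "500MB"; whether the chart uses Mi or M is an assumption.
      memory: 500Mi
```

With observed usage settling at about 230MB, a limit near 500MB leaves roughly twice the steady-state footprint as headroom. Exit code 137 corresponds to the container being terminated with SIGKILL (128 + 9), which is what Kubernetes reports when a container exceeds its memory limit.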
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.