Hi! I am using v1.7.0.
On a cluster of g5.8xlarge instances, GPU metrics are exported successfully, but when switching to a cluster of p4d.24xlarge instances, each DCGM exporter pod runs out of memory roughly every 10 minutes and is OOM-killed (exit code 137). This causes gaps in the published GPU metrics, and it also seems to be causing other pods using the GPUs to crash.
Currently the memory limit for the DCGM exporter is hardcoded at 250Mi:
https://github.com/aws-observability/helm-charts/blob/main/charts/amazon-cloudwatch-observability/templates/linux/dcgm-exporter-daemonset.yaml#L32
Raising this memory limit to 500Mi using kubectl patch resolves the issues I am seeing with my setup.
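For reference, this is roughly the workaround I am applying. The DaemonSet name and namespace (`dcgm-exporter` in `amazon-cloudwatch`) are assumptions based on my install, so adjust them to match yours:

```sh
# Workaround sketch: bump the DCGM exporter memory limit in place.
# DaemonSet name and namespace are assumptions -- verify with:
#   kubectl get daemonsets -A | grep -i dcgm
kubectl patch daemonset dcgm-exporter -n amazon-cloudwatch --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "500Mi"}]'
```

Note that this is only a stopgap, since a helm upgrade will revert the DaemonSet back to the hardcoded limit.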
I think the memory limit set for the DCGM exporter should be made configurable via the chart's values file, along the lines of the sketch below.
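Something like this is what I have in mind; the key names are only a sketch and not the chart's current schema:

```yaml
# Hypothetical values.yaml addition -- key names are a sketch, not the existing chart schema.
dcgmExporter:
  resources:
    limits:
      memory: 500Mi   # default could remain at the current 250Mi
```

The dcgm-exporter-daemonset.yaml template could then render this block (for example via `toYaml` on the new value), falling back to the current 250Mi limit when nothing is set, so existing installs keep their behavior.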
Best,