DCGM Exporter Memory Limit Not Configurable

Hi! I am using v1.7.0.

On a cluster of g5.8xlarge instances, the GPU metrics are exported successfully. After switching to a cluster of p4d.24xlarge instances, each DCGM exporter pod runs out of memory about every 10 minutes and is killed with exit code 137. This leaves gaps in the published GPU metrics, and it also seems to be causing other pods that use the GPUs to crash.

Currently the memory limit for the DCGM exporter is hardcoded at 250Mi: https://github.com/aws-observability/helm-charts/blob/main/charts/amazon-cloudwatch-observability/templates/linux/dcgm-exporter-daemonset.yaml#L32

Raising this limit to 500Mi with kubectl patch resolves the issues I am seeing with my setup.
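
For anyone who needs the same workaround, here is a rough sketch of a strategic merge patch that raises the limit. The file name, the container name `dcgm-exporter`, and the placeholders in the command are assumptions; check them against your cluster with `kubectl get daemonset -A` before applying.

```yaml
# dcgm-memory-patch.yaml (hypothetical file name)
# Strategic merge patch: containers are matched by name, so only the
# memory limit of the matching container is changed.
spec:
  template:
    spec:
      containers:
        - name: dcgm-exporter   # assumed container name, verify in your cluster
          resources:
            limits:
              memory: 500Mi     # raised from the hardcoded 250Mi
```

It could be applied with something like `kubectl patch daemonset <daemonset-name> -n <namespace> --patch-file dcgm-memory-patch.yaml`. This is a manual change made outside the chart, which is why a chart-level setting would be preferable.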

I think the memory limit for the DCGM exporter should be configurable via the Values file instead of being hardcoded in the template.
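
As a sketch of what that could look like, the values below use a hypothetical key layout (`dcgmExporter.resources`) that is not part of the current chart, with the default kept at 250Mi so existing installs behave the same. The template line in the comment is likewise only an illustration, and its indentation would have to match the daemonset template.

```yaml
# values.yaml (hypothetical keys, shown only to illustrate the request)
dcgmExporter:
  resources:
    limits:
      memory: 250Mi   # current hardcoded value, kept as the default

# The daemonset template could then render the block instead of a literal,
# for example (illustrative only):
#   resources:
#     {{- toYaml .Values.dcgmExporter.resources | nindent 12 }}
```

With something like this in place, users on larger instance types could override the limit in their own values file, for example by setting `dcgmExporter.resources.limits.memory` to 500Mi.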

Best,

