aws / eks-charts

Amazon EKS Helm chart repository
Apache License 2.0

GPU metrics not collected by aws-cloudwatch-metrics #1097

Open claudio-vellage opened 2 months ago

claudio-vellage commented 2 months ago

Describe the bug

I've set up aws-cloudwatch-metrics through the Helm chart linked here. I've also set image.tag=1.300037.0b583, because according to this link GPU metrics should be collected by default starting from 1.300034.0.

I've also manually updated the RBAC permissions to include services (https://github.com/aws/eks-charts/pull/1095) and explicitly set enhancedContainerInsights.enabled=true (and fixed the documentation for this value here).

I still can't see the metrics in ContainerInsights, and I'm starting to believe that I have to add additional settings to the ConfigMap to explicitly enable GPU metrics collection. Can someone confirm this, or should GPU metrics collection work out of the box?

Steps to reproduce

Install aws-cloudwatch-metrics on an EKS cluster with GPU nodes (e.g. g5.xlarge). Check CloudWatch for GPU metrics.
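For reference, the install step with the overrides described above can be sketched roughly as follows. This is a minimal sketch, not the exact command I ran: the release name, the `eks` repo alias, the `amazon-cloudwatch` namespace, the `clusterName` value and the cluster name `my-gpu-cluster` are assumptions for illustration; only `image.tag` and `enhancedContainerInsights.enabled` come from this report.

```python
import shlex


def helm_install_args(cluster_name, image_tag="1.300037.0b583",
                      namespace="amazon-cloudwatch"):
    """Build the argv for installing the aws-cloudwatch-metrics chart.

    Assumptions (not taken from the issue): the chart repo was added under
    the alias "eks", the release is named "aws-cloudwatch-metrics", and the
    chart takes the cluster name via the "clusterName" value.
    """
    return [
        "helm", "upgrade", "--install", "aws-cloudwatch-metrics",
        "eks/aws-cloudwatch-metrics",
        "--namespace", namespace,
        "--set", f"clusterName={cluster_name}",
        # Agent image tag from the issue; GPU metrics are expected from 1.300034.0 on.
        "--set", f"image.tag={image_tag}",
        # Explicitly enabled, as described in the issue.
        "--set", "enhancedContainerInsights.enabled=true",
    ]


if __name__ == "__main__":
    # "my-gpu-cluster" is a placeholder cluster name.
    print(shlex.join(helm_install_args("my-gpu-cluster")))
```

Printing the command instead of running it makes it easy to review the overrides before passing the list to `subprocess.run`.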

Expected outcome

I'd expect the GPU metrics to show up in CloudWatch.
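One way to check this expectation without clicking through the console is to list the metrics in the ContainerInsights namespace and filter for GPU-related names. The sketch below assumes boto3 is installed with working AWS credentials, that the metrics carry a `ClusterName` dimension, and that GPU metric names contain the substring "gpu"; the helper names are hypothetical.

```python
def gpu_metric_names(metrics):
    """Return sorted, de-duplicated metric names that mention 'gpu'.

    `metrics` is a list of dicts shaped like the entries under "Metrics"
    in a CloudWatch list_metrics response.
    """
    return sorted({m["MetricName"] for m in metrics
                   if "gpu" in m["MetricName"].lower()})


def list_container_insights_metrics(cluster_name, region):
    """Fetch all ContainerInsights metrics for one cluster (needs AWS access)."""
    import boto3  # imported lazily so the filter above works without boto3

    client = boto3.client("cloudwatch", region_name=region)
    paginator = client.get_paginator("list_metrics")
    found = []
    # Assumption: the cluster's metrics carry a "ClusterName" dimension.
    for page in paginator.paginate(
        Namespace="ContainerInsights",
        Dimensions=[{"Name": "ClusterName", "Value": cluster_name}],
    ):
        found.extend(page["Metrics"])
    return found
```

If `gpu_metric_names(list_container_insights_metrics(...))` comes back empty while CPU and memory metrics are present, the agent is running but not publishing GPU metrics, which matches what I'm seeing.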

Environment

Additional Context:

I've successfully set up the metrics collection for GPU metrics on EC2 instances before, but it doesn't seem to work on EKS using this chart.