GoogleCloudPlatform / ai-on-gke

AI on GKE is a collection of examples, best-practices, and prebuilt solutions to help build, deploy, and scale AI Platforms on Google Kubernetes Engine
Apache License 2.0

add missing dcgm metrics #710

Closed annapendleton closed 3 weeks ago

annapendleton commented 3 weeks ago

Small change to add a few missing DCGM metrics

annapendleton commented 3 weeks ago

/gcbrun

kfswain commented 3 weeks ago

Is this just to make sure any infra spun up by the terraform has these metrics, so it can get picked up by the prom scraper? Do we want to extend the runner metric capture to include this also?

annapendleton commented 3 weeks ago

> Is this just to make sure any infra spun up by the terraform has these metrics, so it can get picked up by the prom scraper? Do we want to extend the runner metric capture to include this also?

Yes, this PR is mainly scoped to that: these metrics currently aren't being scraped for any infra run, and we want them to be.

IIRC the runner currently captures only GPU utilization, whereas the DCGM exporter layer captures all of the related DCGM metrics.

As for what we should include in the runner: not all of these metrics are immediately useful for analysis. I think it's a great idea to add the useful ones, e.g. memory usage and power usage, which have come up as two important ones in our recent autoscaling discussions. I'm thinking it makes sense to add those in a follow-up PR :)
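
To make the follow-up idea concrete, here is a rough sketch of how a runner could pick a few DCGM metrics out of Prometheus text-format output. It is not taken from the ai-on-gke runner: the function name `parse_dcgm_samples`, the `WANTED_METRICS` set, and the sample payload are all hypothetical. The metric names are standard DCGM field identifiers, but whether they are the ones this PR adds is an assumption.

```python
# Hypothetical sketch: pick selected DCGM metrics out of Prometheus
# text-format output. Nothing here is taken from the ai-on-gke runner.

# Assumed set of metrics worth capturing; the thread only mentions
# GPU utilization, memory usage, and power usage in general terms.
WANTED_METRICS = {
    "DCGM_FI_DEV_GPU_UTIL",     # GPU utilization (%)
    "DCGM_FI_DEV_FB_USED",      # framebuffer memory used (MiB)
    "DCGM_FI_DEV_POWER_USAGE",  # power draw (W)
}


def parse_dcgm_samples(exposition_text, wanted=WANTED_METRICS):
    """Return {metric_name: [float, ...]} for the wanted metrics."""
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blanks and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the label block or the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name not in wanted:
            continue
        # The value follows the closing brace of the label block, if any.
        rest = line.rsplit("}", 1)[-1] if "{" in line else line[len(name):]
        value = float(rest.split()[0])
        samples.setdefault(name, []).append(value)
    return samples


# Made-up sample payload, for illustration only.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 87
DCGM_FI_DEV_FB_USED{gpu="0"} 20480
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 251.5
DCGM_FI_DEV_SM_CLOCK{gpu="0"} 1410
"""

if __name__ == "__main__":
    for name, values in sorted(parse_dcgm_samples(SAMPLE).items()):
        print(name, values)
```

Keeping the wanted metrics in one explicit set would make it easy to start with memory and power usage and extend the list later if other metrics turn out to be useful for analysis.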