annapendleton closed 3 weeks ago
/gcbrun
Is this just to make sure any infra spun up by the terraform has these metrics, so it can get picked up by the prom scraper? Do we want to extend the runner metric capture to include this also?
Yep, this PR is mainly scoped to that: these metrics aren't currently being scraped for any infra run, and we want them to be.
The runner currently only captures GPU utilization IIRC, vs all of the related DCGM metrics captured at the DCGM exporter layer.
As for what we should include in the runner: not all of these metrics are immediately useful for analysis. I think it's a great idea to add the useful ones, e.g. memory usage and power usage have been two important ones in our recent autoscaling discussions. I'm thinking it's a good idea to add those in a follow-up PR :)
Small change to add a few missing DCGM metrics