Make Prometheus TSDB metrics available for faster debugging of active usage

The Prometheus API endpoint api/v1/status/tsdb exposes a number of statistics (see the attached tsdb.txt) covering label cardinality, active series per metric name, head series and more. At present, Amazon Managed Prometheus exposes only a few of these, and only via CloudWatch.
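For reference, here is a rough sketch of reading these statistics from a self-managed Prometheus. It assumes the upstream path /api/v1/status/tsdb and the documented response fields (headStats, seriesCountByMetricName); the helper names, the base URL and the sample numbers are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_tsdb_status(base_url, timeout=10.0):
    """Fetch the TSDB status from a Prometheus server and return its 'data' object.

    base_url is a placeholder such as "http://localhost:9090".
    """
    url = f"{base_url.rstrip('/')}/api/v1/status/tsdb"
    with urlopen(url, timeout=timeout) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"unexpected response status: {body.get('status')!r}")
    return body["data"]


def top_series(data, n=3):
    """Return the n metric names with the most series as (name, count) pairs."""
    entries = data.get("seriesCountByMetricName", [])
    ranked = sorted(entries, key=lambda e: e["value"], reverse=True)
    return [(e["name"], e["value"]) for e in ranked[:n]]


# Illustrative payload shaped like the 'data' object; the values are invented.
SAMPLE = {
    "headStats": {"numSeries": 508, "chunkCount": 937},
    "seriesCountByMetricName": [
        {"name": "up", "value": 12},
        {"name": "http_requests_total", "value": 120},
        {"name": "node_cpu_seconds_total", "value": 64},
    ],
}

if __name__ == "__main__":
    for name, count in top_series(SAMPLE, n=2):
        print(f"{name}: {count} series")
```

Running this kind of check against AMP directly is exactly what we cannot do today, since the workspace does not surface these numbers as queryable metrics.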

This forces us to export the metrics from CloudWatch and write them back into AMP, when they could easily be made available in AMP itself under an amp_tsdb prefix.

Context:

Internally, we run the Prometheus Operator in our EKS cluster and push metrics to AMP via remote write. When we reach the AMP limits, we suddenly start getting 400 Bad Request errors, which leads to data loss. At present we have no proper visibility into this, because Amazon Managed Prometheus exposes so little metric data. These metrics would help us fix that.

How could you do it?

Prometheus JSON Exporter can be run as a sidecar for each Cortex instance that you run. A static scrape config can then scrape these metrics and push them to AMP. Finally, recording rules within AMP can aggregate them into workspace-wide TSDB metrics.
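To make the idea concrete, here is a minimal sketch of the conversion step the exporter sidecar would perform: turning the TSDB status JSON into Prometheus text exposition lines under the amp_tsdb prefix. This is not the JSON Exporter's own code or config. The metric names (head_series, series_count_by_metric_name, ...) and label names (metric_name, label_name) are hypothetical choices, and the sample payload is invented.

```python
def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_exposition(data, prefix="amp_tsdb"):
    """Render a TSDB status 'data' object as text exposition lines."""
    lines = []

    # Head-level gauges; the metric names on the right are hypothetical.
    head = data.get("headStats", {})
    for key, metric in (("numSeries", "head_series"), ("chunkCount", "head_chunks")):
        if key in head:
            lines.append(f"{prefix}_{metric} {head[key]}")

    # One sample per metric name, carrying its active series count.
    for entry in data.get("seriesCountByMetricName", []):
        name = escape_label(entry["name"])
        value = entry["value"]
        lines.append(f'{prefix}_series_count_by_metric_name{{metric_name="{name}"}} {value}')

    # One sample per label name, carrying its number of distinct values.
    for entry in data.get("labelValueCountByLabelName", []):
        name = escape_label(entry["name"])
        value = entry["value"]
        lines.append(f'{prefix}_label_value_count_by_label_name{{label_name="{name}"}} {value}')

    return "\n".join(lines) + "\n"


# Illustrative payload; the values are invented.
SAMPLE = {
    "headStats": {"numSeries": 508, "chunkCount": 937},
    "seriesCountByMetricName": [{"name": "http_requests_total", "value": 120}],
    "labelValueCountByLabelName": [{"name": "instance", "value": 4}],
}

if __name__ == "__main__":
    print(to_exposition(SAMPLE), end="")
```

With per-instance samples like these in AMP, a recording rule that sums them across instances would give the workspace-wide view described above.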

Hope to see this real soon! Happy to help with the implementation details. We have done this locally, and it works like a charm with our Prometheus Operator setup.

tsdb.txt

aws / amazon-managed-service-for-prometheus-roadmap

Make Prometheus TSDB metrics available for faster debugging of active usage #24