Closed stvnwrgs closed 9 months ago
Is there any use of a push-based mechanism for metrics, as opposed to the traditional scraping method?
For push-based metrics, we recommend using an OpenTelemetry (OTel) sidecar. More details can be found here: https://cloud.google.com/run/docs/tutorials/custom-metrics-opentelemetry-sidecar
This sidecar is intended for pull-based Prometheus metrics. You can find more details on how to run it here: https://cloud.google.com/run/docs/monitoring-managed-prometheus-sidecar
Is there a mechanism in place to ensure that metrics are reliably delivered in a dynamically scaling environment?
Yes. This sidecar makes some changes to the Prometheus libraries so that it can guarantee a datapoint is ingested for every metric, no matter how short-lived the instance is or whether the Cloud Run instances are being scaled up or down. It does this by making a final scrape before anything is shut down. It also scrapes within 10s of container startup and then at regular intervals. The sidecar is also guaranteed to shut down before the application container does.
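To make the schedule described above concrete, here is a toy timeline in Python. It is only a sketch of the behaviour as described in this answer, not the sidecar's actual code: the function name `sidecar_scrapes`, the 30s interval, and placing the startup scrape at exactly 10s are all assumptions made for illustration.

```python
def sidecar_scrapes(lifetime_s, interval_s=30.0, startup_delay_s=10.0):
    """Return scrape times (seconds since container start) for a toy model:
    one startup scrape, regular interval scrapes, and a final scrape at shutdown.

    interval_s and startup_delay_s are illustrative assumptions, not sidecar defaults.
    """
    times = []
    t = startup_delay_s
    # Startup scrape followed by interval scrapes, while the instance is still alive.
    while t < lifetime_s:
        times.append(t)
        t += interval_s
    # Final scrape taken as the sidecar shuts down, before the application container stops.
    times.append(lifetime_s)
    return times


short_lived = sidecar_scrapes(3)    # instance gone before the startup scrape
long_lived = sidecar_scrapes(75)    # startup, interval, and shutdown scrapes
```

In this model even an instance that lives for 3 seconds is scraped once, because the shutdown scrape does not depend on the interval timer.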
Does the container have a process to ensure that metrics are collected before it is terminated during scaling down operations?
Yes. The sidecar container is terminated before the application container, and it performs a shutdown scrape when this happens. For more information about how the container dependency is set up, see https://cloud.google.com/run/docs/monitoring-managed-prometheus-sidecar#sidecar-intro
Description
I have a question regarding the handling of metrics in dynamically scaling services, specifically related to Prometheus endpoints in the context of Google Cloud services.
Background
In typical scenarios, Prometheus endpoints are associated with long-running services that are periodically scraped for metrics. However, in a dynamic scaling environment where containers are frequently scaled up and down, this approach seems to pose a risk of data loss.
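A small sketch of the concern, assuming a scraper that ticks on a fixed global interval (at 0, interval, 2 x interval, and so on) and an instance that is only reachable while it is alive. The function name `periodic_scrapes` and the numbers are hypothetical and chosen only to illustrate the gap.

```python
import math


def periodic_scrapes(start_s, lifetime_s, interval_s):
    """Return the tick times of a fixed-interval scraper that fall inside
    the instance's lifetime [start_s, start_s + lifetime_s]."""
    end_s = start_s + lifetime_s
    first_tick = math.ceil(start_s / interval_s)   # first tick at or after start
    last_tick = math.floor(end_s / interval_s)     # last tick at or before shutdown
    return [k * interval_s for k in range(first_tick, last_tick + 1)]


# Instance starts 5s after a tick and lives 20s with a 60s interval: never scraped.
missed = periodic_scrapes(start_s=5, lifetime_s=20, interval_s=60)

# Instance lives 90s: scraped once at t=60; the last 35s of data is never collected.
partial = periodic_scrapes(start_s=5, lifetime_s=90, interval_s=60)
```

Both cases lose data: the first instance is invisible to the scraper, and the second loses whatever it recorded after its last scrape.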
Questions
Documentation Gap
I could not find relevant information in the Google Cloud Documentation or in the existing issues here, leading me to raise this query.
Objective
Understanding the above mechanisms is crucial for designing a robust and data-loss-resistant monitoring setup in a dynamically scaling cloud environment.