Stackdriver / stackdriver-prometheus-sidecar

A sidecar for the Prometheus server that can send metrics to Stackdriver.
https://cloud.google.com/monitoring/kubernetes-engine/prometheus
Apache License 2.0

Add flag to throttle ingestion requests per minute #259

Open sagar-infinitus-ai opened 3 years ago

sagar-infinitus-ai commented 3 years ago

We had an incident (it seemed to coincide with the Google outage on Dec 14) where, for about 50 minutes, all requests to MetricService.CreateTimeSeries were failing.

When the API eventually recovered, the stackdriver sidecar attempted to send all of the outstanding data at once, hitting the quota limit for "Time series ingestion requests per minute".
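This is what motivates the flag in the title: a client-side limiter that drains the backlog at a bounded rate instead of bursting past the quota. A minimal sketch of the idea, assuming a hypothetical per-minute budget and a stand-in sendBatch function (not the sidecar's actual send path), using golang.org/x/time/rate:

```go
package main

import (
	"context"
	"log"

	"golang.org/x/time/rate"
)

// sendBatch stands in for the sidecar's real send path, which would call
// MetricService.CreateTimeSeries with a batch of time series.
func sendBatch(ctx context.Context, batch []string) error {
	log.Printf("sent batch of %d series", len(batch))
	return nil
}

func main() {
	// Hypothetical value of a --stackdriver.max-requests-per-minute flag.
	const maxRequestsPerMinute = 600

	// Convert the per-minute budget into a per-second token rate with a
	// small burst, so short spikes are absorbed but the average holds.
	limiter := rate.NewLimiter(rate.Limit(float64(maxRequestsPerMinute)/60.0), 10)

	ctx := context.Background()
	backlog := [][]string{{"a"}, {"b"}, {"c"}}
	for _, batch := range backlog {
		// Wait blocks until a token is available, so a large backlog after
		// an outage drains at a bounded rate rather than all at once.
		if err := limiter.Wait(ctx); err != nil {
			log.Fatal(err)
		}
		if err := sendBatch(ctx, batch); err != nil {
			log.Printf("send failed: %v", err)
		}
	}
}
```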

Once this quota was hit, the sidecar was never able to recover. Eventually, the stackdriver container simply stopped working (high CPU usage; statusz not responding), with the final few log messages repeating:

```
QueueManager.updateShardsLoop
"Currently resharding, skipping"
QueueManager.calculateDesiredShards
```

At this point, there was no option other than to restart the whole pod (prometheus-server + stackdriver).

Is there anything we're missing? Is this situation recoverable other than by restarting the pod (and losing all unsent metrics)?
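To make the ask concrete, wiring the flag could look roughly like this; the flag name and default are illustrative only, assuming kingpin-style flag parsing like the sidecar uses, and a nil limiter preserves today's unthrottled behavior:

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
	"gopkg.in/alecthomas/kingpin.v2"
)

// Hypothetical flag; the name and default are illustrative only.
var maxRequestsPerMinute = kingpin.Flag(
	"stackdriver.max-requests-per-minute",
	"Upper bound on CreateTimeSeries requests per minute (0 disables throttling).",
).Default("0").Int()

func main() {
	kingpin.Parse()

	// A nil limiter means "no throttling", keeping current behavior by default.
	var limiter *rate.Limiter
	if *maxRequestsPerMinute > 0 {
		limiter = rate.NewLimiter(rate.Limit(float64(*maxRequestsPerMinute)/60.0), 10)
	}
	fmt.Println("throttling enabled:", limiter != nil)
}
```

With something like that in place, e.g. --stackdriver.max-requests-per-minute=600, a post-outage backlog would be spread over several minutes instead of bursting past the quota and wedging the queue manager.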