We had an incident (which seemed to coincide with the recent Google outage on Dec 14) where, for a period of about 50 minutes, ALL requests to MetricService.CreateTimeSeries were failing.
When the API eventually recovered, the stackdriver sidecar attempted to send all outstanding data, hitting the quota limit for "Time series ingestion requests per minute".
Once this quota was hit, the sidecar was never able to recover. Eventually, the stackdriver container just stopped (high CPU usage, statusz not responding), with the final few log messages repeating:
At this point, there was no option other than to restart the whole pod (prometheus-server + stackdriver).
Is there anything we're missing? Is this situation recoverable other than by restarting the pod (and losing all unsent metrics)?
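To illustrate the kind of behaviour we were hoping for: after an outage, backlog replay could be paced so it stays under the per-minute ingestion quota instead of flooding the API. This is a hypothetical sketch (the `QuotaPacer` class and its parameters are ours, not a sidecar feature), just to show the idea of a sliding-window client-side limiter:

```python
import time


class QuotaPacer:
    """Paces sends so that at most `limit` requests go out per `period` seconds.

    Illustrative only: the real quota ("Time series ingestion requests per
    minute") and the sidecar's internal send loop are not modelled here.
    """

    def __init__(self, limit, period=60.0, clock=time.monotonic):
        self.limit = limit
        self.period = period
        self.clock = clock
        self.sent = []  # timestamps of sends within the current window

    def wait_time(self):
        """Seconds to wait before the next send is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Forget sends older than one window.
        self.sent = [t for t in self.sent if now - t < self.period]
        if len(self.sent) < self.limit:
            return 0.0
        # Wait until the oldest send in the window expires.
        return self.sent[0] + self.period - now

    def record(self):
        """Record that a send just happened."""
        self.sent.append(self.clock())
```

A replay loop would call `wait_time()` before each batch and sleep for the returned duration, so the backlog drains at quota speed rather than tripping the limit and (as we saw) never recovering.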