Stackdriver / stackdriver-prometheus-sidecar

A sidecar for the Prometheus server that can send metrics to Stackdriver.
https://cloud.google.com/monitoring/kubernetes-engine/prometheus
Apache License 2.0

Indefinitely-blocked sidecar problems #270

Closed jmacd closed 3 years ago

jmacd commented 3 years ago

There are two ways that the Stackdriver sidecar can be blocked indefinitely. Note that Lightstep has an OpenTelemetry fork of this code and has observed the issue there. See the downstream repository issue https://github.com/lightstep/opentelemetry-prometheus-sidecar/issues/88.

The problem has to start with an indefinitely blocked send. If a particular data point repeatedly returns a recoverable error, the sidecar may fail to progress in two ways. This might happen, for example, if any one point always times out on the server.

With that condition in mind, note that (*shardCollection).sendSamplesWithBackoff() will retry forever. This can block the sidecar in two ways (a sketch of the retry loop follows the list below):

  1. The reshard operation tries to stop the running shards before starting new ones. If the shard that is writing a point never exits, the new shards can't start. (Note that the OTLP fork does not require in-order writes, so it has dispensed with this potential deadlock.)
  2. The Prometheus reader may become blocked on the stuck shard, with or without a resharding event, because the queue has limited capacity. Once this happens, the sidecar is effectively deadlocked. (The OTLP fork has an emergency mitigation for this risk, using a timeout to prevent any single point from blocking the sidecar indefinitely; a sketch of such a timeout appears below. This is not a perfect mitigation, because once the reader becomes blocked, the resharding process begins reducing the number of shards.)
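
To make the failure mode concrete, here is a minimal Go sketch of the retry behavior described above. It is not the sidecar's actual code; the types and helpers (shard, queue, done, and the send and recoverable callbacks) are hypothetical stand-ins for the real queue manager and RPC path.

```go
// Hypothetical sketch (not the sidecar's actual code) of how one always-failing
// point can block a shard forever.
package sidecarsketch

import "time"

type sample struct{ value float64 }

type shard struct {
	queue chan sample   // bounded; the Prometheus reader blocks once it fills up
	done  chan struct{} // closed when a reshard asks this shard to stop
}

// sendSamplesWithBackoff keeps retrying the same batch until it succeeds or the
// error becomes unrecoverable. If one point in the batch always fails with a
// recoverable error (e.g. the server always times out on it), this loop never
// returns.
func (s *shard) sendSamplesWithBackoff(batch []sample,
	send func([]sample) error, recoverable func(error) bool) {
	backoff := 100 * time.Millisecond
	for {
		err := send(batch)
		if err == nil || !recoverable(err) {
			return
		}
		time.Sleep(backoff)
		if backoff < 10*time.Second {
			backoff *= 2
		}
	}
}

// run drains the queue one sample at a time. While sendSamplesWithBackoff is
// stuck, run never checks s.done, so a reshard that waits for this shard to
// stop waits forever (failure mode 1), and once the bounded queue fills up the
// Prometheus reader blocks as well (failure mode 2).
func (s *shard) run(send func([]sample) error, recoverable func(error) bool) {
	for {
		select {
		case <-s.done:
			return
		case smp := <-s.queue:
			s.sendSamplesWithBackoff([]sample{smp}, send, recoverable)
		}
	}
}
```

The essential point is that nothing in the loop bounds how long a single batch may be retried.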

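For illustration, the per-send timeout mentioned in point 2 could look roughly like the following. This is only a sketch under that assumption, not the OTLP fork's actual mitigation; sendWithDeadline and the send callback are hypothetical names.

```go
// Hypothetical sketch of a per-attempt timeout: bound each send so that a
// single point cannot hold a shard indefinitely, at the cost of eventually
// dropping that point.
package sidecarsketch

import (
	"context"
	"time"
)

// sendWithDeadline wraps one send attempt in a context deadline. When the
// deadline expires, the attempt is abandoned rather than retried forever.
func sendWithDeadline(parent context.Context,
	send func(context.Context) error, limit time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, limit)
	defer cancel()
	return send(ctx)
}
```
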
Ultimately, this is not a problem as long as the target service never fails in this kind of corner case. Reporting this as a new issue to get more attention; it is possibly a duplicate of #233.

jmacd commented 3 years ago

I consider this an informational report.