Stackdriver / stackdriver-prometheus-sidecar

A sidecar for the Prometheus server that can send metrics to Stackdriver.
https://cloud.google.com/monitoring/kubernetes-engine/prometheus
Apache License 2.0

Prometheus-operator stackdriver sidecar sharding events #233

Open · dgdevops opened this issue 4 years ago

dgdevops commented 4 years ago

I am using ServiceMonitor k8s resources to add targets to Prometheus. Metrics keep arriving in Stackdriver from the sidecar until I add a ServiceMonitor to my k8s cluster that adds 220 targets to my Prometheus. Once those targets come up, ALL metrics in Stackdriver stop at the same time and no new metric values appear. Based on the sidecar container logs, shard calculation takes place:

level=debug ts=2020-04-30T08:51:20.975Z caller=queue_manager.go:317 component=queue_manager msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=9.107276804519778e-05 upperBound=1.1
level=debug ts=2020-04-30T08:51:35.975Z caller=queue_manager.go:306 component=queue_manager msg=QueueManager.calculateDesiredShards samplesIn=0.028438730446968884 samplesOut=0.035548413058711106 samplesOutDuration=27897.643824423412 timePerSample=784778.8810810816 sizeRate=70059.18401954918 offsetRate=260863.64812517414 desiredShards=7.020667105478262e-05

This keeps going for hours, but the metrics do not return to Stackdriver. Could you please help me understand the sharding? Additionally, how could I speed up the process?

Thanks
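For context on the log lines above, here is a minimal, self-contained sketch of the kind of reshard decision QueueManager.updateShardsLoop is reporting. The constants, names, and band check are assumptions modeled on Prometheus's remote-write queue manager, chosen only to match the logged values (lowerBound=0.7, upperBound=1.1 around a single shard); this is not the sidecar's exact code.

// Illustrative sketch of the reshard decision seen in the updateShardsLoop
// debug lines above. NOT the sidecar's actual code; constants are assumptions.
package main

import (
    "fmt"
    "math"
)

const (
    minShards = 1
    maxShards = 200
    // Tolerance band around the current shard count: only reshard when
    // desiredShards leaves [lower, upper]. The fractions are assumptions
    // picked to reproduce the logged band of 0.7..1.1 for one shard.
    lowerToleranceFraction = 0.3
    upperToleranceFraction = 0.1
)

// decideShards compares the float desiredShards against a band derived from
// the current shard count and clamps the result to [minShards, maxShards].
func decideShards(currentShards int, desiredShards float64) int {
    lower := float64(currentShards) * (1 - lowerToleranceFraction)
    upper := float64(currentShards) * (1 + upperToleranceFraction)
    if lower <= desiredShards && desiredShards <= upper {
        return currentShards // within tolerance: keep the current shard count
    }
    n := int(math.Ceil(desiredShards))
    if n < minShards {
        n = minShards
    }
    if n > maxShards {
        n = maxShards
    }
    return n
}

func main() {
    // desiredShards taken from the debug log above: far below 1, so with a
    // single shard already running nothing changes.
    fmt.Println(decideShards(1, 9.107276804519778e-05)) // prints 1: already at the minimum of one shard
}

Read this way, a desiredShards of roughly 9e-05 only means that a single shard is far more than enough for the observed send rate. The tiny value is a symptom of samples not flowing rather than the cause, which is consistent with the diagnosis in the comments below.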

jmacd commented 3 years ago

I strongly suspect this is due to particular data points causing an unrecoverable error that looks recoverable. Explaining it requires some kind of never-succeeding request, but the sidecar logic absolutely can fall into a permanent retry loop and block the WAL reader when that happens. This is documented in the downstream repository:

https://github.com/lightstep/opentelemetry-prometheus-sidecar/issues/88

It is also partly mitigated by:

https://github.com/lightstep/opentelemetry-prometheus-sidecar/pulls/87
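On the point about an unrecoverable error that looks recoverable, here is a self-contained sketch of the failure mode, assuming gRPC status codes drive the retry decision. The classify helper and the set of codes are illustrative assumptions, not taken from the sidecar's source; only the recoverableError name corresponds to the type asserted in the function quoted in the next comment.

// Illustrative only: how a storage client might classify errors, assuming
// gRPC status codes decide retryability. The exact set of codes the sidecar
// treats as recoverable is an assumption here.
package main

import (
    "fmt"

    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// recoverableError marks an error that the send loop is allowed to retry.
type recoverableError struct{ error }

// classify wraps err as recoverable for status codes that normally mean
// "try again later". A request that fails with such a code on every attempt
// (e.g. a deadline that can never be met for a given batch) is retried
// indefinitely by sendSamplesWithBackoff.
func classify(err error) error {
    switch status.Code(err) {
    case codes.Unavailable, codes.DeadlineExceeded, codes.ResourceExhausted:
        return recoverableError{err}
    default:
        return err
    }
}

func main() {
    err := classify(status.Error(codes.DeadlineExceeded, "context deadline exceeded"))
    _, retryable := err.(recoverableError)
    fmt.Println("retryable:", retryable) // true: retried forever if it never succeeds
}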

jmacd commented 3 years ago

This is the function that never returns:

// sendSamples to the remote storage with backoff for recoverable errors.
func (s *shardCollection) sendSamplesWithBackoff(client StorageClient, samples []*monitoring_pb.TimeSeries) {
    backoff := s.qm.cfg.MinBackoff
    for {
        begin := time.Now()
        err := client.Store(&monitoring_pb.CreateTimeSeriesRequest{TimeSeries: samples})

        sentBatchDuration.WithLabelValues(s.qm.queueName).Observe(time.Since(begin).Seconds())
        if err == nil {
            succeededSamplesTotal.WithLabelValues(s.qm.queueName).Add(float64(len(samples)))
            return
        }

        if _, ok := err.(recoverableError); !ok {
            level.Warn(s.qm.logger).Log("msg", "Unrecoverable error sending samples to remote storage", "err", err)
            break
        }

        // Recoverable error: back off and retry. There is no retry limit or
        // deadline, so a request that always fails with a "recoverable" error
        // keeps this loop spinning forever and blocks the WAL reader behind it.
        time.Sleep(time.Duration(backoff))
        backoff = backoff * 2
        if backoff > s.qm.cfg.MaxBackoff {
            backoff = s.qm.cfg.MaxBackoff
        }
    }

    failedSamplesTotal.WithLabelValues(s.qm.queueName).Add(float64(len(samples)))
}
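
For comparison, here is a self-contained sketch of one way to avoid the permanent loop: cap the total time spent retrying a single batch so the WAL reader is eventually unblocked even when an error keeps being classified as recoverable. This is only an illustration of the general pattern, not a description of the change in the pull request linked above; all names below are hypothetical.

// Hedged sketch of a bounded retry loop; not the sidecar's code or the
// change made in the linked PR.
package main

import (
    "errors"
    "fmt"
    "time"
)

// sendWithBoundedBackoff retries send while the error is classified as
// recoverable, but gives up once the retry budget is spent, so a request
// that never succeeds cannot block the caller forever.
func sendWithBoundedBackoff(send func() error, isRecoverable func(error) bool,
    minBackoff, maxBackoff, retryBudget time.Duration) error {

    backoff := minBackoff
    deadline := time.Now().Add(retryBudget)
    for {
        err := send()
        if err == nil {
            return nil
        }
        // Stop on a genuinely unrecoverable error, or when a "recoverable"
        // error has been failing for longer than the budget allows.
        if !isRecoverable(err) || time.Now().After(deadline) {
            return err
        }
        time.Sleep(backoff)
        backoff *= 2
        if backoff > maxBackoff {
            backoff = maxBackoff
        }
    }
}

func main() {
    alwaysFails := func() error { return errors.New("deadline exceeded") }
    recoverable := func(error) bool { return true }
    err := sendWithBoundedBackoff(alwaysFails, recoverable,
        10*time.Millisecond, 100*time.Millisecond, 500*time.Millisecond)
    fmt.Println("gave up:", err) // returns after ~0.5s instead of looping forever
}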

varun-krishna commented 3 years ago

I see the same behaviour, with the same messages, from the Stackdriver sidecar:

level=debug ts=2021-02-09T07:25:54.294Z caller=queue_manager.go:306 component=queue_manager msg=QueueManager.calculateDesiredShards samplesIn=0.00173154100250915 samplesOut=0.00173154100250915 samplesOutDuration=5557.854004867412 timePerSample=3.2097732579324483e+06 sizeRate=4890.771316463715 offsetRate=2.134860677902194 desiredShards=0.019098805764792753

level=debug ts=2021-02-09T07:25:54.294Z caller=queue_manager.go:317 component=queue_manager msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.019098805764792753 upperBound=1.1