Stackdriver / stackdriver-prometheus-sidecar

A sidecar for the Prometheus server that can send metrics to Stackdriver.
https://cloud.google.com/monitoring/kubernetes-engine/prometheus
Apache License 2.0

Add option to specify how much in the past to tail #200

Closed r1ckr closed 4 years ago

r1ckr commented 4 years ago

Hey, we are deploying this into our current Prometheus cluster, which holds data for the last 15 days. This hits our quota limit and incurs charges.

Would it be possible to add an option to the sidecar to export only data from the last X amount of time?

qingling128 commented 4 years ago

Hi @r1ckr,

The sidecar is designed to ingest time series as they come in, so exporting only the last X amount of time might not be an option. On the Prometheus server side, you can configure block duration and retention as described at https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects.

BTW, which quota are you hitting?

r1ckr commented 4 years ago

Hey @qingling128. I am hitting the Stackdriver API calls limit in GCP.

The issue is that we are deploying this sidecar beside our current Prometheus server, which has a data retention of 15 days. The sidecar starts sending data older than the maximum number of hours in the past that Stackdriver allows in GCP, so we get a lot of denials until we hit the quota limit, and then nothing gets pushed to Stackdriver.

qingling128 commented 4 years ago

Hi @r1ckr,

Do you mean you are hitting the Stackdriver Monitoring API requests-per-minute write quota? How many time series are we sending per minute? If the number of time series in one minute is large enough, we will hit this quota. Reducing data retention would not help avoid this case, though, because the sidecar always sends time series as soon as possible. What we really need might be a throttle feature.

Taking one step back, it is rare for the number of time series to hit the Stackdriver Monitoring API requests-per-minute write quota. It's worth double-checking the metrics scraping targets and verifying how many time series we are sending. Then request an API quota increase if that's indeed necessary.

r1ckr commented 4 years ago

So, if we had a new instance of Prometheus, this would work OK, since it would only have about 5 minutes of data. But since we have 15 days' worth of data, the sidecar starts sending data from 15 days ago until now. Stackdriver doesn't let us send data older than 25 hours, so we consume all our API quota on data that never even got there.
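To make the age limit described above concrete, here is a minimal Python sketch. It assumes the 25-hour window quoted in this thread (not a value read from the Stackdriver API) and an evenly spaced backlog; `is_too_old` and `rejected_fraction` are hypothetical helper names, not part of the sidecar.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 25 hours is the limit quoted in this thread,
# not a value queried from the Stackdriver API.
MAX_SAMPLE_AGE = timedelta(hours=25)


def is_too_old(sample_time, now, max_age=MAX_SAMPLE_AGE):
    """Return True if a sample is older than the accepted window."""
    return now - sample_time > max_age


def rejected_fraction(oldest_sample_age, max_age=MAX_SAMPLE_AGE):
    """Fraction of an evenly spaced backlog that falls outside the window."""
    if oldest_sample_age <= max_age:
        return 0.0
    return (oldest_sample_age - max_age) / oldest_sample_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_too_old(now - timedelta(hours=26), now))
    print(rejected_fraction(timedelta(days=15)))
```

Under these assumptions, most of a 15-day backlog would be rejected, while a fresh instance with a few minutes of data would lose nothing.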

By the way, I started this thread because of this line in the sidecar logs: level=info ts=2019-11-28T22:05:04.917Z caller=manager.go:215 component="Prometheus reader" msg="reached first record after start offset" start_offset=217693673 skipped_records=0

This is the error the sidecar throws after we hit the quota limit: level=warn ts=2019-11-28T22:07:34.478Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = ResourceExhausted desc = Quota exceeded for quota metric 'Time series ingestion requests' and limit 'Time series ingestion requests per minute' of service 'monitoring.googleapis.com' for consumer 'project_number:__OUR_PROJECT_NUMBER_HERE__'."
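Log lines like the two above are in logfmt (space-separated `key=value` pairs with optional double quotes). A rough Python sketch for spotting the quota error in sidecar output follows. It uses `shlex.split` as a stand-in for a real logfmt parser, which is an assumption that holds for lines shaped like these but not for arbitrary logfmt; `parse_logfmt` and `is_quota_error` are hypothetical names.

```python
import shlex


def parse_logfmt(line):
    """Split a logfmt-style line into a dict of key/value pairs.

    shlex handles the double-quoted values; this is an approximation,
    not a full logfmt implementation.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that contain no '='
            fields[key] = value
    return fields


def is_quota_error(line):
    """True if the line is a warning whose err field mentions ResourceExhausted."""
    fields = parse_logfmt(line)
    return fields.get("level") == "warn" and "ResourceExhausted" in fields.get("err", "")
```

Piping the sidecar logs through `is_quota_error` line by line would show when the denials begin relative to the "reached first record after start offset" message.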

qingling128 commented 4 years ago

Hi @r1ckr,

Ah, I see what the issue is. What is your Prometheus configuration for the following fields?

--storage.tsdb.max-block-duration
--storage.tsdb.min-block-duration
--storage.tsdb.retention

I assume the retention is set to 15 days. What about the block duration? The earliest metrics the sidecar reaches back to should always be within the block duration, because by design the sidecar only tails the most recent block.

r1ckr commented 4 years ago

Thanks for coming back @qingling128. That's it then; the 3 values are set to their defaults:

storage.tsdb.max-block-duration | 36h
storage.tsdb.min-block-duration | 2h
storage.tsdb.retention          | 15d

That explains it: the sidecar is actually sending data from 36 hours ago and not from 15 days ago, right?

This makes sense. We would need to configure this to be shorter than 25h so we don't go over the quota limits so fast.

Will check that and confirm.
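The check described above can be sketched in Python: compare the configured maximum block duration against the 25-hour limit mentioned earlier in the thread. This assumes, as stated in the thread, that the sidecar reaches back at most one block. `parse_duration` and `backlog_fits` are hypothetical helpers, and the parser only handles single-unit values such as `36h` or `15d`, not every duration form Prometheus accepts.

```python
import re
from datetime import timedelta

# Maps the single-letter unit suffix to a timedelta keyword argument.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_duration(text):
    """Parse a single-unit duration such as '36h' or '15d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhdw])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def backlog_fits(max_block_duration, ingest_limit="25h"):
    """True if the block duration is shorter than the ingestion age limit.

    Assumption: the 25h default is the figure quoted in this thread.
    """
    return parse_duration(max_block_duration) < parse_duration(ingest_limit)


if __name__ == "__main__":
    print(backlog_fits("36h"))  # the value reported above
    print(backlog_fits("24h"))  # a candidate shorter setting
```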

qingling128 commented 4 years ago

Yeah, 36 hours ago should be the point from which it started sending metrics.

qingling128 commented 4 years ago

Seems like we don't need changes to the sidecar at this point. Closing this one for now.