Cortex should tell remote write clients to slow down on rate-limiting

cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

https://cortexmetrics.io/

Apache License 2.0

5.46k stars 794 forks source link

Cortex should tell remote write clients to slow down on rate-limiting #2037

Open bboreham opened 4 years ago

bboreham commented 4 years ago

Currently, when the ingest rate limit is hit for a tenant, Cortex goes immediately from accepting all data to dropping all data. The dropped data is lost permanently because the mechanism is to return a 500 error.

I propose Cortex should have an inbetween state where the return value indicates that the sender should slow down, but does not cause the data to be dropped. For instance it could return 429 - current Prometheus will retry immediately, but it could be enhanced to slow down. Further, Cortex could return an indication of how close to the limit the tenant is.

This idea is not original - I have seen similar in DynamoDB, CosmosDB and GitHub.

I think this would solve #837, probably better than #879.

csmarchbanks commented 4 years ago

I believe current Prometheus only retries on 5XX errors, though perhaps it should also retry 429.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

ranton256 commented 3 years ago

I have been doing some thinking about this and asked around for some other ideas, then after looking at the RFC that covers 429, I realize it actually has an optional header for “Retry-After” so that seems like at least one avenue we should consider.

The code to decide what errors are retryable was added in https://github.com/prometheus/prometheus/pull/2552/files and now is at https://github.com/prometheus/prometheus/blob/4e5b1722b342948a55b3d7753f6539040db0e5f0/storage/remote/client.go#L200 . It only treats 5xx as retryable. I think we should consider changing it to look for the "Retry-After" header and set that from Cortex or other remote storage providers of remote write so that we could give it a hint when to retry or not without having to send some less appropriate status code than 429.

We could also consider something like X-RateLimit, https://tools.ietf.org/id/draft-polli-ratelimit-headers-00.html

bboreham commented 3 years ago

Agreed. Couple more points at https://github.com/cortexproject/cortex/pull/3654#issuecomment-763006285

I couldn't see an existing issue in the Prometheus repo so opened https://github.com/prometheus/prometheus/issues/8418

csmarchbanks commented 3 years ago

I left the link in the issue Bryan opened as well, but there is an open PR for adding the Retry-After functionality to Prometheus that @Harkishen-Singh has been working on: https://github.com/prometheus/prometheus/pull/8237.

bboreham commented 3 years ago

The necessary part is now implemented in Prometheus https://github.com/prometheus/prometheus/pull/8237 (and made optional in https://github.com/prometheus/prometheus/pull/8477).

I'll leave this open until Cortex sends the retry-after header to make it slow down an appropriate amount.

mvadu commented 2 years ago

Is this still in roadmap?