Open bboreham opened 4 years ago
I believe current Prometheus only retries on 5XX errors, though perhaps it should also retry 429.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
I have been doing some thinking about this and asked around for some other ideas, then after looking at the RFC that covers 429, I realize it actually has an optional header for “Retry-After” so that seems like at least one avenue we should consider.
The code to decide what errors are retryable was added in https://github.com/prometheus/prometheus/pull/2552/files and now is at https://github.com/prometheus/prometheus/blob/4e5b1722b342948a55b3d7753f6539040db0e5f0/storage/remote/client.go#L200 . It only treats 5xx as retryable. I think we should consider changing it to look for the "Retry-After" header and set that from Cortex or other remote storage providers of remote write so that we could give it a hint when to retry or not without having to send some less appropriate status code than 429.
We could also consider something like X-RateLimit, https://tools.ietf.org/id/draft-polli-ratelimit-headers-00.html
Agreed. Couple more points at https://github.com/cortexproject/cortex/pull/3654#issuecomment-763006285
I couldn't see an existing issue in the Prometheus repo so opened https://github.com/prometheus/prometheus/issues/8418
I left the link in the issue Bryan opened as well, but there is an open PR for adding the Retry-After functionality to Prometheus that @Harkishen-Singh has been working on: https://github.com/prometheus/prometheus/pull/8237.
The necessary part is now implemented in Prometheus https://github.com/prometheus/prometheus/pull/8237 (and made optional in https://github.com/prometheus/prometheus/pull/8477).
I'll leave this open until Cortex sends the retry-after
header to make it slow down an appropriate amount.
Is this still in roadmap?
Currently, when the ingest rate limit is hit for a tenant, Cortex goes immediately from accepting all data to dropping all data. The dropped data is lost permanently because the mechanism is to return a 500 error.
I propose Cortex should have an inbetween state where the return value indicates that the sender should slow down, but does not cause the data to be dropped. For instance it could return 429 - current Prometheus will retry immediately, but it could be enhanced to slow down. Further, Cortex could return an indication of how close to the limit the tenant is.
This idea is not original - I have seen similar in DynamoDB, CosmosDB and GitHub.
I think this would solve #837, probably better than #879.