feat: limit retries to 5xx and 429

aws-observability / aws-rum-web

Amazon CloudWatch RUM Web Client

Apache License 2.0

114 stars 65 forks source link

feat: limit retries to 5xx and 429 #500

Closed williazz closed 3 months ago

williazz commented 5 months ago

In general, 4xx requests should not be retried because they are caused by client errors and will never succeed. Instead, we should limit retries to 5xx which actually have a chance to work. If we do not merge this PR, then we will continue putting unnecessary strain on end users and our own servers for work that we know cannot be accomplished.

Based on public RUM docs, we should retry on 429 and 5xx and should not on 400, 403, or 404. https://docs.aws.amazon.com/cloudwatchrum/latest/APIReference/API_PutRumEvents.html#API_PutRumEvents_Errors

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

ps863 commented 5 months ago

comment: certain 4xx request should be retried. Eg - Throttling 429 That said, it seems we are doing linear retry which is probably not the right practice. We should modify to to an exponential backoff.

williazz commented 5 months ago

comment: certain 4xx request should be retried. Eg - Throttling 429

Agreed. As we discussed offline, some 4xx should be retried, some 5xx should not.

That said, it seems we are doing linear retry which is probably not the right practice. We should modify to to an exponential backoff.

+1 but wanted to keep the pr focused

williazz commented 5 months ago

Related PR for exponential backoff https://github.com/aws-observability/aws-rum-web/pull/501

williazz commented 4 months ago

Based on documentation for PutRumEvents response status codes, we should not retry on 400, 403, or 404. But we should retry on 429 and 500 (5xx).

https://docs.aws.amazon.com/cloudwatchrum/latest/APIReference/API_PutRumEvents.html#API_PutRumEvents_Errors

qhanam commented 3 months ago

If we do not merge this PR, then we will continue putting unnecessary strain on our servers for work that we know cannot be accomplished.

I think a better justification is that it saves the client from doing unnecessary work. Based on the docs you linked, this would occur on ResourceNotFoundException (404), AccessDeniedException (403), or ValidationException (400).