Closed williazz closed 3 months ago
comment: certain 4xx request should be retried. Eg - Throttling 429 That said, it seems we are doing linear retry which is probably not the right practice. We should modify to to an exponential backoff.
comment: certain 4xx request should be retried. Eg - Throttling 429
Agreed. As we discussed offline, some 4xx should be retried, some 5xx should not.
That said, it seems we are doing linear retry which is probably not the right practice. We should modify to to an exponential backoff.
+1 but wanted to keep the pr focused
Related PR for exponential backoff https://github.com/aws-observability/aws-rum-web/pull/501
Based on documentation for PutRumEvents response status codes, we should not retry on 400, 403, or 404. But we should retry on 429 and 500 (5xx).
If we do not merge this PR, then we will continue putting unnecessary strain on our servers for work that we know cannot be accomplished.
I think a better justification is that it saves the client from doing unnecessary work. Based on the docs you linked, this would occur on ResourceNotFoundException (404), AccessDeniedException (403), or ValidationException (400).
In general, 4xx requests should not be retried because they are caused by client errors and will never succeed. Instead, we should limit retries to 5xx which actually have a chance to work. If we do not merge this PR, then we will continue putting unnecessary strain on end users and our own servers for work that we know cannot be accomplished.
Based on public RUM docs, we should retry on 429 and 5xx and should not on 400, 403, or 404. https://docs.aws.amazon.com/cloudwatchrum/latest/APIReference/API_PutRumEvents.html#API_PutRumEvents_Errors
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.