Closed williazz closed 3 months ago
We want to be careful not to add unnecessary configuration options. In this case, my intuition is that we can build a solution that just works for all users. What would be a permanent solution that doesn't involve adding a configuration option?
The commit message "disableOnFail" doesn't really describe what the change does. Could you re-name the commit message to be more descriptive? It sounds like we are trying to solve the problem of recovering from network problems.
We want to be careful not to add unnecessary configuration options. In this case, my intuition is that we can build a solution that just works for all users. What would be a permanent solution that doesn't involve adding a configuration option?
For zero configuration, we can remove disableOnFail
and make sure RUM only disables when dispatch fails with status code 403 or 404, and maybe 400. In the long run, we can implement offline support for more granular control.
To add we should not disable RUM for 401 (Unauthorized error) , I had the usecase where due to invalid/expired token , fetch is not able to redirect to token provider website and refresh the token.
To add we should not disable RUM for 401 (Unauthorized error) , I had the usecase where due to invalid/expired token , fetch is not able to redirect to token provider website and refresh the token.
Looks like our docs may have to be updated. Another definition could be: disable on everything that isn't 429 or 5xx.
Revision: I have updated the PR so that RUM only disables when Dispatch fails with 403 or 404. Those are the only status codes in which we know RUM will never succeed. In all other cases, RUM should stay alive.
So now, if my understanding is correct, the behavior is to:
I have two questions.
Those questions aren't blocking. I think this is a good change and can be merged.
When a dispatch fails (it gets three attempts), then the payload is dropped.
Almost, RUM only retries on 429 and 5xx. Then the payload is dropped.
When would we receive 401 (Unauthorized) vs 403 (Forbidden)? Why is one recoverable, but not the other?
RUM is not sending 401 status codes, so I didn't include it. But I can include that in case someone's proxy uses that code.
Long-term, is dropping payloads the behavior we want?
Dropping payloads is not the behavior we want long term, but I haven't implemented it because there are still open questions:
A. How should dropped events be reinserted into the queue and what priority should they be given against the limit?
B. Should customers be allowed to choose an offline strategy? e.g. Reinsert into the queue OR Write to localStorage.
C. What additional metadata can we provide to site operators to root cause "offline" events which will likely have more errors? (This can probably be resolved by providing the Navigator.onLine
attribute)
To add we should not disable RUM for 401 (Unauthorized error) , I had the usecase where due to invalid/expired token , fetch is not able to redirect to token provider website and refresh the token.
When would we receive 401 (Unauthorized) vs 403 (Forbidden)? Why is one recoverable, but not the other?
Added 401!
Looks like I missed this, although I already pointed out why we are still disabling the RUM for 401 code, 401 is a valid usecase where the proxy server might return with (Unauthorized Token) and browser might retry later with a valid token. Please help to remove 401.
Addresses an immediate problem in #521, highlighted in https://github.com/aws-observability/aws-rum-web/issues/521#issuecomment-1993613490
To summarize, RUM currently disables itself when the attempt to PutRumEvents (PRT) fails and all retries have been exhausted. The intention was to avoid giving the web application needless errors when we know that RUM the network connection is disabled anyways.
The problem is that sometimes the network connection recovers, and RUM will experience data loss because it is needlessly disabled. This PR solves that problem by adding configuration for
disableOnFail
(default true). When site operator decides to set this to false, then RUM will not disable on PRT failure, and be able to resume sending events when the network connection recovers.This PR should be shipped out with #500 and #501 to limit the number of unnecessary retries and errors.
In the long run, RUM should listen for
online
andoffline
DOM events (MDN docs). When offline, we should write events to localStorage instead of the PRT endpoint. When online,Dispatch
should attempt to combine the current batch with events recorded offline.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.