aws-observability / aws-rum-web

Amazon CloudWatch RUM Web Client
Apache License 2.0

feat: keep alive when dispatch fails #524

Closed williazz closed 3 months ago

williazz commented 3 months ago

Addresses an immediate problem in #521, highlighted in https://github.com/aws-observability/aws-rum-web/issues/521#issuecomment-1993613490

To summarize, RUM currently disables itself when an attempt to PutRumEvents (PRT) fails and all retries have been exhausted. The intention was to avoid surfacing needless errors to the web application when we know the network connection is down anyway.

The problem is that sometimes the network connection recovers, and RUM experiences data loss because it is needlessly disabled. This PR solves that problem by adding a disableOnFail configuration option (default true). When the site operator sets it to false, RUM will not disable itself on PRT failure and can resume sending events when the network connection recovers.
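A minimal sketch of how such a flag might be wired in, assuming a simple merge-over-defaults config pattern (the interface and names here are hypothetical, not the actual aws-rum-web config shape):

```typescript
// Hypothetical sketch of the proposed flag; the real aws-rum-web
// configuration interface may differ.
interface DispatchConfig {
    retries: number;
    disableOnFail: boolean; // proposed flag, default true
}

const defaultConfig: DispatchConfig = {
    retries: 2,
    disableOnFail: true
};

// Merge site-operator overrides over the defaults.
function resolveConfig(overrides: Partial<DispatchConfig>): DispatchConfig {
    return { ...defaultConfig, ...overrides };
}
```

With this shape, `resolveConfig({})` keeps the current disable-on-failure behavior, while `resolveConfig({ disableOnFail: false })` opts a site into staying alive through PRT failures.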

This PR should be shipped out with #500 and #501 to limit the number of unnecessary retries and errors.

In the long run, RUM should listen for the online and offline DOM events (MDN docs). When offline, we should write events to localStorage instead of sending them to the PRT endpoint. When back online, Dispatch should attempt to combine the current batch with the events recorded offline.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

qhanam commented 3 months ago

We want to be careful not to add unnecessary configuration options. In this case, my intuition is that we can build a solution that just works for all users. What would be a permanent solution that doesn't involve adding a configuration option?

qhanam commented 3 months ago

The commit message "disableOnFail" doesn't really describe what the change does. Could you rename the commit message to be more descriptive? It sounds like we are trying to solve the problem of recovering from network problems.

williazz commented 3 months ago

> We want to be careful not to add unnecessary configuration options. In this case, my intuition is that we can build a solution that just works for all users. What would be a permanent solution that doesn't involve adding a configuration option?

For zero configuration, we can remove disableOnFail and make sure RUM only disables when dispatch fails with status code 403 or 404, and maybe 400. In the long run, we can implement offline support for more granular control.

gmayankcse15 commented 3 months ago

To add: we should not disable RUM on 401 (Unauthorized). I had a use case where, due to an invalid/expired token, fetch was not able to redirect to the token provider website and refresh the token.

williazz commented 3 months ago

> To add: we should not disable RUM on 401 (Unauthorized). I had a use case where, due to an invalid/expired token, fetch was not able to redirect to the token provider website and refresh the token.

Looks like our docs may have to be updated. Another definition could be: disable on everything that isn't 429 or 5xx.

williazz commented 3 months ago

Revision: I have updated the PR so that RUM only disables itself when Dispatch fails with a 403 or 404. Those are the only status codes for which we know RUM will never succeed. In all other cases, RUM should stay alive.
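The disable decision described here reduces to a small predicate, sketched below (hypothetical helper names, not the actual aws-rum-web source):

```typescript
// Status codes on which the web client gives up permanently:
// the request is forbidden or the endpoint does not exist,
// so dispatch can never succeed.
const FATAL_STATUS_CODES = new Set([403, 404]);

function shouldDisable(statusCode: number): boolean {
    return FATAL_STATUS_CODES.has(statusCode);
}
```

Under this definition, transient failures such as 429, 500, or 503, and auth hiccups like 401, leave the client alive to try again on the next dispatch cycle.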

qhanam commented 3 months ago

So now, if my understanding is correct, the behavior is to:

  1. When a dispatch fails (it gets three attempts), then the payload is dropped.
  2. If the status code is 403 (Forbidden) or 404 (Not Found), the web client disables itself.
  3. Otherwise, the web client will continue as normal, which is to attempt dispatch every 5 seconds.

I have two questions.

  1. When would we receive 401 (Unauthorized) vs 403 (Forbidden)? Why is one recoverable, but not the other?
  2. Long-term, is dropping payloads the behavior we want? We could add them back to the cache. Saving to localStorage is another option. Although both of those might just be adding unnecessary complexity.

Those questions aren't blocking. I think this is a good change and can be merged.

williazz commented 3 months ago

> When a dispatch fails (it gets three attempts), then the payload is dropped.

Almost: RUM only retries on 429 and 5xx. Then the payload is dropped.
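That retry criterion can be sketched as a one-line predicate (hypothetical helper, shown only to pin down the rule):

```typescript
// Retry only on throttling (429) or server-side errors (5xx);
// any other failure drops the payload without retrying.
function isRetryable(statusCode: number): boolean {
    return statusCode === 429 || (statusCode >= 500 && statusCode < 600);
}
```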

> When would we receive 401 (Unauthorized) vs 403 (Forbidden)? Why is one recoverable, but not the other?

RUM is not sending 401 status codes, so I didn't include it. But I can include it in case someone's proxy uses that code.

> Long-term, is dropping payloads the behavior we want?

Dropping payloads is not the behavior we want long term, but I haven't implemented it because there are still open questions:

A. How should dropped events be reinserted into the queue, and what priority should they be given against the limit?
B. Should customers be allowed to choose an offline strategy? e.g. reinsert into the queue OR write to localStorage.
C. What additional metadata can we provide to site operators to root-cause "offline" events, which will likely have more errors? (This can probably be resolved by providing the Navigator.onLine attribute.)

williazz commented 3 months ago

> To add: we should not disable RUM on 401 (Unauthorized). I had a use case where, due to an invalid/expired token, fetch was not able to redirect to the token provider website and refresh the token.

> When would we receive 401 (Unauthorized) vs 403 (Forbidden)? Why is one recoverable, but not the other?

Added 401!

mayankcse15 commented 1 month ago

Looks like I missed this. Although I already pointed out the use case, we are still disabling RUM on a 401: the proxy server might return 401 (Unauthorized Token) and the browser might retry later with a valid token. Please remove 401 from the disable list.