Expensify / App

Welcome to New Expensify: a complete re-imagination of financial collaboration, centered around chat. Help us build the next generation of Expensify by sharing feedback and contributing to the code.
https://new.expensify.com
MIT License

[Refactor] Network Request Queue #3608

Closed kidroca closed 3 years ago

kidroca commented 3 years ago

If you haven’t already, check out our contributing guidelines for onboarding and email contributors@expensify.com to request to join our Slack channel!


Details

ATM the network request queue runs relentlessly using a setInterval of 1 second. Every second it checks whether there are any requests and sends them. There are long periods of time (e.g. 10 or more seconds) where no requests are scheduled, but the interval still runs, causing interrupts that affect the main thread.

https://github.com/Expensify/Expensify.cash/blob/1af601ee7947ce6b0027ace470cbe9ea406480bb/src/libs/Network.js#L169-L170

There are also cases where a request becomes unresolvable: it will fail over and over, and the queue will resend it over and over in an endless loop.


Proposal

The code can be refactored so that the queue stops running when it's not necessary.

  1. Instead of pushing to a queue, post performs the request directly; if it happens to fail, the same request is made again either after a fixed timeout (setTimeout) or after an event that gives the green light

Handling would become simpler as well as more performant. Here's the gist of it:

function post(params) {
  return HttpUtils.xhr(params)
     .then(response => onResponse(response))
     .catch(error => {
        onError(error);
        return waitForRetry(1000).then(() => post(params));
     });
}

// This can wait for a green light event or be time based
function waitForRetry(ms) {
  return new Promise(res => setTimeout(res, ms));
}

The point is that, besides making things simpler, we no longer have to use an interval timer that causes unnecessary interrupts every second; we only run when we actually need to retry something.

  2. Making waitForRetry event driven: when we're offline or the queue is paused for some reason, we can capture a deferred promise. Then we don't have to retry after a second, but when the deferred promise is resolved. The promise is resolved when we're both back online and not paused
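The event-driven waitForRetry could look something like this minimal sketch; the shared deferred object and the giveGreenLight() wiring are assumptions for illustration, not existing code:

```javascript
// Hypothetical sketch of an event-driven waitForRetry; the deferred
// object and giveGreenLight() are assumptions, not existing code.
let deferred = null;

function waitForRetry() {
    if (!deferred) {
        deferred = {};
        deferred.promise = new Promise((resolve) => {
            deferred.resolve = resolve;
        });
    }

    // Every caller waits on the same promise until the green light
    return deferred.promise;
}

// Call this when we're both back online and the queue is not paused
function giveGreenLight() {
    if (deferred) {
        deferred.resolve();
        deferred = null;
    }
}
```

All pending retries share a single promise, so one green-light event releases every waiting request at once.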

  3. Adding a retry limit: the post function can be set up with a maxRetries parameter that helps stop the recursion if we're getting into an endless retry cycle for some reason

function post(params, maxRetries = 10) {
  if (maxRetries === 0) {
    const error = new Error('A request reached the max retries limit');
    error.request = params;

    return Promise.reject(error);
  }

  return HttpUtils.xhr(params)
     .then(response => onResponse(response))
     .catch(error => {
        onError(error);
        return waitForRetry(1000).then(() => post(params, maxRetries - 1));
     });
}

We start with a default value and decrement until we reach 0; if for some reason the request won't go through, we stop retrying.
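For illustration, here is a self-contained, runnable version of the retry limit with a stubbed request; fakeXhr, the "fail twice then succeed" behaviour, and the short 5 ms wait are made up for the example:

```javascript
// Runnable sketch of the retry limit; fakeXhr and its failure pattern
// are stand-ins for HttpUtils.xhr, invented for this example.
function waitForRetry(ms) {
    return new Promise(res => setTimeout(res, ms));
}

let attempts = 0;
function fakeXhr() {
    attempts += 1;

    // Fail the first two attempts, succeed on the third
    return attempts < 3
        ? Promise.reject(new Error('network error'))
        : Promise.resolve({jsonCode: 200});
}

function post(params, maxRetries = 10) {
    if (maxRetries === 0) {
        const error = new Error('A request reached the max retries limit');
        error.request = params;
        return Promise.reject(error);
    }

    return fakeXhr(params)
        .catch(() => waitForRetry(5).then(() => post(params, maxRetries - 1)));
}
```

Here the third attempt succeeds well inside the limit of 10, so the rejection branch never fires; lowering maxRetries to 2 would surface the max-retries error instead.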

Related slack thread: https://expensify.slack.com/archives/C01GTK53T8Q/p1623747094347300

Platform:

Where is this issue occurring?

Web ✔️ iOS ✔️ Android ✔️ Desktop App ✔️ Mobile Web ✔️

Version Number: 1.0.69-0
Logs: https://stackoverflow.com/c/expensify/questions/4856


MelvinBot commented 3 years ago

Triggered auto assignment to @kadiealexander (AutoAssignerTriage), see https://stackoverflow.com/c/expensify/questions/4749 for more details.

kidroca commented 3 years ago

Optional future improvements

  1. If a request can't complete within our maxRetries, it's a sign that something is really wrong, and we can signal another internal system to take over
  if (maxRetries === 0) {
    const error = new Error('A request reached the max retries limit');
    error.request = params;

    onMaxRetries(params);

    return Promise.reject(error);
  }

The onMaxRetries handler is out of the current scope, but the idea is that it will, for example, signal another internal system to take over.

  2. Capturing running requests: we can capture pending requests
    • for debugging purposes: see the list of pending requests at a given time
    • if it's ever the case that a request is repeated while a 100% match is still running, we can hook the incoming request up to the pending request
function post(params) {
  // actual matching logic might be more complex than this
  const pending = pendingRequests.find(r => _.isEqual(r.params, params));

  if (pending) {
    return pending.promise;
  }

  const entry = {params};
  entry.promise = HttpUtils.xhr(params)
     .then(response => onResponse(response))
     .catch(error => {
        onError(error);
        return waitForRetry(1000).then(() => post(params));
     })
     // clean up the completed request from our pending list
     .finally(() => pendingRequests = pendingRequests.filter(r => r !== entry));

  pendingRequests.push(entry);

  return entry.promise;
}

I don't know if it's ever the case that an identical request is made at the same time, but this could be helpful for debugging.
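As a minimal self-contained sketch of the de-duplication idea (with a stubbed xhr and simplified JSON-string matching in place of a deep compare):

```javascript
// Illustrative sketch of pending-request de-duplication; fakeXhr and the
// JSON-string key are simplified stand-ins, not the real matching logic.
let pendingRequests = [];

const fakeXhr = params => new Promise(resolve => setTimeout(() => resolve({echo: params}), 5));

function post(params) {
    const key = JSON.stringify(params);
    const pending = pendingRequests.find(r => r.key === key);

    if (pending) {
        // An identical request is already in flight; reuse its promise
        return pending.promise;
    }

    const entry = {key};
    entry.promise = fakeXhr(params)
        // clean up the completed request from our pending list
        .finally(() => { pendingRequests = pendingRequests.filter(r => r !== entry); });

    pendingRequests.push(entry);

    return entry.promise;
}
```

Two identical calls made while the first is still in flight return the very same promise, and the pending list drains itself as requests settle.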

MelvinBot commented 3 years ago

Triggered auto assignment to @alex-mechler (Engineering), see https://stackoverflow.com/c/expensify/questions/4319 for more details.

alex-mechler commented 3 years ago

cc @marcaaron and @tgolen since you worked a lot with the network queue early on. I remember we had reasons for going with a queue, rather than a direct approach like this, but they are escaping me atm. Do you remember?

marcaaron commented 3 years ago

The main reasons for using the queue (as far as I know) are to support the offline features. So, if you are offline the request you attempted to make is placed in the queue and then processed once you are back online.

That said, I can't really think of any benefit to having a write queue that is getting processed in a loop every second (even when empty). I have some other priorities to get to and can't get too involved atm, but @cead22 might also have some opinions here.

cead22 commented 3 years ago

The main reasons for using the queue (as far as I know) are to support the offline features. So, if you are offline the request you attempted to make is placed in the queue and then processed once you are back online.

This is correct

but the interval will still run causing interrupts affecting the main thread

What are the real-world user-facing consequences of this? afaik this doesn't slow anything down. In fact, we check if we're online and whether the network queue has anything in it, and if not, we return early. I imagine this whole code runs in single-digit milliseconds, if not a fraction of a millisecond:

function processNetworkRequestQueue() {
    // NetInfo tells us whether the app is offline
    if (isOffline) {
        if (!networkRequestQueue.length) {
            return;
        }

        // If we have a request then we need to check if it can be persisted in case we close the tab while offline
        const retryableRequests = _.filter(networkRequestQueue, request => (
            !request.data.doNotRetry && request.data.persist
        ));
        Onyx.set(ONYXKEYS.NETWORK_REQUEST_QUEUE, retryableRequests);
        return;
    }

    // When the queue length is empty an early return is performed since nothing needs to be processed
    if (networkRequestQueue.length === 0) {
        return;
    }

I can't trace precisely when this happens, but the proposal below can detect this and allow us to do something about it

Can you reproduce?

The promise is resolved when we're both back online and not paused

This logic is also built into the network queue

Adding a retry limit

This logic can be added to the network queue.


I'm not a main contributor to this repo, but I don't agree with the problem, so this refactoring doesn't seem worth it with the information I have so far.

kidroca commented 3 years ago

Let's forget the performance implications

The current pattern creates unnecessary complexity which gets in the way of writing straightforward code. Take pauseRequestQueue and unpauseRequestQueue, for example: you'd think they just pause/unpause the setInterval, but they don't, because some requests still need to run even on a paused queue. This is again something that can be handled much more gracefully with the promise pattern I've posted.
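For instance, pause/unpause could be modelled as a promise gate that selected requests are allowed to skip; sendRequest and the ignorePause option here are illustrative stand-ins, not existing code:

```javascript
// Hypothetical pause/unpause built on a promise gate; sendRequest and
// the ignorePause option are invented for this sketch.
const sendRequest = params => Promise.resolve({ok: true, params});

let pauseGate = Promise.resolve(); // open by default
let openGate = null;

function pauseRequestQueue() {
    pauseGate = new Promise((resolve) => { openGate = resolve; });
}

function unpauseRequestQueue() {
    if (openGate) {
        openGate();
        openGate = null;
    }
}

function post(params, {ignorePause = false} = {}) {
    // Requests that must run even while paused simply skip the gate
    const ready = ignorePause ? Promise.resolve() : pauseGate;
    return ready.then(() => sendRequest(params));
}
```

Gated requests just sit on an unresolved promise instead of being re-queued every second, and the "must run while paused" case is a one-line opt-out rather than special-case logic in the queue loop.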

When the queue is "paused", the early returns don't take effect, and it constantly iterates over and re-queues the pending requests; hence the 200+ retry attempts I posted about here: https://expensify.slack.com/archives/C01GTK53T8Q/p1623747094347300

setInterval is dangerous (there are numerous write-ups about its pitfalls), and I would avoid it when a more straightforward alternative exists.
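One generic alternative (a sketch, not taken from the codebase) is a self-scheduling setTimeout that only re-arms itself while there is work, so it never overlaps a slow iteration and goes fully idle otherwise; the queue, the 10 ms delay, and the task shape are invented for the example:

```javascript
// Generic sketch: a self-scheduling setTimeout instead of a fixed
// setInterval. The queue, delay, and task shape are made up here.
const queue = [];
let timer = null;

function scheduleProcessing() {
    if (timer || queue.length === 0) {
        return; // already scheduled, or nothing to do: stay idle
    }
    timer = setTimeout(() => {
        timer = null;
        queue.splice(0).forEach(task => task());
        scheduleProcessing(); // re-arm only if new work arrived meanwhile
    }, 10);
}

function enqueue(task) {
    queue.push(task);
    scheduleProcessing();
}
```

Unlike setInterval, the next tick is only scheduled after the previous one finishes, and when the queue drains no timer exists at all.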


There are also cases where a request will become unresolvable: it will fail over and over, and the queue will send it over and over in an endless loop. I can't trace precisely when this happens, but the proposal below can detect this and allow us to do something about it

Can you reproduce?

No. I've seen it happen several times, and others have reported similar issues. It might not be caused by the queue, but a retry limit can help investigate the root issue.

tgolen commented 3 years ago

I've kind of been following this issue, and I went and re-read the original thread in Slack that spawned this issue.

I think this is chasing a lot of theoretical problems without getting crisp on the specific problem (this seems to be a repeating theme).

The original assumption was:

Many retries on startup are causing performance problems on Android

However, this was just conjecture and was only theoretical. Let's start here. If that's the hypothesis, then let's test it out and ensure that this is indeed the problem. Without proving that it's a problem (or the size of the problem), then chasing after solutions isn't going to benefit anyone.

marcaaron commented 3 years ago

Going to close this in favor of https://github.com/Expensify/Expensify.cash/issues/4026

It seems like we know there is some kind of regression with the network request queue, but have no clear reproduction. It also seems like the changes here might be a potential solution to that. But we should focus on the issue that has been reported rather than put the solution first. Thanks!

kidroca commented 3 years ago

When we added the Flipper bridge spy, I noticed that all timers (setTimeout/setInterval) make an inbound and an outbound call through the React Native bridge each time a timer runs.

The more interval timers we have, the more calls happen through the bridge. Ideally we want to reduce calls through the bridge as much as possible, as that's the weak spot for React Native. That might change when JSI becomes the norm and there's no bridge.

marcaaron commented 3 years ago

Ideally we want to reduce calls through the bridge as much as possible as that's the weak spot for React Native

I've thought about this before as well, but can't really think of what problem the "bridge noise" is contributing to. It does make it kind of hard to spy the bridge traffic since the timers are relentlessly ticking. But what's bad about it?

kidroca commented 3 years ago

I've thought about this before as well, but can't really think of what problem the "bridge noise" is contributing to. It does make it kind of hard to spy the bridge traffic since the timers are relentlessly ticking. But what's bad about it?

Imagine it's literally a bottleneck: it can only allow as much traffic as the width of the cap. When a lot of items move through the bridge, some get batched and have to wait because there's not enough bandwidth, and this creates lag. This happens because all the calls get serialised and deserialised to be passed to/from the native world. The re-architecture of React Native aims to replace the bridge, or at least allow libraries like Reanimated to communicate directly with the JS thread without the need for a syncing bridge.

The network queue is not the only interval timer we have; together they take a hit on that bridge bandwidth every second.

I'm not certain how we can measure the effect of this. The theory is that in moments of high intensity these extra calls negatively affect performance; or maybe they just add a small hit every second that isn't noticeable.

kidroca commented 3 years ago

One other thing in this regard: there's no idle time, because something is always running.

You may have noticed that timers on inactive browser tabs get throttled; I think one of the reasons for that was significant battery usage due to constantly running code. I remember reading about this somewhere, but so far this is the only thing that comes up: https://stackoverflow.com/questions/11788928/how-much-battery-life-can-setinterval-suck-up