Expensify / App

Welcome to New Expensify: a complete re-imagination of financial collaboration, centered around chat. Help us build the next generation of Expensify by sharing feedback and contributing to the code.
https://new.expensify.com
MIT License

[HOLD for payment 2024-07-24] [$250] `ping` and `ReconnectApp` are called back to back on a bad wifi network #44269

Open m-natarajan opened 4 weeks ago

m-natarajan commented 4 weeks ago

If you haven’t already, check out our contributing guidelines for onboarding and email contributors@expensify.com to request to join our Slack channel!


Version Number:
Reproducible in staging?: needs reproduction
Reproducible in production?: needs reproduction
If this was caught during regression testing, add the test name, ID and link from TestRail:
Email or phone of affected tester (no customers):
Logs: https://stackoverflow.com/c/expensify/questions/4856
Expensify/Expensify Issue URL:
Issue reported by: @quinthar
Slack conversation: https://expensify.slack.com/archives/C05LX9D6E07/p1719023935665339

Action Performed:

  1. Be on a bad WiFi network
  2. Open the app

Expected Result:

Ping and ReconnectApp should not be called several times.

Actual Result:

On a really bad WiFi network, where the app concludes it's online but for some reason can't contact the server, it just hammers Ping and ReconnectApp back to back, filling the network queue with tons of parallel unfinished commands.

Workaround:

unknown

Platforms:

Which of our officially supported platforms is this issue occurring on?

Screenshots/Videos

image (16)

bugd.txt

image (17)


Upwork Automation - Do Not Edit
  • Upwork Job URL: https://www.upwork.com/jobs/~01f591c76409c7f4d0
  • Upwork Job ID: 1805714893198480172
  • Last Price Increase: 2024-06-25
Current Issue Owner: @kadiealexander
melvin-bot[bot] commented 4 weeks ago

Triggered auto assignment to @kadiealexander (Bug), see https://stackoverflow.com/c/expensify/questions/14418 for more details. Please add this bug to a GH project, as outlined in the SO.

melvin-bot[bot] commented 4 weeks ago

Triggered auto assignment to @srikarparsi (AutoAssignerNewDotQuality)

srikarparsi commented 3 weeks ago

Making this external to see if there's a reliable way to reproduce, the root cause, and proposals to fix.

melvin-bot[bot] commented 3 weeks ago

Job added to Upwork: https://www.upwork.com/jobs/~01f591c76409c7f4d0

melvin-bot[bot] commented 3 weeks ago

Triggered auto assignment to Contributor-plus team member for initial proposal review - @rushatgabhane (External)

quinthar commented 3 weeks ago

This feels extremely easy to reproduce: just close your laptop lid for a few minutes, and reopen.

srikarparsi commented 3 weeks ago

I just tried this (closing laptop and reopen) and this was my network tab:

image

Two pings were called, which isn't as many as in your screenshot here, but they are still back-to-back pings, which shouldn't happen.

We already have a check to make sure that we don't send a Ping command when one is pending and this seems to be working because I don't see [NetworkConnection] recheck NetInfo in the console.

So I have two theories:

  1. The component is being re-rendered and the state of isOffline or hasPendingNetworkCheck is getting reset, so NetInfo is calling Ping again? I don't think this is the cause, but I think these variables should be wrapped in a useRef since they persist for the lifetime of the component.
  2. We don't set or check hasPendingNetworkCheck inside of NetInfo, so duplicate calls could be made there. I think this one's more likely: a network check fails, so isOffline is being set to false. Then NetInfo checks again, but so does recheckNetworkConnection, since hasPendingNetworkCheck isn't set by NetInfo.
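To illustrate the second theory, here is a minimal sketch of a shared in-flight guard. It is not the app's actual code: `createSingleFlight` is a hypothetical helper, and the idea is only that every entry point (the NetInfo listener and the periodic recheck) goes through the same guarded function, so a second caller reuses the pending check instead of starting a new one.

```typescript
// Hypothetical helper (not part of the Expensify codebase or NetInfo):
// wraps an async task so that overlapping callers share one pending request.
function createSingleFlight<T>(task: () => Promise<T>): () => Promise<T> {
    // Holds the pending promise while a check is running, otherwise null.
    let inFlight: Promise<T> | null = null;

    return () => {
        if (inFlight) {
            // A check is already pending: reuse it rather than firing another.
            return inFlight;
        }
        const current = task().finally(() => {
            // Clear the slot on success and on failure so later checks can run.
            inFlight = null;
        });
        inFlight = current;
        return current;
    };
}
```

If both the listener and the interval call the function returned by `createSingleFlight`, the flag can no longer be bypassed by one of the two paths, which is the gap described in theory 2.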

I'm still looking into these but they are my initial thoughts based on the code. cc @roryabraham and @adhorodyski if you have any additional thoughts since you guys worked on these PRs to introduce NetInfo and periodic checks.

adhorodyski commented 3 weeks ago

@srikarparsi you're correct about the periodic check.

The call itself feels solid: it should bail out as long as the function's early return kicks in (which, from the logs, looks fine, as there is no subsequent recheck NetInfo).

If hasPendingNetworkCheck is reliable, this periodic check should cause us no harm (but that's an assumption).

adhorodyski commented 3 weeks ago

One higher-level problem I see with this implementation is that it's really, really imperative, so it's easy to make a mistake and cause such behaviour over time. Declarative APIs work better, especially with React codebases, and there are open source libraries to solve just that.

srikarparsi commented 3 weeks ago

I created this PR to check if a network check is pending before starting a new one. Still need to test but I think this would be a quick way to stop repetitive calls. @adhorodyski if you could take a look at it as well that would be appreciated.

I also agree that our current implementation might not be the best way of doing it. NetInfo has parameters that we seem to be implementing in a custom way. For example, NetInfo already has reachabilityShortTimeout and reachabilityLongTimeout, which default to 5s and 60s. So when the internet is not detected, it should recheck the connection every 5s, and when it is detected, it should recheck every 60s. But we had to re-add the 60s check in this PR, so I think there might just be something wrong with our current implementation which we need to fix.
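The recheck cadence described above can be modelled in a few lines. This is a sketch of the behaviour as stated in this comment, not NetInfo's internals: the option names and defaults come from the comment, while the `ReachabilityTimeouts` type and the `nextCheckDelay` helper are assumptions made for illustration.

```typescript
// Local stand-in for the two NetInfo options named above (values in milliseconds).
type ReachabilityTimeouts = {
    reachabilityShortTimeout: number; // used while the internet is NOT detected
    reachabilityLongTimeout: number; // used while the internet IS detected
};

// Defaults as quoted in the thread: 5s and 60s.
const DEFAULT_TIMEOUTS: ReachabilityTimeouts = {
    reachabilityShortTimeout: 5 * 1000,
    reachabilityLongTimeout: 60 * 1000,
};

// Hypothetical helper: how long to wait before the next reachability check.
function nextCheckDelay(
    isInternetReachable: boolean,
    timeouts: ReachabilityTimeouts = DEFAULT_TIMEOUTS,
): number {
    return isInternetReachable
        ? timeouts.reachabilityLongTimeout
        : timeouts.reachabilityShortTimeout;
}
```

If the library already schedules checks this way, a custom 60s interval on top of it means two independent timers can fire close together, which is one plausible source of duplicate calls.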

OlimpiaZurek commented 3 weeks ago

I wasn’t able to reproduce this issue by closing and opening the laptop lid. Every time I tried, the Ping and ReconnectApp methods were only called once. However, based on the code and description provided, it appears that the problem is related to the way the app handles network checks and reconnections. When an app determines that it is online but cannot connect to the server, it initiates multiple Ping and ReconnectApp requests simultaneously. This leads to high amounts of network traffic and unfinished commands. Reconnect logic does not control or limit the number of reconnect attempts. This can be problematic in environments with poor network conditions, leading to a constant flood of network activity.

Given this, I think adding this additional check makes sense as it ensures that a new network check only starts if there isn't already one in progress.
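For context on limiting reconnect attempts, one common technique is capped exponential backoff. This is not what the thread proposes (the check discussed here is a pending-request guard); it is only a sketch of how the delay between attempts could grow so that a bad network is not flooded. The function name, base delay, and cap are all assumptions.

```typescript
// Hypothetical helper: delay before reconnect attempt number `attempt` (0-based).
// The delay doubles on each attempt and is capped at `maxMs`.
function reconnectDelayMs(
    attempt: number,
    baseMs: number = 1000,
    maxMs: number = 60000,
): number {
    // Guard against negative or fractional attempt counts.
    const safeAttempt = Math.max(0, Math.floor(attempt));
    return Math.min(maxMs, baseMs * Math.pow(2, safeAttempt));
}
```

With the assumed defaults, the first retry waits 1s, the fourth waits 8s, and every retry from the seventh onward waits the 60s cap.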

OlimpiaZurek commented 3 weeks ago

The change from this PR seems to cause a regression.

Overall, I agree with Adam that we should adopt a more declarative approach to handling network connections. Currently, we are using an imperative approach, which seems error-prone. For example, the recheckNetworkConnection function is used both as middleware and in an interval, which risks errors and duplicate calls.

NetInfo provides built-in configuration options for re-checking the connection: reachabilityShortTimeout, the interval between checks when the Internet is not detected (5 seconds by default), and reachabilityLongTimeout, the interval between checks when the Internet is connected (60 seconds by default). These built-in mechanisms are designed to handle network rechecks reliably.

Given the complexity of our custom implementation, it's challenging to determine if the root cause of this issue is due to NetInfo or our custom logic. Therefore, maybe we should consider removing the custom recheckNetworkConnection solution and relying solely on NetInfo's built-in functionality? This approach simplifies our codebase and leverages the library's tested and optimized features.

To ensure this change meets our needs, I’d suggest double-checking that it provides the required functionality.

Given the difficulty in reproducing the issue, I believe we should conduct thorough testing to ensure that NetInfo's built-in mechanisms handle all necessary scenarios and edge cases.

To achieve this, we need to confirm which specific functionalities we want to test and verify.

Here are some examples:

srikarparsi commented 2 weeks ago

> Therefore, maybe we should consider removing the custom recheckNetworkConnection solution and relying solely on NetInfo's built-in functionality?

I agree with this. And if it doesn't work and we verify that it's not a problem with our implementation, then I think it's better to make the fix upstream in NetInfo.

I think this should be the first step so I'll close this PR. @OlimpiaZurek let me know if I can do anything to help you with this.

OlimpiaZurek commented 2 weeks ago

I prepared a PR to remove the hasPendingNetworkCheck flag.

I also prepared a PR with a fix for the NetInfo library.

muttmuure commented 2 weeks ago

Thanks for the update!

melvin-bot[bot] commented 6 days ago

Reviewing label has been removed, please complete the "BugZero Checklist".

melvin-bot[bot] commented 6 days ago

The solution for this issue has been :rocket: deployed to production :rocket: in version 9.0.7-8 and is now subject to a 7-day regression period :calendar:. Here is the list of pull requests that resolve this issue:

If no regressions arise, payment will be issued on 2024-07-24. :confetti_ball:

For reference, here are some details about the assignees on this issue:

melvin-bot[bot] commented 6 days ago

BugZero Checklist: The PR fixing this issue has been merged! The following checklist (instructions) will need to be completed before the issue can be closed:

rushatgabhane commented 15 hours ago

  1. The PR that introduced the bug has been identified. Link to the PR: N.A. This was always there

  2. The offending PR has been commented on, pointing out the bug it caused and why, so the author and reviewers can learn from the mistake. Link to comment: N.A.

  3. A discussion in #expensify-bugs has been started about whether any other steps should be taken (e.g. updating the PR review checklist) in order to catch this type of bug sooner. Link to discussion: N.A.

  4. Determine if we should create a regression test for this bug. Yes!

  5. If we decide to create a regression test for the bug, please propose the regression test steps to ensure the same bug will not reach production again

     1. Go offline
     2. Go to network tab in browser
     3. Verify that `openApp` isn't repeatedly called
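The verification step above could be automated along these lines. This is only a sketch: it assumes the command names seen in the network tab have already been collected into an array, and `findRepeatedCommands` is a hypothetical helper rather than an existing test utility.

```typescript
// Hypothetical helper: given the command names observed in the network tab,
// return those that were sent more than `maxAllowed` times.
function findRepeatedCommands(commands: string[], maxAllowed: number = 1): string[] {
    const counts = new Map<string, number>();
    for (const command of commands) {
        counts.set(command, (counts.get(command) ?? 0) + 1);
    }
    // Keep only the commands whose count exceeds the allowed maximum.
    return Array.from(counts)
        .filter(([, count]) => count > maxAllowed)
        .map(([command]) => command);
}
```

A regression check would then assert that the returned list is empty for the commands recorded while offline.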