firebase / firebase-js-sdk

Firebase Javascript SDK
https://firebase.google.com/docs/web/setup

starting a snapshot listener from afar causes problems #8451

Open michaelAtCoalesce opened 3 months ago

michaelAtCoalesce commented 3 months ago

Operating System

windows

Environment (if applicable)

chrome

Firebase SDK Version

10.13.0

Firebase SDK Product(s)

Firestore

Project Tooling

create-react-app example

Detailed Problem Description

i have a collection with ~50 megabytes of data across ~1500 documents.

when i try to start a listener while in north america (connecting to US firestore), it takes only 10 seconds to start the listener, and it completes 100% of the time.

image

when i turn on my india VPN (same machine, same code, only difference is routing through india), the listener never even completes.

i immediately get these errors -

image

on some machines from APAC region connecting to US firestore, i also get really poor behavior, and it never actually succeeds.

image

for what its worth - the connection FROM india to united states should easily be able to handle this...

image

Steps and code to reproduce issue

if someone wants to email me i can send them the info for the recreate. it's a few lines of code.

google-oss-bot commented 3 months ago

I couldn't figure out how to label this issue, so I've labeled it for a human to triage. Hang tight.

michaelAtCoalesce commented 3 months ago

based on previous issues @dconeybe might be a good person to look at this? Someone can email me at 'mx2323 <@> gmail.com' and I'll jump on a call to recreate.

It's very concerning behavior so we'd like to get this looked at ASAP.

michaelAtCoalesce commented 3 months ago

any updates? this is causing our application to not load.

wu-hui commented 3 months ago

Hey @michaelAtCoalesce ,

I suspect the issue here is that the bidirectional stream between the SDK and the backend does not work well when the network is not stable, especially when you need to load a lot of data over the wire. There are several things you can try:

  1. Can you create a test Firestore database in Asia to see if things improve?
  2. Try to always turn on long polling (https://firebase.google.com/docs/reference/js/firestore_.firestoresettings.md#firestoresettingsexperimentalforcelongpolling) and see if that helps.

@sampajano Do you have some other suggestions/ideas?
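
The long-polling suggestion above maps to the `experimentalForceLongPolling` field of `FirestoreSettings`. A minimal sketch, assuming the modular SDK, where the settings object is passed to `initializeFirestore(app, settings)`. The helper name `buildTransportSettings` is hypothetical and not part of the SDK:

```javascript
// Hypothetical helper (not an SDK function): returns a FirestoreSettings-style
// object. Pass the result to initializeFirestore(app, settings) from
// "firebase/firestore" before any other Firestore call is made.
function buildTransportSettings(forceLongPolling) {
  // The two flags are alternatives, so only one of them is set here.
  return forceLongPolling
    ? { experimentalForceLongPolling: true }
    : { experimentalAutoDetectLongPolling: true };
}

const settings = buildTransportSettings(true);
console.log(settings);
```

Keeping the choice in one helper makes it easy to switch between forced and auto-detected long polling when comparing load times.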

michaelAtCoalesce commented 3 months ago

hi @wu-hui,

  1. yes, this is definitely the case. as i said above, when i turn off the VPN (so connect from nearby) - the reliability is 100%
  2. i have forced long polling, but i am noticing that i see slower load times. more consistent, but slower load times.

the data here is only on the order of tens of megabytes - but it can take over a minute (and sometimes not load at all), whereas a closer connection will take ~10 seconds.

in conclusion: the speed test shows that this should work, so it seems there is work to do here for firestore to reliably support this kind of a connection... it shouldn't take over a minute to load when it takes 10 seconds in the ideal case and even the force long polling is slow.

i can reliably recreate this issue 100% of the time, within seconds. happy to hop on a call and share recreate details (or do it over email). you can email me at mx2323 <@> gmail.com

michaelAtCoalesce commented 3 months ago

why was the needs attention label removed and a needs-info label added? i believe ive given the information required and this is causing our production to not load.

DellaBitta commented 3 months ago

My mistake! I must have been looking at a stale page that I had loaded yesterday, sorry!

michaelAtCoalesce commented 3 months ago

from nearby: 11 seconds, from afar: 226 seconds

im uploading some firestore debug level logs of the degenerate case here.

aec2-38-34-123-154.ngrok-free.app-1724779566040.log
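
Timings like the ones above can be collected by measuring the time from subscribing to the first snapshot. A rough sketch, not the reporter's actual test code: `measureFirstSnapshot` is a hypothetical helper, and `subscribe` is assumed to wrap the SDK call as `(next, error) => onSnapshot(query, next, error)`.

```javascript
// Hypothetical helper: reports how long the first snapshot (or the first
// error) takes to arrive, then detaches the listener.
// subscribe(onNext, onError) must return an unsubscribe function.
function measureFirstSnapshot(subscribe, onDone, now = Date.now) {
  const startedAt = now();
  let finished = false;
  let unsubscribe = null;

  const finish = (result) => {
    if (finished) return; // only the first event is reported
    finished = true;
    if (unsubscribe) unsubscribe();
    onDone({ ...result, elapsedMs: now() - startedAt });
  };

  unsubscribe = subscribe(
    (snapshot) => finish({ ok: true, size: snapshot.size }),
    (error) => finish({ ok: false, error })
  );

  // Covers a subscribe() that invoked its callback before returning.
  if (finished && unsubscribe) unsubscribe();
}

// Usage with the SDK (assumes `q` is an existing Firestore query):
// measureFirstSnapshot((next, err) => onSnapshot(q, next, err), console.log);
```

The clock is injectable so the helper can be exercised without a network connection.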

michaelAtCoalesce commented 3 months ago

any updates?

MarkDuckworth commented 3 months ago

@michaelAtCoalesce, Thank you for providing the logs. I reviewed them and I don't see a clear indication of an issue in the SDK, however I have forwarded this to our backend team for review. Googlers see b/361143373

For what it's worth, on this behavior, you may get more frequent updates if you open a Firebase or Google Cloud support ticket rather than a GitHub issue on the SDK. However, we will update this GH issue when we learn more.

michaelAtCoalesce commented 3 months ago

@MarkDuckworth thanks for the update. wanted to add another data point in here. it appears to happen more frequently on windows. i have anecdotally noticed in my recreate case that on chrome on windows the default implementation is more likely to fail than chrome on macOS. if nothing else, it appears that windows is at least 3x as slow.

something appears to happen where the default implementation will start up, download a bit, then just hang there for tens of seconds or minutes and not do anything. when i turn on the experimental long polling option, it immediately goes back to working.

michaelAtCoalesce commented 2 months ago

its been 2 weeks.. any updates?

i was told by firebase support to try to paginate the snapshots. i tried that, and the performance did not improve and sometimes the listeners still never start. this appears to happen more frequently on windows. it also appears to happen even when the user is near the firestore location.

on my mac it'll take 12 seconds; the exact same page on a windows e2-standard-2 instance has 400 errors, 404 errors, and sometimes takes over 3 minutes for the same test case.

image

vdemko001 commented 2 months ago

Any updates? I'm getting the same error on Firebase SDK 7.24.0.

image

mx2323 commented 1 month ago

@MarkDuckworth I think the enabling of web channel in 10.14.0 is improving the performance of my standalone test case. Sounds like with large datasets and chunking there may have been a corruption causing memory issues as a symptom that was fixed.

We are still having an issue where inexplicably firestore will take 50 seconds within our app to listen on windows, but in the standalone test case on windows it’ll take 15 seconds, consistently. Our users frequently open and close listeners as part of their workflow so the performance is important for them.

once I get some more data will open a ticket for that

MarkDuckworth commented 1 month ago

@mx2323, thanks for the feedback on 10.14.0. It's good to hear it is helping you out.

Regarding the 50 second delay, is it consistently 50 seconds? Take a look at https://github.com/firebase/firebase-js-sdk/issues/8474, it could be related to that. Although we have only seen that delay at ~45 seconds. Your logs in https://github.com/firebase/firebase-js-sdk/issues/8451#issuecomment-2313138710 don't show similarity, but if you're seeing a consistent 45-50 second delay, it may be worth getting logs covering this timespan.

mx2323 commented 1 month ago

ok I’ll follow up soon. headed on a flight and will be unavailable.

michaelAtCoalesce commented 1 month ago

OK - debug logs attached

  1. standalone recreate on windows india vm - 40 seconds standalone-windows-40seconds.log

  2. in-app recreate on windows india vm - 75 seconds windows-insideapp-75seconds.log

    image

I think the absolute numbers changed here because of the added debug logging.

these are on the same machine, same browser too. just that one is a standalone one that does nothing else, the other is a version of our app that is very stripped down. i took a look at the logs with the 75 second issue, it looks like it spends ~25 seconds of that doing pretty much nothing with "detecting buffered proxy" resulting in "The Operation Could not be completed" a few times... then "detected no buffering proxy" prints and it immediately starts downloading and working as it should.

any explanation on why the same machine would sometimes say 'detect buffering proxy' and then other times not? what's odd is that in the standalone app, its never detecting a buffered proxy, but something about once the same code is running in our app, it'll detect a buffered proxy.

also - is there a log level that doesnt print as much but also tells us key information about whether a buffering proxy was detected? the logs when turned on slow things down so much it'd be good to have closer to the actual situation and less verbose logging.

i think what this is telling us is that this proxy detection code isn't working correctly on this machine
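
One way to pull only the proxy-detection messages out of an already captured debug log is a small filter like the one below. This is a sketch under assumptions: the pattern is built from the two phrasings quoted in this comment, the exact SDK log wording may differ, and `proxyDetectionLines` is a hypothetical name. It does not reduce the overhead of debug logging itself; it only shortens what has to be read afterwards.

```javascript
// Matches the two phrasings quoted in this thread ("buffered proxy" and
// "buffering proxy"). The exact SDK log wording is an assumption.
const PROXY_PATTERN = /buffer(?:ed|ing) proxy/i;

// Returns the matching lines of a captured log together with their
// 1-based line numbers.
function proxyDetectionLines(logText) {
  return logText
    .split(/\r?\n/)
    .map((text, index) => ({ lineNumber: index + 1, text }))
    .filter((entry) => PROXY_PATTERN.test(entry.text));
}
```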

michaelAtCoalesce commented 1 month ago

Is there someone I can send a recreate to? This behavior is very problematic for us and it’s been a month and a half. It’s a 20 megabyte collection with 2000 documents; this is well within what should be supported.

wu-hui commented 1 month ago

One way to share the reproduction is to create a private repo, and invite me and @sampajano to join.

Please provide clear instructions on how to reproduce, especially since this seems to involve some VM setup and VPNs.

Also, we have no control over VPN or public internet, speed test does not necessarily translate to actual connection. We will do our best to look into this, but it is certainly possible that this won't lead us anywhere.

google-oss-bot commented 1 month ago

Hey @michaelAtCoalesce. We need more information to resolve this issue but there hasn't been an update in 5 weekdays. I'm marking the issue as stale and if there are no new updates in the next 5 days I will close it automatically.

If you have more information that will help us get to the bottom of this, just add a comment!

google-oss-bot commented 1 month ago

Since there haven't been any recent updates here, I am going to close this issue.

@michaelAtCoalesce if you're still experiencing this problem and want to continue the discussion just leave a comment here and we are happy to re-open this.

michaelAtCoalesce commented 3 weeks ago

what we noticed and had filed a ticket for originally was that the client sdk was very unreliable in windows chrome browsers with slightly larger datasets.

We noticed that the backend Firestore client has a preferRest option that was much faster and more reliable. We noticed that although the rest api was faster, we still had issues where concurrent requests within the same backend process were slow. I filed a ticket about this here https://github.com/googleapis/nodejs-firestore/issues/2215. Ultimately I think the unzipping of hundreds of megabytes of data from the rest api is just slow and hogs the main thread because forking processes and executing on the same machine was parallel and fast.

After all this investigation, what we decided to do is reverse engineer with wireshark what the backend was doing with preferRest (since the runQuery docs are not helpful https://firebase.google.com/docs/firestore/reference/rest/v1beta1/projects.databases.documents/runQuery) and call that via the frontend directly. Then we have changed every write in our app to store server timestamps, and we changed our snapshot listener to filter for all values higher than the values seen than what we pulled down in the starting snapshot. We also implemented a separate collection so we can have deletes.

So far in our extensive testing the result has more reliable and consistent performance than the Firestore client web sdk. It's unfortunate that we had to do this, but I repeatedly reached out to Firestore team members over many months and sent logs and offered multiple times to jump on a call with no luck.
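
The first half of the approach described above (an initial snapshot over REST, then finding the highest server timestamp) could look roughly like this. It is a sketch, not the code the author wrote: the `v1` endpoint, the `(default)` database, the `updatedAt` field name, and both helper names are assumptions, and a real request also needs an `Authorization` header.

```javascript
// Builds the URL and JSON body for a POST to the Firestore REST runQuery
// method. projectId and collectionId are placeholders.
function buildRunQueryRequest(projectId, collectionId, orderField) {
  const parent = `projects/${projectId}/databases/(default)/documents`;
  return {
    url: `https://firestore.googleapis.com/v1/${parent}:runQuery`,
    body: {
      structuredQuery: {
        from: [{ collectionId }],
        orderBy: [{ field: { fieldPath: orderField }, direction: "ASCENDING" }],
      },
    },
  };
}

// Scans the response rows and returns the largest timestamp found in
// `field` as epoch milliseconds, or null if no row carries one.
// Date.parse keeps millisecond precision only.
function highestTimestampMs(rows, field) {
  let highest = null;
  for (const row of rows) {
    const value = row.document?.fields?.[field]?.timestampValue;
    if (value === undefined) continue; // row without a document or field
    const ms = Date.parse(value);
    if (Number.isNaN(ms)) continue;
    if (highest === null || ms > highest) highest = ms;
  }
  return highest;
}
```

Because the timestamps are reduced to milliseconds, documents written within the same millisecond as the highest value need care when the listener is started after it.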

wu-hui commented 2 weeks ago

Thanks for the update, it is great that you found a way that works for you.

I have a couple of questions:

  1. runQuery does not support realtime queries, how do you use snapshot listener still?
  2. About "Then we have changed every write in our app to store server timestamps, and we changed our snapshot listener to filter for all values higher than the values seen than what we pulled down in the starting snapshot.". Does this mean eventually you get the improvements by having fewer documents sent to the client?

michaelAtCoalesce commented 2 weeks ago

  1. we still use the client SDK for realtime updates. we go through the data we pulled down via rest API, then set a filter on the firestore sdk listener to start after the highest timestamp.
  2. no, its the same number of documents. what we do is pull down our own snapshot of the data via the REST api (because the client SDK performance is so unreliable on windows), then use the client SDK to listen to all changes starting from the highest "updatedAt" value.
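
The merge step implied by these two answers can be sketched as follows. The data shapes are assumptions, since the thread does not describe them: the REST snapshot is held as a `Map` keyed by document id, listener changes arrive as `{ id, data }` pairs, and the separate deletes collection yields a list of deleted ids. `mergeIncremental` is a hypothetical name.

```javascript
// base: Map of document id -> data from the initial REST snapshot.
// changes: array of { id, data } delivered by the realtime listener.
// deletedIds: ids read from the separate deletes collection (shape assumed).
// Returns a new Map and leaves `base` untouched.
function mergeIncremental(base, changes, deletedIds) {
  const merged = new Map(base);
  for (const { id, data } of changes) merged.set(id, data); // add or overwrite
  for (const id of deletedIds) merged.delete(id); // apply deletes last
  return merged;
}
```

With the SDK, the listener query for the changes might be something like `query(collection(db, "items"), orderBy("updatedAt"), startAfter(Timestamp.fromMillis(highestMs)))`, where `items` and `highestMs` are placeholders.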