firebase / firebase-tools

The Firebase Command Line Tools
MIT License
4.02k stars 943 forks source link

Firebase Firestore Emulator Degrades for Batch Writes #3477

Closed williamhaley closed 1 year ago

williamhaley commented 3 years ago

Environment info

firebase-tools:

% firebase --version
9.12.1
"firebase-tools": "^9.12.1"

Platform:

macOS 11.4

Test case

https://github.com/williamhaley/firestore-emulator-slow

Steps to reproduce

  1. Clone the test case repo and npm i to get dependencies
  2. Start the emulators
    GCLOUD_PROJECT=fake-project-id ./node_modules/.bin/firebase emulators:start --import=emulator-data --export-on-exit --debug --project fake-project-id
  3. Try running a simple batch write demo application in my test case repo to insert a large number of documents
    NUM_RECORDS=10000000 GCLOUD_PROJECT=fake-project-id FIRESTORE_EMULATOR_HOST=localhost:8080 go run demo.go
  4. Note that the average time/rate for a batch to write ~steadily gets worse~ seems fine until it hits a wall, then steadily gets worse. Overall the trend seems slower, but subsequent runs (even after deleting logs, stored data) seem worse for reasons unclear.
    % NUM_RECORDS=1000000 GCLOUD_PROJECT=fake-project-id FIRESTORE_EMULATOR_HOST=localhost:8080 go run demo.go
    done writing [0-500] 500 result(s) [rate: 269.626908/s]
    done writing [500-1000] 500 result(s) [rate: 5542.998078/s]
    done writing [1000-1500] 500 result(s) [rate: 12494.110276/s]
    done writing [1500-2000] 500 result(s) [rate: 15573.126888/s]
    done writing [2000-2500] 500 result(s) [rate: 15856.192201/s]
    done writing [2500-3000] 500 result(s) [rate: 15354.158404/s]
    done writing [3000-3500] 500 result(s) [rate: 13338.180606/s]
    done writing [3500-4000] 500 result(s) [rate: 12831.390220/s]
    done writing [4000-4500] 500 result(s) [rate: 13419.423047/s]
    ...
    done writing [47000-47500] 500 result(s) [rate: 27625.403725/s]
    done writing [47500-48000] 500 result(s) [rate: 29769.098961/s]
    done writing [48000-48500] 500 result(s) [rate: 337.738290/s]
    done writing [48500-49000] 500 result(s) [rate: 146.222209/s]
    done writing [49000-49500] 500 result(s) [rate: 146.691434/s]
    done writing [49500-50000] 500 result(s) [rate: 106.763477/s]
    done writing [50000-50500] 500 result(s) [rate: 135.148104/s]
    done writing [50500-51000] 500 result(s) [rate: 113.846432/s]
    done writing [51000-51500] 500 result(s) [rate: 104.573181/s]
    done writing [51500-52000] 500 result(s) [rate: 97.531589/s]
    done writing [52000-52500] 500 result(s) [rate: 84.426878/s]
    done writing [52500-53000] 500 result(s) [rate: 70.867284/s]
    done writing [53000-53500] 500 result(s) [rate: 81.313259/s]
    done writing [53500-54000] 500 result(s) [rate: 64.313282/s]
    done writing [54000-54500] 500 result(s) [rate: 66.894208/s]
    done writing [54500-55000] 500 result(s) [rate: 72.471526/s]
    done writing [55000-55500] 500 result(s) [rate: 56.617615/s]
    done writing [55500-56000] 500 result(s) [rate: 60.438153/s]
    done writing [56000-56500] 500 result(s) [rate: 57.030455/s]
    done writing [56500-57000] 500 result(s) [rate: 57.240398/s]
    done writing [57000-57500] 500 result(s) [rate: 54.856295/s]
    done writing [57500-58000] 500 result(s) [rate: 59.126199/s]
    ...
    % NUM_RECORDS=1000000 GCLOUD_PROJECT=fake-project-id FIRESTORE_EMULATOR_HOST=localhost:8080 go run demo.go
    done writing [0-500] 500 result(s) in 0.975259 seconds [rate: 512.684503/s]
    done writing [500-1000] 500 result(s) in 0.185842 seconds [rate: 2690.451492/s]
    done writing [1000-1500] 500 result(s) in 0.197749 seconds [rate: 2528.460874/s]
    done writing [1500-2000] 500 result(s) in 0.142909 seconds [rate: 3498.734711/s]
    done writing [2000-2500] 500 result(s) in 0.168920 seconds [rate: 2959.978165/s]
    done writing [2500-3000] 500 result(s) in 0.158540 seconds [rate: 3153.782981/s]
    done writing [3000-3500] 500 result(s) in 0.161456 seconds [rate: 3096.820962/s]
    done writing [3500-4000] 500 result(s) in 0.143788 seconds [rate: 3477.351847/s]
    ...
    done writing [16500-17000] 500 result(s) in 0.398754 seconds [rate: 1253.904584/s]
    done writing [17000-17500] 500 result(s) in 0.655998 seconds [rate: 762.196869/s]
    done writing [17500-18000] 500 result(s) in 0.415008 seconds [rate: 1204.796763/s]
    done writing [18000-18500] 500 result(s) in 0.437041 seconds [rate: 1144.057967/s]
    done writing [18500-19000] 500 result(s) in 0.461470 seconds [rate: 1083.494949/s]
    done writing [19000-19500] 500 result(s) in 0.460002 seconds [rate: 1086.952070/s]
    done writing [19500-20000] 500 result(s) in 0.477613 seconds [rate: 1046.872436/s]
    done writing [20000-20500] 500 result(s) in 0.494885 seconds [rate: 1010.336331/s]
    done writing [20500-21000] 500 result(s) in 0.701484 seconds [rate: 712.774808/s]
    done writing [21000-21500] 500 result(s) in 0.497308 seconds [rate: 1005.412960/s]
    done writing [21500-22000] 500 result(s) in 0.506893 seconds [rate: 986.401775/s]
    done writing [22000-22500] 500 result(s) in 0.561231 seconds [rate: 890.898083/s]
    ...
    done writing [45500-46000] 500 result(s) in 0.726087 seconds [rate: 688.622994/s]
    done writing [46000-46500] 500 result(s) in 0.695046 seconds [rate: 719.376493/s]
    done writing [46500-47000] 500 result(s) in 0.745275 seconds [rate: 670.893113/s]
    done writing [47000-47500] 500 result(s) in 0.736498 seconds [rate: 678.888768/s]
    done writing [47500-48000] 500 result(s) in 0.811997 seconds [rate: 615.765996/s]
    done writing [48000-48500] 500 result(s) in 0.749159 seconds [rate: 667.414800/s]
    ...
    done writing [59000-59500] 500 result(s) in 1.118787 seconds [rate: 446.912714/s]
    done writing [59500-60000] 500 result(s) in 1.347482 seconds [rate: 371.062431/s]
    done writing [60000-60500] 500 result(s) in 1.160734 seconds [rate: 430.761933/s]
    done writing [60500-61000] 500 result(s) in 1.331365 seconds [rate: 375.554300/s]
    done writing [61000-61500] 500 result(s) in 1.109495 seconds [rate: 450.655512/s]
    done writing [61500-62000] 500 result(s) in 0.818895 seconds [rate: 610.578777/s]
    done writing [62000-62500] 500 result(s) in 0.899672 seconds [rate: 555.757867/s]
    done writing [62500-63000] 500 result(s) in 0.880934 seconds [rate: 567.579713/s]
    done writing [63000-63500] 500 result(s) in 0.811197 seconds [rate: 616.373263/s]
    done writing [63500-64000] 500 result(s) in 1.544817 seconds [rate: 323.663024/s]
    done writing [64000-64500] 500 result(s) in 0.841225 seconds [rate: 594.371420/s]
  5. The emulator throws errors implying it is not processing new messages
    210609 11:34:17.543:W 111 [Thread-49] [com.google.cloud.datastore.emulator.firestore.webchannel.FirestoreV1WebChannelAdapter$FirestoreListenHandler.onError:165] [ERROR] N
    ETWORK_ERROR ()
    210609 11:34:17.543:W 111 [Thread-49] [com.google.net.webchannel.server.v8.BackChannel.add:79] Failed to send a new message due to too many pending messagings in the back
    channel (10001). May need enable flow control.
    210609 11:34:17.543:W 111 [Thread-49] [com.google.net.webchannel.server.v8.ChannelInternalImpl.handle:502] Server fails to send the message, abort the channel!
  6. The client continues, but at slower and slower rates
  7. ctrl+c to kill the client application
  8. ctrl+c to stop the emulator
  9. Start the emulator again with --debug
  10. The emulator spews a ton of log messages and nothing appears in the UI

Note: The firebase-debug.log was not attached as it is very large 9.6G Jun 9 11:58 firebase-debug.log

ui-debug.log

Expected behavior

The emulator can sustain a hypothetical real-world bulk import of a large number of documents.

Writes happen at a continuous rate, or the emulator provides meaningful feedback that writes are slowing to the point that the emulator is throwing some data away as it is received.

Actual behavior

The server/emulator becomes unresponsive and slows down over time. Client calls to the emulator return without an error, but the server seems to lose data while writing. The emulator is unusable and must be stopped as it never catches up with the import processing load.

williamhaley commented 3 years ago

Found something interesting this evening. I tried re-running my same tests that earlier performed poorly; those tests saw substantial performance degradation/slow down. This evening though I saw stable reliable performance up to 1,000,000 records for batch writes!

% GCLOUD_PROJECT=fake-project-id ./node_modules/.bin/firebase emulators:start --import=emulator-data --export-on-exit --debug
% NUM_RECORDS=1000000 GCLOUD_PROJECT=fake-project-id FIRESTORE_EMULATOR_HOST=localhost:8080 go run demo.go
...
done writing [994000-994500] 500 result(s) in 0.015791 seconds [rate: 31663.770276/s]
done writing [994500-995000] 500 result(s) in 0.015554 seconds [rate: 32145.534407/s]
done writing [995000-995500] 500 result(s) in 0.017201 seconds [rate: 29067.856061/s]
done writing [995500-996000] 500 result(s) in 0.016630 seconds [rate: 30065.469363/s]
done writing [996000-996500] 500 result(s) in 0.017246 seconds [rate: 28992.293964/s]
done writing [996500-997000] 500 result(s) in 0.016450 seconds [rate: 30394.733979/s]
done writing [997000-997500] 500 result(s) in 0.015604 seconds [rate: 32043.601857/s]
done writing [997500-998000] 500 result(s) in 0.015377 seconds [rate: 32516.010884/s]
done writing [998000-998500] 500 result(s) in 0.015700 seconds [rate: 31847.667257/s]
done writing [998500-999000] 500 result(s) in 0.013916 seconds [rate: 35929.431148/s]
done writing [999000-999500] 500 result(s) in 0.016410 seconds [rate: 30468.366432/s]
done writing [999500-1000000] 500 result(s) in 0.019448 seconds [rate: 25709.745814/s]

I was unable to replicate the poor batch write conditions I'd seen earlier. No matter if I stopped the server, restarted it, it was performing wonderfully. Which is good, but I couldn't figure out what had changed.

THEN, I navigated to the UI in a browser. That is something I had done earlier today in testing when I filed this issue, but did not do in my most recent round of tests. Once I started the emulator and navigated to the firestore UI everything went south and performance immediately took a nose dive.

...
done writing [751500-752000] 500 result(s) in 0.023309 seconds [rate: 21451.152444/s]
done writing [752000-752500] 500 result(s) in 0.024066 seconds [rate: 20776.529436/s]
done writing [752500-753000] 500 result(s) in 0.022594 seconds [rate: 22129.835568/s]
done writing [753000-753500] 500 result(s) in 29.135182 seconds [rate: 17.161383/s]
done writing [753500-754000] 500 result(s) in 59.680087 seconds [rate: 8.378004/s]
...

I can reliably recreate this behavior. As long as the firestore UI is not loaded in a tab performance is wonderful. It's consistent and fast.

The UI seems to be doing something that tanks the firestore emulator and makes it incredibly slow. At least, in my setup it does.

samtstern commented 3 years ago

@williamhaley thank you for the detailed digging! The UI establishes multiple connections for the emulator. Some are regular real-time watchers to observer data. It also polls regularly for new collections.

Just to set expectations: it's not our goal to make the emulator "scale" under heavy traffic like a real Firestore instance. It's not meant to be a realistic load testing environment. Actually in most cases it's too fast to be realistic, but you've found an interesting edge case where it's slower than production.

I'll ask around and see if anyone has any ideas here.

williamhaley commented 3 years ago

Thanks @samtstern!

That totally makes sense regarding expectations. I'm not looking for the emulator to match a real Firestore instance, and to that end I'm totally OK with it being too fast 🚀 . More than anything I'm just interested in being able to see my import scripts run end to end (with a large volume) in a reasonable duration before I run them against production.

For a local developer experience I've tuned my import script to only import the first 1000-ish records of various collections to the firestore emulator just so that some demo data is available. But to test my full import process against prod it's a nice sense of assurance to see the script run through all the records against the emulator and (hopefully) surface any edge cases before hitting production.

Keeping away from the UI is a meaningful workaround on my end to get what I need out of the emulator for that.

janpansa commented 3 years ago

@williamhaley I also experience this issue. We are doing batch writes to firestore, about +- 15 000 to 20 000, and it slows down immensely on the emulator, but without the emulator it runs fine. I will test your solution of closing the emulator UI and the give feedback here if that solved it. Thanks for the discovery. I have also noted that once you have a few thousand entries in the Firestore emulator, the collection shows blank, even though the entries are there. Interesting.

Anyways, will give my feedback soon, which might help.

MidnightLightning commented 1 year ago

This seems potentially the same issue as #4578; after 10,000 records, the websocket traffic attempting to notify about the records starts throwing errors, and if more records are shoved in after it, the issue could be compounding (the system is spending more and more processing cycles just sending out notifications and can't ingest more records quickly)?

joehan commented 1 year ago

Hey all, while we will do our best to address issues like this, a friendly reminder that the Firestore emulator is not designed for scalability, and it is not expected to handle production-like workloads.

Duplicates https://github.com/firebase/firebase-tools/issues/4578, let's use that one for tracking this issue.