Open nathanhealey opened 11 months ago
I found a few problems with this issue:
After further triage, I believe the underlying issue is the amount of time the firebase-admin SDK spends preparing sends and processing received data. The console shows that the P50 Firestore read time is ~90ms, which does not line up with the tracing shown above. I profiled a single API request on my development machine, which shows similar latencies to the GAE instances. For a single API invocation with a total latency of 1781ms, the server spends at least 684ms in Firebase code, and at least another 447ms waiting for IO, of which Firestore is the only source. For comparison, the aggregate of handling and responding to the client request, all GraphQL processing, and the entirety of user code, is 201ms.
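To separate "time in Firebase code" from "waiting for IO" as described above, one option is to compare wall-clock time against process CPU time around a single request. This is a minimal sketch, not the author's actual profiling setup; `profileRequest` and `doRequest` are hypothetical names, and `process.cpuUsage()` measures the whole process, so the split is only meaningful while one request is in flight:

```typescript
// Minimal sketch: split the wall-clock time of one async operation into
// process CPU time vs. everything else (mostly I/O wait).
// Caveat: process.cpuUsage() covers the whole process, so this is only
// meaningful when a single request is being handled at a time.
async function profileRequest<T>(doRequest: () => Promise<T>): Promise<T> {
  const wallStart = performance.now();
  const cpuStart = process.cpuUsage();
  const result = await doRequest();
  const wallMs = performance.now() - wallStart;
  const cpu = process.cpuUsage(cpuStart); // user + system since cpuStart, in microseconds
  const cpuMs = (cpu.user + cpu.system) / 1000;
  console.log(
    `wall=${wallMs.toFixed(0)}ms cpu=${cpuMs.toFixed(0)}ms ` +
      `wait~${Math.max(0, wallMs - cpuMs).toFixed(0)}ms`,
  );
  return result;
}
```

A flamegraph (as used above) gives far more detail; this only confirms the CPU-bound vs. IO-bound split cheaply in production logs.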
You can see in the flamegraph that the second waterfall of requests takes ~300ms from start of processing to the end, which lines up pretty well with what I see in the tracing from the first post, and the observation of ~100ms read times from Firestore.
Annotated flamegraph of the above findings.
I verified that the API is CPU-bound by running tests on GAE with 600MHz and 1.2GHz processors. The faster instance responded to API requests almost exactly twice as fast.
Because the bottleneck is CPU rather than I/O wait, the Node.js server cannot interleave multiple requests well. This limits the server to approximately 1 RPS, which I think we can all agree is quite low. My traffic arrives in infrequent, sporadic bursts of 3-6 RPS, too short for responsive scaling to make any difference.
This all being said: Is there anything I can do to drastically cut down on processing time? Is this due to the high number of concurrent requests, or is this inherent to the library? Is there configuration I'm missing?
For anyone else facing this issue: I've done all the optimizations I think I can do, and I've managed to get API calls down to a P50 of 800ms. I got there through two major optimizations:
The conclusions I've reached based on the above are:
I use Firestore as my database for a simple backend server, and very slow reads are causing my API requests to take several seconds to complete.
My backend is an always-running GraphQL server on Google App Engine, so it is not suffering from cold-start issues. The GAE instance and the Firestore instance are both in US-central (nam5), so transfer delay should be negligible. Document sizes are 2-10KB, so large transfers should not be an issue. All other external REST calls complete in 50-120ms, so I don't believe it's a network saturation issue. My document IDs are randomly generated UUIDs, so I don't believe it's a hotspotting issue.
My typical API processing reads 1-5 documents in parallel to determine what the user is requesting, then reads 20-80 documents in parallel to build the response, i.e. a single two-step waterfall. Using @opentelemetry/instrumentation-grpc@0.27.0, I am tracing these calls from within the server (and batching the traces). Typically the first batch of reads completes in 300-700ms, and the second batch takes 600-1200ms. That far exceeds what I would consider viable performance.
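When the second wave fans out 20-80 reads at once, one thing worth ruling out is all of them competing for the same connections simultaneously. Below is a minimal, generic sketch of capping the number of in-flight reads; `fn` is a placeholder for whatever performs one Firestore read (e.g. a `doc.get()` call), which is an assumption, not code from this project:

```typescript
// Minimal sketch: run async reads with a cap on the number in flight,
// so a wave of 20-80 fetches doesn't all hit the connection pool at once.
// `fn` stands in for a single Firestore read (placeholder).
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker synchronously claims the next index, then awaits the read.
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, () => worker()),
  );
  return results;
}
```

Whether a cap helps or hurts here depends on where the contention actually is; it is a diagnostic lever, not a known fix.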
To keep the user from feeling interrupted, I try to keep my P50 latency under 300ms. With these numbers, I'm not even close; API calls that take 5+ seconds are not uncommon.
I feel like I've eliminated every possible variable, and all that is left is: a) an issue with the node package, perhaps straining under 20+ simultaneous reads, or b) misrepresented data coming from the instrumentation-grpc package, or c) mismatched expectations of Firestore's capabilities. Unfortunately, it seems Firestore doesn't publish any SLAs or expectations. Is a single Firestore read taking 300ms expected, or an outlier needing investigation? I do see some reads happening in <40ms, which is far closer to my expectation.
Here's one of the simplest APIs I have, and it completes in 568ms. There are 56 reads across 6 batches. Shortest read time is 23ms (the first one), longest is 307ms. Most reads are >180ms.
Here's a taste of a more complex API that has 41 reads, and completes in 3298ms:
For reference, here's an API call that doesn't call Firestore. It completes in 9ms.
From these traces, it seems that read time scales linearly with the number of reads in flight. Information online suggests that the firebase-admin package uses a gRPC connection pool under the hood, so I would expect read times to stay flat until the pool is exhausted, then scale linearly beyond that point. Is there a way to configure or inspect what's happening with the gRPC connection pool?
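One configuration knob that sidesteps the gRPC pool entirely, in recent versions of firebase-admin (via the bundled @google-cloud/firestore client), is the `preferRest` setting, which routes one-shot reads over REST instead of gRPC streams. This is a hedged sketch of how it would be enabled, assuming a version of the SDK that supports it; whether it improves latency in this workload is untested:

```typescript
import { initializeApp } from "firebase-admin/app";
import { getFirestore } from "firebase-admin/firestore";

// Assumption: a recent firebase-admin whose Firestore client supports
// the preferRest setting. settings() must be called before the first read.
const app = initializeApp();
const db = getFirestore(app);
db.settings({
  preferRest: true, // use REST for one-shot reads instead of gRPC streams
});
```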
Here's the code used to initialize the package.
firestore.ts
TL;DR: How do I get my read times lower, or is this as good as I can expect them to be?