sergiocampama opened 8 months ago
I couldn't figure out how to label this issue, so I've labeled it for a human to triage. Hang tight.
Not sure if this is relevant, but I noticed that the TCP connection stops receiving any bytes after this condition is triggered... wondering if something here is blocking the connection because of the sheer amount of network traffic.
I think I have an idea of why it starts happening, but can't yet explain it in code. After some event, we create changelog documents in a collection like "/changelog/user_id", which can mean 100s or 1000s of documents created in a very short span of time, but outside of any database transaction, so it arrives as 100s of serial updates of 1 document each. For some reason, this seems to lock up Firestore. If I remove the changelog subscription, I can't reproduce the hang any more.
Hi @sergiocampama, thank you for filing this issue. Quick question: what were the steps when you reproduced this issue? How is the hang on terminating Firestore related to "there's an overload of firestore document changes"? Are you terminating Firestore right after the ~100 document mutations?
Through some mechanism, we're creating/modifying 1000s of changelog documents. We have a listener on that collection of changelogs, which gets updated one document at a time, since the changelog documents are created from different cloud functions (they're onModified triggers for another set of documents).
At some point during this listening, Firestore seems to lock up and stops receiving any kind of updates. We notice this because a document that should have been deleted is never removed from the screen.
At that point, if you call Firestore.terminate, the callback never gets called because the client is somehow locked, and it gets stuck on the backtrace shown in the issue description.
My colleague has pointed out that it might be because LevelDbRemoteDocumentCache creates concurrent tasks. It might have created 1000 tasks in your case, which would overflow the default queue as well as the extra threads created afterwards.
Could you please try changing the LevelDbRemoteDocumentCache::LevelDbRemoteDocumentCache constructor to use CreateSerial instead, and see if it resolves the issue:
executor_ = Executor::CreateSerial("com.google.firebase.firestore.query");
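To make the contrast explicit, here's a sketch of the change (the CreateConcurrent line is paraphrased from memory rather than copied from the source, and its thread count is just illustrative; only the CreateSerial line is the actual suggestion):

// Inside the LevelDbRemoteDocumentCache constructor (sketch only).

// Before -- concurrent executor (paraphrased): a burst of N lookups can fan
// out into many parallel tasks, each potentially needing its own worker thread.
// executor_ = Executor::CreateConcurrent("com.google.firebase.firestore.query",
//                                        /*threads=*/4);

// After -- serial executor: the same work runs one task at a time on a single
// worker, bounding the number of outstanding threads regardless of burst size.
executor_ = Executor::CreateSerial("com.google.firebase.firestore.query");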
Made the change locally, but it didn't help, still getting stuck. I also replaced the cacheSettings with MemoryCacheSettings(garbageCollectorSettings: MemoryLRUGCSettings()), which bypasses the LevelDB layer, and it's also getting stuck on the same events.
it's not looking like a dispatch issue, more like one of the GrpcStreams is getting blocked
With persistence ruled out, the next suspect is gRPC. In the next iOS SDK release we are doing a significant upgrade of the gRPC dependency, which may fix the issue. Would you like to try again after the new release? Or you can try applying https://github.com/firebase/firebase-ios-sdk/pull/12398 locally to upgrade gRPC in advance.
Could you please provide a debug log as well? It might give us some clues.
No dice on updated gRPC.
I've enabled logging, but the output is massive and contains some sensitive information for a public setting, and I also don't think it provides much value, since there are no errors, just a huge list of watch/unwatch/response logs. I could share it if you start an email thread with me at sergiocampama at gmail dot com.
But I did find this: when the connection seems to hang because of the massive number of document updates, I can still send requests to the backend to listen to new documents (so it's not a dispatch queue issue), but the backend never replies with a response for the initial document read nor for the registered target_id. So it's starting to look like the WatchStream from Firestore's Datastore is getting hung on the backend side (could be head-of-line blocking on that h2 stream). If this is indeed a possibility, it seems harder to debug from your side... not sure if I can give you any identifier for the connection so that someone could look into it further.
It would be great if the Firestore gRPC channel had some sort of ping mechanism to detect that the connection is hanging and to force-recover it by re-registering the target_ids, so we don't lose uptime just because that single connection is blocked. Failing that, force-closing the watch_stream and write_stream from remote_store would also help, so we could at least recover by letting the user perform the action that triggers Firestore's termination.
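For reference, gRPC itself exposes client-side keepalive pings through channel arguments. This is generic gRPC C++ usage, not how the SDK currently builds its channel, and the values are just examples:

#include <grpcpp/grpcpp.h>

// Generic sketch: enabling HTTP/2 keepalive pings on a gRPC C++ channel.
std::shared_ptr<grpc::Channel> MakeChannelWithKeepalive() {
  grpc::ChannelArguments args;
  // Send a keepalive ping after 30s of inactivity on the transport.
  args.SetInt(GRPC_ARG_KEEPALIVE_TIME_MS, 30 * 1000);
  // Treat the connection as dead if the ping isn't acknowledged within 10s.
  args.SetInt(GRPC_ARG_KEEPALIVE_TIMEOUT_MS, 10 * 1000);
  // Allow pings even when no RPCs are in flight.
  args.SetInt(GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS, 1);

  return grpc::CreateCustomChannel(
      "firestore.googleapis.com:443",
      grpc::SslCredentials(grpc::SslCredentialsOptions()), args);
}

A failed keepalive would at least surface as the transport going away, giving the streams a definitive signal to tear down and restart instead of hanging silently.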
for now we're mitigating by trying to reduce the number of document reads to not trigger this hang, but given we don't have a root cause yet, we fear it might still be triggered in the future.
Actually, looking at the logs for the last response log, I see this just after it:
10.17.0 - [FirebaseFirestore][I-FST000001] GrpcStream('168259010'): operation of type 2 failed
10.17.0 - [FirebaseFirestore][I-FST000001] GrpcStream('168259010'): finishing the underlying call
10.17.0 - [FirebaseFirestore][I-FST000001] GrpcStream('168259010'): fast finishing 1 completion(s)
(we're using 10.22 Firestore but 10.17 FirebaseCore for some reason)
I set a breakpoint in datastore.cc's PollGrpcQueue for when ok == false, then tried tracing the WriteStream's write call to see where we're parsing/getting the error in the C++ code, but it's just so difficult to traverse... so I'm unsure whether we're getting the error from the backend or from the network stack, and there are no other error logs.
Is there a way to get more information on why the grpc completion queue failed to write to the stream?
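For context, that poll loop follows the standard grpc::CompletionQueue pattern, where Next() only hands back the operation's tag plus a single success bit; a generic sketch (not the SDK's exact code):

#include <cstdio>
#include <grpcpp/grpcpp.h>

// Hypothetical handler standing in for Firestore's GrpcCompletion dispatch.
void HandleCompletion(void* tag, bool ok) {
  std::printf("completion %p finished, ok=%d\n", tag, ok);
}

// Generic completion-queue drain loop. Next() blocks until a queued operation
// (read, write, finish, ...) completes; `ok` is the only per-operation outcome
// it reports. A detailed grpc::Status is only delivered by a Finish operation.
void DrainCompletionQueue(grpc::CompletionQueue& queue) {
  void* tag = nullptr;
  bool ok = false;
  while (queue.Next(&tag, &ok)) {
    HandleCompletion(tag, ok);
  }
}

So, as far as I understand it, by the time PollGrpcQueue sees ok == false there's no richer error attached to that operation; the underlying cause only shows up if something asks the call for its final status, or via gRPC's own tracing.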
I'm fairly confident that when we get GrpcStream('168259010'): operation of type 2 failed, the stream effectively dies and Firestore gets into a corrupted state where the only recourse is to restart the app.
I just tested that I could continue to use the write stream by creating new Firestore documents after getting the operation of type 2 failed error. So this is specifically about Datastore's WatchStream getting completely stuck when it fails to write a message, and Firestore failing to recover from this state even after attempting to terminate the Firestore instance. Hopefully this is enough information for you to create a test that simulates this condition (just the failure to write on the watch stream) and reproduce it.
ideally, we should be able to find the root cause of the write failure, but sans that, allowing firestore to recover from this state without requiring an app restart should be the higher priority IMO
One final update: I was able to create a small repro case that results in the same behavior. It overloads the WatchStream until there's a read error (GrpcStream('6000032595f0'): operation of type 1 failed), and then when you tap the "Run stuff" button in the simulator, it attempts to terminate Firestore but can't, and gets stuck.
not sure if this is the same failure mode as what I'm seeing above, but it looks similar in that if the watch or write streams are overloaded, firestore gets corrupted with no way of recovering
Hi @sergiocampama, thank you for the repro app. I am able to reproduce the "operation of type 2 failed" error.
awesome, thanks for confirming... let me know when you get a chance to review with the team, to see if there are any quick workarounds that we can use
ok, found a workaround that works for us for now... if I replace GrpcStream::NewCompletion with the following in grpc_stream.cc:
std::shared_ptr<GrpcCompletion> GrpcStream::NewCompletion(
    Type tag, const OnSuccess& on_success) {
  // Can't move into lambda until C++14.
  GrpcCompletion::Callback decorated =
      [this, on_success](bool ok,
                         const std::shared_ptr<GrpcCompletion>& completion) {
        RemoveCompletion(completion);

        if (ok) {
          if (on_success) {
            on_success(completion);
          }
        } else {
          LOG_DEBUG("GrpcStream('%s'): operation of type %s failed", this,
                    completion->type());
          if (on_success) {
            on_success(completion);
          }
          // Use the same error-handling for all operations; all errors are
          // unrecoverable.
          // OnOperationFailed();
        }
      };

  // For lifetime details, see `GrpcCompletion` class comment.
  auto completion =
      GrpcCompletion::Create(tag, worker_queue_, std::move(decorated));
  completions_.push_back(completion);
  return completion;
}
I can avoid the stream crapping out... it won't get the listen updates, but I can pair this with a polling read every n seconds, and if that fails, I can call terminate, which no longer locks up.
In fact, with that code above (which looks terrible, since it calls on_success for the error case), the connection continues working normally, it never locks up, it never hangs the terminate call, and it sort of recovers on its own. Sure, I'll lose some events, but in those cases we can just call terminate and restart the listeners... I disagree with "all errors are unrecoverable".
more information about the "failure" message... it looks like the request itself succeeded? it's unclear why the "ok" boolean is set to false
Tracing back to where the ok status comes from, I was able to find this error:
(lldb) p grpc_error_std_string(error)
(std::string) "UNKNOWN:Attempt to send message after stream was closed. {file:\"/Users/kaipi/Developer/cpl/copilot-ios/main/Pods/gRPC-Core/src/core/lib/surface/call.cc\", file_line:1083, created_time:\"2024-03-13T19:01:29.963325-03:00\"}"
after that, only a single bit is propagated to signal the failure of the completion.
I figured out how to enable gRPC tracing, and was able to see that we are indeed getting a RST_STREAM from the backend with code 0.
Here are the trace logs: https://gist.githubusercontent.com/sergiocampama/c7ebecd21f42ca948358cd45bc1d5a77/raw/0fd3b4e2e8ff8c5639c9c3bca48be1be9a7d8f1e/gistfile1.txt
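(For anyone else who wants to do the same: gRPC's tracing is controlled by standard environment variables that have to be set before any gRPC code runs; the tracer names below are just examples.)

#include <cstdlib>

// Enable gRPC core tracing as early as possible (e.g. at the top of main()).
// GRPC_TRACE takes a comma-separated list of tracer names, or "all".
void EnableGrpcTracing() {
  setenv("GRPC_TRACE", "http,call_error,connectivity_state", 1);
  setenv("GRPC_VERBOSITY", "DEBUG", 1);
}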
What I was able to find is that the watch stream does indeed get a RST_STREAM message, and then we restart that stream, as you can see from the multiple SEND_INITIAL_METADATA{:path: /google.firestore.v1.Firestore/Listen events.
What I think is happening is that, on overload, the grpc++ batch processor keeps sending messages to the backend after the stream has closed. It doesn't detect that event and cancel the subsequent messages, which generates an error that should be discardable, but an error nonetheless, and that is (wrongly) taken as a signal that the stream is broken and needs to be closed, which in turn puts Firestore into a broken state it cannot recover from.
Created a gRPC++ issue here: https://github.com/grpc/grpc/issues/36115
Created a fork with this change for now [1]. Hopefully you can all reproduce and provide a more robust fix over time, but this is sufficient for our needs for now.
1: https://github.com/copilotmoney/firebase-ios-sdk/commit/b4d785d5adb2d25d68c26ea5b1ce87a4c224e038
This is amazing. Thank you for sharing the debugging process, it is very informative! I will bring this to our team discussion and keep you updated.
Hi, I'm from the gRPC team. I don't completely understand the issue here, but from a gRPC perspective: if a write on a stream/RPC is failing, there is no point in writing to that stream again. Whether an RPC is done or not is decided by what's happening on reads and whether a status is available for that RPC. In this case, a RST_STREAM has been received on the associated HTTP/2 stream, so the stream is indeed done and a status should be available for the RPC.
Please feel free to reach out to me (or schedule a meeting) in case there are any questions as to how gRPC works.
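To make that concrete at the API level: with the async C++ API, once an operation comes back with ok == false, the client should stop issuing reads/writes and request the RPC's final status with Finish(). A generic sketch, nothing Firestore-specific:

#include <grpcpp/grpcpp.h>
#include <grpcpp/support/async_stream.h>

// Generic sketch: after a failed operation on an async bidi stream, ask the
// call for its final status instead of retrying reads/writes.
template <typename W, typename R>
grpc::Status FinishAfterFailure(grpc::ClientAsyncReaderWriter<W, R>* stream,
                                grpc::CompletionQueue* cq) {
  grpc::Status status;
  void* finish_tag = reinterpret_cast<void*>(1);  // arbitrary tag for the sketch
  // Queue the Finish operation; its completion carries the RPC's status
  // (e.g. the error behind an HTTP/2 RST_STREAM).
  stream->Finish(&status, finish_tag);

  void* tag = nullptr;
  bool ok = false;
  while (cq->Next(&tag, &ok)) {
    if (tag == finish_tag) break;  // other in-flight tags may drain first
  }
  return status;  // error_code()/error_message() explain why the stream ended
}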
Thanks @yashykt, I was basing my batching assumption on the error being generated in the FilterStackCall::BatchControl::PostCompletion() method in call.cc. I could not find any batching being done in Firestore, and it seems to be using the grpc_completion_queue mechanism; not sure if that's where the issue lies, or a red herring.
I'll let the Firestore experts figure it out then.
Description
There are certain scenarios where I'll call Firestore's terminate function, and it will hang, never calling the callback. I haven't been able to create a reproducer, but can consistently get my product's code to get into this state (not shareable for obvious reasons).
This appears to happen when there's an overload of firestore document changes. I am able to see that the com.google.firebase.firestore queue's target queue is com.apple.root.default-qos.overcommit, though I'm not sure what this means. It looks like the queue starts with this value set as the target queue, so it could be a red herring.
Here's a screenshot of the backtrace of where Firestore hangs... It looks like it's unable to close the gRPC watch stream for the RemoteStore, stuck on WaitUntilOffQueue.
I'm basically stuck trying to debug this to see where things go wrong and what I can do to avoid this. I'm very unfamiliar with this codebase, so I don't even know what to look for to find where the code is getting stuck. Any help with pointers on where to look or what to check would be very appreciated.
Reproducing the issue
Can't share reproduction steps, as I can't share the relevant code.
Firebase SDK Version
10.22
Xcode Version
15.3
Installation Method
CocoaPods
Firebase Product(s)
Firestore
Targeted Platforms
iOS
Relevant Log Output
If using Swift Package Manager, the project's Package.resolved
None
If using CocoaPods, the project's Podfile.lock
Can't share