firebase / extensions

Source code for official Firebase extensions
https://firebase.google.com/products/extensions
Apache License 2.0

🐛 [Stream Firestore to BigQuery] Events stop streaming from firestore to bigquery, but fixed through extension update? #2198

Open leighajarett opened 1 month ago

leighajarett commented 1 month ago

Steps to reproduce:

Several months ago, the extension started to intermittently stop streaming records into BigQuery. Streaming stays almost completely stopped until we upgrade the extension to a new version. We don't see any errors in the logs. We have one version of the extension that streams into a non-partitioned table and one that streams into a partitioned table. Only the partitioned table seems to be affected.

Expected result

Records continuously stream into BigQuery without interruption.

Actual result

Records are omitted from the BigQuery table until we upgrade the version.

puf commented 1 month ago

Hey folks, I'm working with @leighajarett on this problem. What we see is that the extension works fine for us for a while, and then stops writing most events (our Firestore write volume is pretty constant). When we install a new version of the extension, it works again, until it stops later.

image

Any idea what could be going on to cause this, or even how we can troubleshoot it?

pr-Mais commented 1 month ago

@puf does this chart represent exports count in BigQuery?

leighajarett commented 1 month ago

It represents the number of events per day; it's a count of the records in the table.

leighajarett commented 1 week ago

Just to add some more information here: we pinpointed a specific event that is missing from the BigQuery table.

In the logs, we can see this error

Screenshot 2024-11-14 at 1 48 18 PM

We're wondering if things are timing out somewhere? Maybe from an overload of events?

puf commented 1 week ago

We (Leigha, myself and our team) have been analyzing a bit further, and these metrics from the Cloud Run task queue associated with one of our extension instances seem pretty conclusive:

CleanShot 2024-11-14 at 11 45 25@2x

In the top chart you can see that:

In the bottom chart you see the size of the task queue, which grows to 500 million, which is presumably its maximum. So... the queue is just not able to process the tasks that the extension is adding to it.

We've just changed the configuration of this queue to have a Max rate of 500/s (the maximum we can set) to see if that allows it to drain the backlog of tasks, but given the rate at which we're adding tasks, that likely won't be enough for long.
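For a rough sense of scale, here is a back-of-the-envelope sketch of how long that backlog would take to drain. It uses only the two numbers above (a backlog of about 500 million tasks and a Max rate of 500/s); the `drain_time_seconds` helper and its `arrival_rate` parameter are made up for illustration, and the rate at which new tasks arrive is not something measured in this thread.

```python
def drain_time_seconds(backlog_tasks: int, dispatch_rate: float,
                       arrival_rate: float = 0.0) -> float:
    """Estimate how many seconds it takes to empty a queue.

    backlog_tasks: tasks currently waiting in the queue
    dispatch_rate: tasks dispatched per second (the queue's Max rate)
    arrival_rate:  new tasks added per second (assumed, not measured here)

    Returns infinity when new tasks arrive at least as fast as they
    are dispatched, because the backlog then never shrinks.
    """
    net_rate = dispatch_rate - arrival_rate
    if net_rate <= 0:
        return float("inf")
    return backlog_tasks / net_rate


backlog = 500_000_000  # queue size seen in the bottom chart
max_rate = 500         # tasks per second, the maximum we can set

# Best case: no new tasks are added while the queue drains.
best_case = drain_time_seconds(backlog, max_rate)
print(f"{best_case / 86_400:.1f} days")  # prints "11.6 days"
```

Even in the best case, with nothing new being enqueued, the backlog takes more than eleven days to clear. Any non-zero arrival rate makes that longer, and an arrival rate at or above 500/s means the queue never drains.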

We've also upgraded one of our instances of this extension to the new 0.1.56 version, and no longer see the same errors in our logs for that instance.

puf commented 2 days ago

Five days in, we're still seeing the events being streamed into BigQuery, so 🎉