Upgrade Firestore data backups and data syncs

we-ai commented 1 month ago

Upgrade Firestore backups to be more cost effective. There were related discussions here. Detailed requirements are to be decided for the implementation.

we-ai commented 1 month ago

@we-ai @anthonypetersen @Davinkjohnson had a meeting today. Below are conclusions:

There should be 2 separated data flows for flexibility: Firestore data backups to Cloud Storage; Firestore data syncs to BigQuery
Stream Firestore to BigQuery to achieve realtime data updates to BigQuery

Diagram of suggested data flows:

Two steps might be necessary to finish the transition:

Add data stream (from Firestore to BigQuery) functionality and keep our current data flow strategy (Firestore --> Cloud Storage --> BigQuery). Updates to BigQuery will come form data streaming from Firestore and data backups in Cloud Storage. We need to evaluate impacts of realtime data updates to BigQuery and make adjustments if needed.
If everything works well in step 1, remove data loading from Cloud Storage to BigQuery. Updates to BigQuery will only come form data streaming from Firestore.

After the transition, Firestore data backups to Cloud Storage can be less frequent (eg 1 or 2 times per day) for lower costs.

Related info:

Slides used earlier for discussion

we-ai commented 1 month ago

@danielruss and @anthonypetersen helped installing an instance of Stream Firestore to BigQuery in dev tier. For each streamed Firestore collection, 2 files are generated in BigQuery dataset:

A table of change logs. old_data keeps data before an update, and data keeps data after an update:
A view showing latest data:

In each of above files, the most fresh data streamed from Firestore collection are converted into strings and saved in column data. To keep target BigQuery tables (participants, boxes, etc) synced with Firestore collections, the data in data column need to be transformed, loaded and used to update target tables in BigQuery.

anthonypetersen commented 1 month ago

@we-ai is the transformation / loading something that needs to happen inside of BQ?

we-ai commented 1 month ago

@we-ai is the transformation / loading something that needs to happen inside of BQ?

I don't see restrictions on the data handling. So I believe these can be done outside of BQ using cloud functions etc.

anthonypetersen commented 1 month ago

how often will these files be generated?

jacobmpeters commented 1 month ago

@we-ai Thanks for testing this out, Warren. I saw the firestore_export dataset in dev. It looks like the *_latest is just a view of data in *_changelog, so there is no duplication. I can look into whether the transformation/update of the target tables could be done directly within BigQuery but I'm not familiar with this process yet. I agree that it might require a cloud function..

I would love to retain the timestamp and operation fields in the target tables so that we have an idea of when each row/record was last updated. This could make our QC/reporting more efficient if we use that information well.

we-ai commented 3 weeks ago

After more checking of Stream Firestore to BigQuery extension, I feel it doesn't meet our needs well, because of below drawbacks (especially the first one):

Updated data are converted into strings and saved in data column. We expected to get tables with each Firestore doc converted into a row in BQ format, but instead we get stringified docs from the extension and still have to do the heavy-lifting of data conversion (from Firestore to BigQuery).
The extension is not flexible enough. For each of our Firestore collection, we need to install the extension once. We now have 18 collections in Firestore.

Without using extension "Stream Firestore to BigQuery", the data syncing (from Firestore to BigQuery) can be more flexible, consisting of 2 main steps:

Listen to data changes in Firestore. This step is easy since cloud functions can listen to Firestore changes. The "Stream Firestore to BigQuery" extension can also do this job, but it's heavier with drawbacks mentioned above.
Convert data from Firestore format into BigQuery format, and update to target tables. This is the challenging part. But we have options to put more logics during data conversion: data flattening (so that we don't have to do scheduled/manual data flattening), data cleanups (e.g. remove fields like __keys__.namespece), etc. For sure we can add fields (timestamp, operations etc) in target tables based on our needs.

@jacobmpeters @anthonypetersen @JoeArmani Please let me know if you have suggestions.

anthonypetersen commented 3 weeks ago

@we-ai can you please include some screenshots that help visualize what the output from Stream Firestore to BigQuery looks like?

we-ai commented 3 weeks ago

@we-ai can you please include some screenshots that help visualize what the output from Stream Firestore to BigQuery looks like?

Screenshots were posted above. I can post more if they're not visible or not clear enough.

we-ai commented 3 weeks ago

Below is a screenshot of participants_raw_changelog in firestore_export dataset of dev tier:

Schema of the output table:

anthonypetersen commented 3 weeks ago

@we-ai of your two drawbacks, I'm not really that concerned about the second one... it might be tedious initially, but it's a setup that only needs to be done one time for our tables (obviously we will need to add the setup again if / when we add more tables)

Looking at the data column from your screenshot, the data appears to stay in the correct format. It seems like we would need to have some kind of middleman code that listens for updates from Stream Firestore to BigQuery that takes the update and writes it to the correct BQ table.

This option as well as other things you've mentioned above all seem to require a bit of extra code to achieve our final goal. My question then becomes which option is the right balance or worked required as well as ease of maintenance / accuracy in produced data.

episphere / connect

Upgrade Firestore data backups and data syncs #1095