Open we-ai opened 1 month ago
@we-ai @anthonypetersen @Davinkjohnson had a meeting today. Below are conclusions:
Diagram of suggested data flows:
Two steps might be necessary to finish the transition:
After the transition, Firestore data backups to Cloud Storage can be less frequent (eg 1 or 2 times per day) for lower costs.
Related info:
@danielruss and @anthonypetersen helped installing an instance of Stream Firestore to BigQuery in dev tier. For each streamed Firestore collection, 2 files are generated in BigQuery dataset:
A table of change logs. old_data
keeps data before an update, and data
keeps data after an update:
A view showing latest data:
In each of above files, the most fresh data streamed from Firestore collection are converted into strings and saved in column data
. To keep target BigQuery tables (participants
, boxes
, etc) synced with Firestore collections, the data in data
column need to be transformed, loaded and used to update target tables in BigQuery.
@we-ai is the transformation / loading something that needs to happen inside of BQ?
@we-ai is the transformation / loading something that needs to happen inside of BQ?
I don't see restrictions on the data handling. So I believe these can be done outside of BQ using cloud functions etc.
how often will these files be generated?
@we-ai Thanks for testing this out, Warren. I saw the firestore_export
dataset in dev. It looks like the *_latest
is just a view of data in *_changelog
, so there is no duplication. I can look into whether the transformation/update of the target tables could be done directly within BigQuery but I'm not familiar with this process yet. I agree that it might require a cloud function..
I would love to retain the timestamp
and operation
fields in the target tables so that we have an idea of when each row/record was last updated. This could make our QC/reporting more efficient if we use that information well.
After more checking of Stream Firestore to BigQuery extension, I feel it doesn't meet our needs well, because of below drawbacks (especially the first one):
data
column. We expected to get tables with each Firestore doc converted into a row in BQ format, but instead we get stringified docs from the extension and still have to do the heavy-lifting of data conversion (from Firestore to BigQuery).Without using extension "Stream Firestore to BigQuery", the data syncing (from Firestore to BigQuery) can be more flexible, consisting of 2 main steps:
__keys__.namespece
), etc. For sure we can add fields (timestamp, operations etc) in target tables based on our needs.@jacobmpeters @anthonypetersen @JoeArmani Please let me know if you have suggestions.
@we-ai can you please include some screenshots that help visualize what the output from Stream Firestore to BigQuery
looks like?
@we-ai can you please include some screenshots that help visualize what the output from
Stream Firestore to BigQuery
looks like?
Screenshots were posted above. I can post more if they're not visible or not clear enough.
Below is a screenshot of participants_raw_changelog
in firestore_export
dataset of dev tier:
Schema of the output table:
@we-ai of your two drawbacks, I'm not really that concerned about the second one... it might be tedious initially, but it's a setup that only needs to be done one time for our tables (obviously we will need to add the setup again if / when we add more tables)
Looking at the data
column from your screenshot, the data appears to stay in the correct format. It seems like we would need to have some kind of middleman code that listens for updates from Stream Firestore to BigQuery
that takes the update and writes it to the correct BQ table.
This option as well as other things you've mentioned above all seem to require a bit of extra code to achieve our final goal. My question then becomes which option is the right balance or worked required as well as ease of maintenance / accuracy in produced data.
Upgrade Firestore backups to be more cost effective. There were related discussions here. Detailed requirements are to be decided for the implementation.