episphere / connect

Connect API for DCEG's Cohort Study

Upgrade Firestore data backups and data syncs #1095

Open we-ai opened 1 month ago

we-ai commented 1 month ago

Upgrade Firestore backups to be more cost-effective. There were related discussions here. Detailed requirements are to be decided for the implementation.

we-ai commented 1 month ago

@we-ai @anthonypetersen @Davinkjohnson had a meeting today. Below are the conclusions:

Diagram of suggested data flows:

image

Two steps might be necessary to finish the transition:

  1. Add data streaming (from Firestore to BigQuery) functionality and keep our current data flow strategy (Firestore --> Cloud Storage --> BigQuery). Updates to BigQuery will come from data streaming from Firestore and data backups in Cloud Storage. We need to evaluate the impact of real-time data updates to BigQuery and make adjustments if needed.
  2. If everything works well in step 1, remove data loading from Cloud Storage to BigQuery. Updates to BigQuery will only come from data streaming from Firestore.

After the transition, Firestore data backups to Cloud Storage can be less frequent (e.g., once or twice per day) for lower costs.
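For reference, a scheduled backup job along these lines might build its Firestore export request as sketched below. This is only an illustration of the request shape the Firestore admin API expects; the project ID, bucket name, and the surrounding scheduler/FirestoreAdminClient wiring are assumptions, not taken from this repo.

```javascript
// Hypothetical sketch: build the request object a scheduled Firestore export
// (e.g. a Cloud Scheduler-triggered function) would pass to
// FirestoreAdminClient.exportDocuments(). All names are placeholders.
function buildExportRequest(projectId, bucket, collectionIds) {
  // Timestamped prefix so each backup lands in its own folder.
  const stamp = new Date().toISOString().replace(/[:.]/g, '-');
  return {
    name: `projects/${projectId}/databases/(default)`,
    outputUriPrefix: `gs://${bucket}/firestore-backups/${stamp}`,
    collectionIds, // an empty array would export all collections
  };
}
```

Running this once or twice per day, instead of the current schedule, is what would realize the cost savings mentioned above.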

Related info:

we-ai commented 1 month ago

@danielruss and @anthonypetersen helped install an instance of the Stream Firestore to BigQuery extension in the dev tier. For each streamed Firestore collection, 2 files are generated in the BigQuery dataset:

In each of the above files, the most recent data streamed from the Firestore collection is converted into strings and saved in the `data` column. To keep the target BigQuery tables (participants, boxes, etc.) synced with the Firestore collections, the data in the `data` column needs to be transformed, loaded, and used to update the target tables in BigQuery.
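The transform step described above can be sketched as a small function that parses the JSON string in the `data` column back into typed columns for the target table. The changelog field names (`document_id`, `timestamp`, `operation`, `data`) follow the extension's output schema; the shape of the resulting record is an assumption for illustration.

```javascript
// Sketch: turn one streamed changelog row into a flat record suitable for a
// target table. The extension stores the whole Firestore document as a JSON
// string in `data`, so we parse it and spread its fields into columns.
function changelogRowToRecord(row) {
  const doc = JSON.parse(row.data);
  return {
    document_id: row.document_id,
    updated_at: row.timestamp,
    last_operation: row.operation, // IMPORT | CREATE | UPDATE | DELETE
    ...doc,
  };
}
```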

anthonypetersen commented 1 month ago

@we-ai is the transformation / loading something that needs to happen inside of BQ?

we-ai commented 1 month ago

> @we-ai is the transformation / loading something that needs to happen inside of BQ?

I don't see restrictions on the data handling, so I believe this can be done outside of BQ using Cloud Functions, etc.
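For comparison, the inside-BQ option could be expressed as a single BigQuery MERGE from the extension's `*_latest` view into a target table. The sketch below just builds such a statement as a string; the dataset, table, and column names are placeholders, not the actual Connect schema.

```javascript
// Hypothetical helper: build a BigQuery MERGE that upserts the latest
// changelog state into a target table and removes deleted documents.
// Table/view/column names are illustrative assumptions.
function buildMergeSql(targetTable, latestView, keyColumn) {
  return `
MERGE \`${targetTable}\` AS t
USING \`${latestView}\` AS s
ON t.${keyColumn} = s.document_id
WHEN MATCHED AND s.operation = 'DELETE' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET t.data = s.data, t.updated_at = s.timestamp
WHEN NOT MATCHED AND s.operation != 'DELETE' THEN
  INSERT (${keyColumn}, data, updated_at)
  VALUES (s.document_id, s.data, s.timestamp)`.trim();
}
```

A statement like this could be run on a schedule with a scheduled query, or invoked from a Cloud Function, so either placement is possible.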

anthonypetersen commented 1 month ago

how often will these files be generated?

jacobmpeters commented 1 month ago

@we-ai Thanks for testing this out, Warren. I saw the firestore_export dataset in dev. It looks like `*_latest` is just a view of the data in `*_changelog`, so there is no duplication. I can look into whether the transformation/update of the target tables could be done directly within BigQuery, but I'm not familiar with this process yet. I agree that it might require a cloud function.

I would love to retain the timestamp and operation fields in the target tables so that we have an idea of when each row/record was last updated. This could make our QC/reporting more efficient if we use that information well.
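Collapsing the changelog to one row per document while retaining `timestamp` and `operation`, as suggested above, could look like the following sketch. It assumes rows carry the extension's `document_name`/`timestamp`/`operation` fields; how deletes should be handled in the target tables is an open design question, so treating a trailing DELETE as a removal here is only an assumption.

```javascript
// Sketch: reduce changelog rows to the latest state per document, keeping
// the timestamp and operation fields for QC/reporting.
function latestPerDocument(rows) {
  const latest = new Map();
  for (const row of rows) {
    const prev = latest.get(row.document_name);
    if (!prev || new Date(row.timestamp) > new Date(prev.timestamp)) {
      latest.set(row.document_name, row);
    }
  }
  // Assumption: documents whose most recent operation is DELETE are dropped.
  return [...latest.values()].filter((r) => r.operation !== 'DELETE');
}
```

Carrying `timestamp` through to the target tables gives each row a "last updated" marker, which is what would make incremental QC possible.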

we-ai commented 3 weeks ago

After more checking of the Stream Firestore to BigQuery extension, I feel it doesn't meet our needs well because of the drawbacks below (especially the first one):

Without the "Stream Firestore to BigQuery" extension, the data syncing (from Firestore to BigQuery) can be more flexible, consisting of 2 main steps:

@jacobmpeters @anthonypetersen @JoeArmani Please let me know if you have suggestions.

anthonypetersen commented 3 weeks ago

@we-ai can you please include some screenshots that help visualize what the output from Stream Firestore to BigQuery looks like?

we-ai commented 3 weeks ago

> @we-ai can you please include some screenshots that help visualize what the output from Stream Firestore to BigQuery looks like?

Screenshots were posted above. I can post more if they're not visible or not clear enough.

we-ai commented 3 weeks ago

Below is a screenshot of participants_raw_changelog in firestore_export dataset of dev tier:

image

Schema of the output table:

image

anthonypetersen commented 3 weeks ago

@we-ai of your two drawbacks, I'm not really that concerned about the second one... it might be tedious initially, but it's a setup that only needs to be done one time for our tables (obviously we will need to repeat the setup if/when we add more tables)

Looking at the `data` column from your screenshot, the data appears to stay in the correct format. It seems like we would need some kind of middleman code that listens for updates from Stream Firestore to BigQuery, takes each update, and writes it to the correct BQ table.
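The "middleman" idea could be sketched as a small routing step that maps each streamed row to its target table based on the source collection. The collection-to-table mapping and the record shape below are assumptions for illustration, not taken from the Connect codebase; only the `document_name` path format follows Firestore's convention.

```javascript
// Hypothetical middleman sketch: given a streamed changelog row, decide which
// BigQuery table it belongs to and parse its payload. Table names are
// placeholders.
const TARGET_TABLES = {
  participants: 'Connect.participants',
  boxes: 'Connect.boxes',
};

function routeUpdate(row) {
  // document_name: projects/<p>/databases/(default)/documents/<collection>/<id>
  const path = row.document_name.split('/documents/')[1];
  const collection = path.split('/')[0];
  const table = TARGET_TABLES[collection];
  if (!table) throw new Error(`No target table configured for ${collection}`);
  return { table, record: JSON.parse(row.data) };
}
```

In practice this routing would sit inside whatever listener (Cloud Function, scheduled job, etc.) consumes the changelog, before the write to BQ.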

This option, as well as the other things you've mentioned above, all seem to require a bit of extra code to achieve our final goal. My question then becomes which option strikes the right balance of work required, ease of maintenance, and accuracy of the produced data.