We want the Synapse offshore wind data to be updated weekly in BQ. There are a few ways we could do this. For all options, we'll need to archive the data in GCS so we can fall back onto older archives if a change in the data breaks the ETL.
First, we'll need to write a new dgm archiver that pulls the offshore wind table data using pyairtable. This archiver will save the data to gs://dgm-archives each week. (10 hrs)
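Here is a minimal sketch of what that archiver could look like, assuming an Airtable personal access token and GCP credentials are available via the environment; the base ID, table name, and blob layout below are placeholders, not the actual dgm archiver interface:

```python
import datetime
import json
import os

from google.cloud import storage
from pyairtable import Api

# Hypothetical identifiers -- the real base ID and table name live in the
# Synapse Airtable workspace and would be configured elsewhere.
AIRTABLE_BASE_ID = os.environ["SYNAPSE_BASE_ID"]
OFFSHORE_WIND_TABLE = "Offshore Wind Projects"
ARCHIVE_BUCKET = "dgm-archives"


def archive_offshore_wind() -> str:
    """Pull the offshore wind table from Airtable and write it to GCS."""
    api = Api(os.environ["AIRTABLE_API_KEY"])
    records = api.table(AIRTABLE_BASE_ID, OFFSHORE_WIND_TABLE).all()

    # Version the archive by date so the ETL can pin a known-good snapshot.
    version = datetime.date.today().isoformat()
    blob_path = f"synapse_offshore_wind/{version}/offshore_wind.json"

    bucket = storage.Client().bucket(ARCHIVE_BUCKET)
    bucket.blob(blob_path).upload_from_string(
        json.dumps(records), content_type="application/json"
    )
    return blob_path
```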
Option 1: Manual ETL updates
We pull the latest archive version number, add it to the ETL code, and open a PR. If CI passes, we merge it and the changes propagate to dev. This is the simplest option, but there's little automation and we'll probably forget to update the data now and then! I could see us doing this monthly or quarterly. (Probably 30 min per update) A sketch of what the version bump might look like is below.
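For illustration, the "add it to the ETL code" step could be as small as bumping a pinned version constant; the constant name, date value, and path helper here are hypothetical, not the actual ETL module:

```python
# Hypothetical ETL constant: each manual-update PR would just bump this
# pinned archive version to the latest one available in gs://dgm-archives.
SYNAPSE_OFFSHORE_WIND_ARCHIVE_VERSION = "2024-01-08"


def offshore_wind_archive_uri(
    version: str = SYNAPSE_OFFSHORE_WIND_ARCHIVE_VERSION,
) -> str:
    """Build the GCS path the ETL reads for a given archive version."""
    return f"gs://dgm-archives/synapse_offshore_wind/{version}/offshore_wind.json"
```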
Option 2: GitHub Action
Create a GitHub Action that runs weekly (after the archive), pulls the latest version of each dataset we want to update automatically, creates a configuration file with these versions, and runs the ETL using that config file. Should we have this run on dev and main? (15 hrs, ~0 per update). A sketch of the weekly job is below.
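As a sketch of what the Action's weekly job could call; the dataset list, config file name, and ETL entry point are all assumptions, not the repo's actual commands:

```python
import json
import subprocess

from google.cloud import storage

ARCHIVE_BUCKET = "dgm-archives"
# Datasets we want refreshed automatically each week (assumed naming).
AUTO_UPDATE_DATASETS = ["synapse_offshore_wind"]


def latest_version(bucket: storage.Bucket, dataset: str) -> str:
    """Find the newest dated archive folder for a dataset in GCS."""
    versions = {
        blob.name.split("/")[1]
        for blob in bucket.list_blobs(prefix=f"{dataset}/")
    }
    return max(versions)


def run_weekly_update() -> None:
    bucket = storage.Client().bucket(ARCHIVE_BUCKET)
    versions = {ds: latest_version(bucket, ds) for ds in AUTO_UPDATE_DATASETS}

    # Write the config file the ETL run will consume.
    with open("archive_versions.json", "w") as f:
        json.dump(versions, f, indent=2)

    # Hypothetical ETL entry point; the real command depends on the repo.
    subprocess.run(
        ["python", "-m", "etl", "--config", "archive_versions.json"], check=True
    )


if __name__ == "__main__":
    run_weekly_update()
```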
Questions
Do we want the data to be updated in dev and prod weekly?
Re-scoped this issue to cover only a weekly manual data update. The question of automated updates was moved to #359. This reduced-scope issue was closed by #355.