Closed. MaxHalford closed this issue 8 months ago.
There's some good stuff here too. The way I see it, there are two options:

1. Run the `CREATE DATASET` and all subsequent queries in a massive transaction. The transaction is only committed if the whole DAG succeeds. This is a clean solution. The only thing is that I'm doubtful of the ability to do transactions with many massive queries. To be explored.
2. Create each table with a `__staging` suffix. Then do a transaction to replace the current production tables with the staging tables, and drop the staging tables. The issue with this is that there will be `__staging` tables left behind if the DAG fails at some point. But maybe this is a good thing for inspecting why a particular view failed. (A sketch of this approach follows the list.)

My initial gut feeling was to create all tables in a dedicated dataset, and then switch datasets. You can do that with Snowflake and DuckDB, but not with BigQuery. I think it's important to find a strategy that works with every database we think we'll support.
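Here is a minimal sketch of option 2 against DuckDB; the `views` dict, table names, and database file name are made up for illustration:

```python
import duckdb

con = duckdb.connect("warehouse.db")

# Hypothetical DAG outputs: view name -> query.
views = {
    "customers": "SELECT 1 AS id, 'Alice' AS name",
    "orders": "SELECT 1 AS customer_id, 9.99 AS amount",
}

# Step 1: materialize every view under a __staging suffix. If any query
# fails, the production tables are untouched, and the leftover __staging
# tables can be inspected to debug the failure.
for name, query in views.items():
    con.execute(f"CREATE OR REPLACE TABLE {name}__staging AS {query}")

# Step 2: the whole DAG succeeded, so promote the staging tables to
# production in a single transaction and clean up.
con.execute("BEGIN TRANSACTION")
for name in views:
    con.execute(f"DROP TABLE IF EXISTS {name}")
    con.execute(f"ALTER TABLE {name}__staging RENAME TO {name}")
con.execute("COMMIT")
```

The swap itself only runs renames and drops, so the transaction stays small even when the queries that built the staging tables were massive.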
Done! See the `--wap` flag.
Also called blue-green deployment in the software engineering world.
When a refresh happens, the views are modified in-place. During the refresh, some views are therefore out of sync with the others. This can cause race conditions and tricky situations. For instance, say you refresh 10 views and then an error is raised for the 11th, with 30 views still to go. You now have out-of-sync views, which is not a good situation.
One solution to this could be to create a temporary schema, create the views there, and then swap the target schema with the temporary schema. This must be a known way of doing things in the database world, so a little bit of research could prove worthwhile. A rough sketch follows.
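As an illustration of that idea, here is a sketch using Snowflake's `ALTER SCHEMA ... SWAP WITH` command; the connection parameters, schema names, and `views` dict are all hypothetical placeholders:

```python
import snowflake.connector

# Hypothetical connection; fill in real credentials.
con = snowflake.connector.connect(
    account="my_account",
    user="me",
    password="...",
    database="MY_DB",
)

# Hypothetical DAG outputs: view name -> query.
views = {
    "customers": "SELECT 1 AS id, 'Alice' AS name",
}

cur = con.cursor()

# Build everything in a throwaway schema; production stays untouched
# and fully consistent while this runs.
cur.execute("CREATE OR REPLACE SCHEMA analytics__tmp")
for name, query in views.items():
    cur.execute(f"CREATE VIEW analytics__tmp.{name} AS {query}")

# Atomically exchange the two schemas, then drop the old contents.
cur.execute("ALTER SCHEMA analytics SWAP WITH analytics__tmp")
cur.execute("DROP SCHEMA analytics__tmp")
```

Readers always see either the old set of views or the new one, never a mix, which is exactly the blue-green behaviour described above. The catch, as noted earlier, is that BigQuery has no equivalent of a schema swap.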