ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/
5 stars 3 forks source link

Switch back to high availability tables #516

Closed jeancochrane closed 2 weeks ago

jeancochrane commented 2 weeks ago

This PR is the latest in a long effort to get high availability tables working with our dbt configuration (search "high availability" in our PRs for the full history).

While deploying previous versions of this PR, we had seen strange behavior whereby dbt was deleting the underlying S3 immediately after creating them. We were able to debug this and determine that it happens only for pre-existing tables when switching from non-high-availability to high availability and from the schema_table naming scheme to schema_table_unique: In these cases, the Glue database retains a reference to an old version of the table with the schema_table naming scheme (e.g. /location/access), and then when a high-availability version is created to supercede that old version with a unique naming scheme (e.g. /location/access/<uuid>), dbt can end up deleting the old version and with it the whole prefix that contains the new data (e.g. /location/access).

In order to properly handle this behavior, we need to make sure to delete all of our Glue databases that referenced materialized tables before deploying this PR. If that proves to be too much trouble, we could also consider just rerunning the new DAG build a couple of times, at which point the reference to the old schema_table version of the database will get deleted and subsequent runs will correctly clean up old versions without wiping away the entire S3 prefix that contains the table data.