MattTriano / analytics_data_where_house

An analytics engineering sandbox focusing on real estate prices in Cook County, IL
https://docs.analytics-data-where-house.dev/
GNU Affero General Public License v3.0

Implement a strategy for dropping "temp_" tables #48

Closed MattTriano closed 1 year ago

MattTriano commented 1 year ago

The "temp_" tables are useful when developing expectations, but outside of that, they will just be recreated every time fresh data is ingested.

I see two possible courses of action and I'm not sure which I prefer yet.

  1. Implement a task in the update DAG that drops the "temp_" table if a suite of expectations already exists for the data set, and otherwise leaves it. OR
  2. Implement a manually or periodically run maintenance DAG that drops all "temp_" tables.
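The decision in option 1 could be sketched roughly as below. This is only an illustration: the function name `should_drop_temp_table`, the `data_raw.<data_set>.warning` suite-naming pattern, and passing the known suite names in as a plain set are all assumptions, not the project's actual code.

```python
from __future__ import annotations


def should_drop_temp_table(
    table_name: str,
    existing_suites: set[str],
    prefix: str = "temp_",
) -> bool:
    """Return True if a "temp_" table can be dropped after an update.

    Hypothetical rule: drop only when an expectation suite already exists
    for the data set, so the table stays around while a suite is still
    being developed.
    """
    if not table_name.startswith(prefix):
        # Never touch tables that are not "temp_" tables.
        return False
    data_set_name = table_name[len(prefix):]
    # Assumed suite-naming convention; adjust to the real one.
    suite_name = f"data_raw.{data_set_name}.warning"
    return suite_name in existing_suites


suites = {"data_raw.cook_county_parcel_sales.warning"}
print(should_drop_temp_table("temp_cook_county_parcel_sales", suites))
print(should_drop_temp_table("temp_new_data_set", suites))
```

With a rule like this, a data set that has no suite yet keeps its "temp_" table, which is the short-term behavior option 1 would need to preserve.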

The former is the ideal long-run solution (i.e., after expectations for the raw data ingestion are developed and mature), but in the short term it would complicate the process of editing a new suite of expectations.

MattTriano commented 1 year ago

About a month ago, I implemented a cleanup DAG that can be manually run to drop all data_raw.temp_* tables, but I left this issue open as I wasn't sure whether it was better to integrate this drop into the update_socrata_table task_group (or an update_xyz_table task_group if/when I develop connectors to other data sources).
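For reference, the core of such a cleanup step might look something like the sketch below. It is not the DAG's actual code: the helper names `quote_ident` and `build_drop_statements` are hypothetical, and it assumes the table names in the data_raw schema have already been fetched (for example, from `information_schema.tables`) and that the resulting statements would be executed through whatever database connection the DAG uses.

```python
from __future__ import annotations

from typing import Iterable


def quote_ident(name: str) -> str:
    """Double-quote a SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_drop_statements(
    table_names: Iterable[str],
    schema: str = "data_raw",
    prefix: str = "temp_",
) -> list[str]:
    """Build one DROP TABLE statement per table whose name starts with prefix."""
    return [
        f"DROP TABLE IF EXISTS {quote_ident(schema)}.{quote_ident(name)};"
        for name in sorted(table_names)
        if name.startswith(prefix)
    ]


# Example table names are made up for illustration.
tables_in_schema = ["temp_parcel_sales", "parcel_sales", "temp_parcel_locations"]
for statement in build_drop_statements(tables_in_schema):
    print(statement)
```

Filtering with `str.startswith` rather than a SQL `LIKE 'temp_%'` pattern avoids the pitfall that `_` is a single-character wildcard in `LIKE`, which could otherwise match names such as `tempX...`.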

But now I'm pretty confident that I want to leave the cleanup decision to the user, as it's been useful to have a clean pull of the data to check (as opposed to the other version of the table in data_raw, which contains all distinct versions of retrieved records). So I'm going to close this issue out.