MattTriano / analytics_data_where_house

An analytics engineering sandbox focusing on real estate prices in Cook County, IL
https://docs.analytics-data-where-house.dev/
GNU Affero General Public License v3.0

Implement tasks to roll out dbt models for the _standardized and _clean stages #55

Closed: MattTriano closed this issue 1 year ago

MattTriano commented 1 year ago

The _standardized stage dbt model (example) is where users should:

The _clean stage dbt model (example) will select the most recently modified version of each record (as distinguished by the surrogate key values). The dbt model for the _clean stage shouldn't need any information that wasn't already entered into the _standardized stage model, but copying that information over might be a bit hacky.
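The _clean-stage selection described above can be sketched as a window-function dedup: partition by the surrogate key, order by the modification timestamp, and keep only the newest row. A minimal sketch using `sqlite3` in place of the warehouse, with hypothetical column names (`pin`, `updated_at`) standing in for the real surrogate-key and modified-date columns:

```python
import sqlite3

# Hypothetical columns: pin (surrogate key), updated_at (modification date).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE standardized (pin TEXT, price INTEGER, updated_at TEXT)")
conn.executemany(
    "INSERT INTO standardized VALUES (?, ?, ?)",
    [
        ("14-21-111-001", 250000, "2021-01-15"),
        ("14-21-111-001", 265000, "2022-03-02"),  # newer version of the same record
        ("14-21-111-002", 410000, "2021-06-30"),
    ],
)

# The _clean-stage logic: keep only the most recently modified version
# of each record, as distinguished by the surrogate key.
clean_rows = conn.execute(
    """
    SELECT pin, price, updated_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY pin
                   ORDER BY updated_at DESC
               ) AS rn
        FROM standardized
    )
    WHERE rn = 1
    ORDER BY pin
    """
).fetchall()
print(clean_rows)
# → [('14-21-111-001', 265000, '2022-03-02'), ('14-21-111-002', 410000, '2021-06-30')]
```

In dbt this same query would reference the _standardized model via `ref()`, which is why the _clean model needs no information beyond what the _standardized model already encodes.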

Possible implementations

Manual Intervention

The system could generate a partially complete _standardized model (and perhaps the _clean model as well), then have the generating DAG fail with a message instructing the user to clean up the generated _standardized model stub (and the _clean stub, if it exists).
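One way the fail-loudly step could work is for the generator to leave a marker in the stub and for a task to raise until a user removes it. A sketch under assumed details: the `TODO_MARKER` string, the file name, and the use of a plain `RuntimeError` (a real Airflow task would raise an Airflow exception instead) are all hypothetical.

```python
import pathlib
import tempfile

# Hypothetical marker the code generator would leave in the stub; in the
# real DAG this check would run as a task and raise to fail the run.
TODO_MARKER = "-- TODO: review and complete this standardization logic"

def assert_stub_reviewed(model_path: pathlib.Path) -> None:
    """Fail loudly if the generated _standardized stub is still unedited."""
    if TODO_MARKER in model_path.read_text():
        raise RuntimeError(
            f"{model_path.name} still contains the generated TODO marker; "
            "edit the _standardized model stub, then re-run the DAG."
        )

# Demo: a freshly generated, unedited stub should trigger the failure message.
with tempfile.TemporaryDirectory() as tmp:
    stub = pathlib.Path(tmp) / "example_standardized.sql"
    stub.write_text(f"SELECT * FROM source_table\n{TODO_MARKER}\n")
    try:
        assert_stub_reviewed(stub)
    except RuntimeError as err:
        print("DAG would fail with:", err)
```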

Pros:

Cons:

Data Profiler implementation

It feels like it should be feasible to automate much of the standardization logic. For example, identifying a minimal spanning set of columns (for making a composite key) should be an algorithmic operation, but the only implementation I can think of right now involves running many expensive queries (although this would be a one-time cost per table, to generate the _standardized model file).
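The brute-force version of that column-set search can be sketched directly: for subsets of increasing size, run one `COUNT(DISTINCT ...)` query per candidate and stop at the first subset whose combinations are unique across all rows. The table and column names here are made up for illustration, and the one-query-per-subset loop is exactly the "many expensive queries" cost mentioned above.

```python
import itertools
import sqlite3

# Toy table where neither column is unique alone, but the pair is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (pin TEXT, sale_date TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("001", "2021-01-15", 250000),
        ("001", "2022-03-02", 265000),
        ("002", "2021-01-15", 410000),
    ],
)

def minimal_composite_key(conn, table, columns):
    """Return the smallest column subset whose combined values are unique.

    Brute force: one COUNT(DISTINCT ...) query per candidate subset, which
    is why this is expensive (though a one-time cost per table).
    """
    n_rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for size in range(1, len(columns) + 1):
        for subset in itertools.combinations(columns, size):
            cols = ", ".join(subset)
            n_distinct = conn.execute(
                f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table})"
            ).fetchone()[0]
            if n_distinct == n_rows:
                return subset
    return None  # no subset uniquely identifies rows

print(minimal_composite_key(conn, "sales", ["pin", "sale_date"]))
# → ('pin', 'sale_date')
```

Smarter orderings (e.g., pruning subsets containing a column with very low cardinality) could cut the query count, but the worst case is still exponential in the number of columns.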

Pros:

Cons:

I guess that kind of settles the ultimate question (i.e., the user can't be completely freed from having to review the standardization model), but the result can land somewhere between full automation and simple templating.