The _standardized stage dbt model (example) is where users should:

- define the set of columns that uniquely identify a record (i.e., a primary or composite key) across stages,
- generate a surrogate key from that uniquely identifying column set, and
- give that surrogate key column an appropriate column name.
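As a sketch, a generated _standardized model might look like the following. The source and column names here are made up for illustration; `dbt_utils.generate_surrogate_key` is the dbt-utils macro that hashes a column set into a single key column:

```sql
-- Hypothetical _standardized model; table and column names are illustrative.
with renamed as (

    select
        order_number,
        customer_id,
        modified_at
    from {{ ref('raw_orders') }}

)

select
    -- Hash the uniquely identifying column set into one surrogate key column,
    -- and give it a descriptive name.
    {{ dbt_utils.generate_surrogate_key(['order_number', 'customer_id']) }} as order_key,
    *
from renamed
```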
The _clean stage dbt model (example) will select the most recently modified version of each record (as distinguished by the surrogate key values). The dbt model for the _clean stage shouldn't need any information that wasn't already entered into the _standardized stage model, but copying that information over might be a bit hacky.
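A sketch of the corresponding _clean model, assuming the _standardized model exposes a surrogate key column and a last-modified timestamp (both names illustrative):

```sql
-- Hypothetical _clean model: keep only the most recently modified row
-- per surrogate key value.
with ranked as (

    select
        *,
        row_number() over (
            partition by order_key
            order by modified_at desc
        ) as recency_rank
    from {{ ref('stg_orders_standardized') }}

)

select * from ranked
where recency_rank = 1
```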
Possible implementations
Manual Intervention
The system could generate a partially complete _standardized model (and maybe also the _clean model) and then have the generating DAG fail with a message that instructs a user to go clean up the generated _standardized model stub (and _clean stub, if it exists).
Pros:
- would be the easiest to implement, and
- would avoid imposing choices on the data representation that might require manual cleanup.
Cons:
- would be somewhat difficult to document, and
- would add friction to the curation process.
Data Profiler implementation
It feels like it should be feasible to automate a lot of the standardization logic. For example, identifying a minimal spanning set of columns (i.e., the smallest set of columns whose values uniquely identify each row, for making a composite key) should be an algorithmic operation, but the only implementation I can think of right now would involve running many expensive queries (although this would be a one-time cost per table, to generate the _standardized model file).
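For instance, just testing whether one candidate column set is unique already costs a full aggregation over the table, and a search over candidate subsets would repeat a query of this shape many times (table and column names are illustrative):

```sql
-- If this returns no rows, the candidate set (order_number, customer_id)
-- uniquely identifies each record; the profiler would repeat this check
-- per candidate subset while searching for a minimal one.
select
    order_number,
    customer_id,
    count(*) as n_rows
from {{ ref('raw_orders') }}
group by order_number, customer_id
having count(*) > 1
```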
Pros:
- would make it largely frictionless to add a new data pipeline, at least up to the feature engineering stage, and
- would be interesting to implement.
Cons:
- would take a while to implement,
- would add a fair amount of complexity to the code, and
- wouldn't ever be perfect, so the user would still always have to review the model.
I guess that kind of settles the ultimate question (i.e., the user can't be completely freed from having to review the standardization model), but the result can still land somewhere between full automation and simple templating.