The _standardized stage dbt model (example) is where users should:

- define the set of columns that uniquely identify a record (i.e., a primary or composite key) across stages,
- generate a surrogate key from that uniquely identifying column set, and
- give that surrogate key column an appropriate column name.
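As a sketch, a generated _standardized model might look like the following. The source and column names here are made up for illustration; `dbt_utils.generate_surrogate_key` is the dbt-utils macro that hashes a column set into a single key column:

```sql
-- Hypothetical _standardized model; table and column names are illustrative.
with renamed as (

    select
        order_number,
        customer_id,
        modified_at
    from {{ ref('raw_orders') }}

)

select
    -- Hash the uniquely identifying column set into one surrogate key column,
    -- and give it a descriptive name.
    {{ dbt_utils.generate_surrogate_key(['order_number', 'customer_id']) }} as order_key,
    *
from renamed
```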
The _clean stage dbt model (example) will select the most recently modified version of each record (as distinguished by the surrogate key values). The dbt model for the _clean stage shouldn't need any information that wasn't already entered into the _standardized stage model, but copying that information over might be a bit hacky.
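A sketch of the corresponding _clean model, assuming the _standardized model exposes a surrogate key column and a last-modified timestamp (both names illustrative):

```sql
-- Hypothetical _clean model: keep only the most recently modified row
-- per surrogate key value.
with ranked as (

    select
        *,
        row_number() over (
            partition by order_key
            order by modified_at desc
        ) as recency_rank
    from {{ ref('stg_orders_standardized') }}

)

select * from ranked
where recency_rank = 1
```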
Possible implementations
Manual Intervention
The system could generate a partially complete _standardized model (and maybe also the _clean model) and then have the generating DAG fail with a message that instructs a user to go clean up the generated _standardized model stub (and _clean stub, if it exists).
Pros:
- would be the easiest to implement, and
- would avoid imposing choices on the data representation that might require manual cleanup.
Cons:
- would be somewhat difficult to document, and
- would add friction to the curation process.
Data Profiler implementation
It feels like it should be feasible to automate a lot of the standardization logic. For example, identifying a minimal spanning set of columns (i.e., the smallest set of columns whose values uniquely identify each row, for making a composite key) should be an algorithmic operation, but the only implementation I can think of right now would involve running many expensive queries (although this would be a one-time cost per table, to generate the _standardized model file).
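For instance, just testing whether one candidate column set is unique already costs a full aggregation over the table, and a search over candidate subsets would repeat a query of this shape many times (table and column names are illustrative):

```sql
-- If this returns no rows, the candidate set (order_number, customer_id)
-- uniquely identifies each record; the profiler would repeat this check
-- per candidate subset while searching for a minimal one.
select
    order_number,
    customer_id,
    count(*) as n_rows
from {{ ref('raw_orders') }}
group by order_number, customer_id
having count(*) > 1
```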
Pros:
- would make it largely frictionless to add a new data pipeline, at least up to the feature engineering stage, and
- would be interesting to implement.
Cons:
- would take a while to implement,
- would add a fair amount of complexity to the code, and
- wouldn't ever be perfect, so the user would still always have to review the model.
I guess that kind of settles the ultimate question (i.e., the user can't be completely freed from having to review the standardization model), but the result can still land somewhere between full automation and simple templating.