TobikoData / sqlmesh

Efficient data transformation and modeling framework that is backwards compatible with dbt.
https://sqlmesh.com
Apache License 2.0
1.61k stars 141 forks source link

SCD Type 2 from periodic snapshots #2009

Open plaflamme opened 7 months ago

plaflamme commented 7 months ago

This PR: https://github.com/TobikoData/sqlmesh/pull/1997 adds a new way of maintaining a SCD Type 2 model from detecting changes to the source table's columns.

This issue is for tracking the idea of extending this behaviour to build a SCD Type 2 from a table that contains periodic snapshots of the source data.

Imagine a source table that looks like this:

name,price
foo,20

And some periodic process that takes snapshot of this data and makes those snapshots available in another table, e.g.:

name,price,snapshot_date
foo,20,2024-01-01
foo,20,2024-01-02
foo,21,2024-01-03
foo,21,2024-01-04
food,21,2024-01-05

These snaphsots allow tracking the changes that were made to the individual rows (by comparing the values), but it also contains a timestamp that can be used to determine when those changes occured. As such, a SCD Type 2 dimension can be built from this data which might look like this:

name,price,valid_from,valid_to
foo,20,2024-01-01,2024-01-02
foo,21,2024-01-03,2024-01-04
food,21,2024-01-05,NULL

Ideally, the new SCD_TYPE_2_BY_COLUMN model kind would allow specifying a column (snapshot_date in this case) as the timestamp to use for determining when a row has changed instead of using execution_time.

gavin-dawn-capital commented 5 months ago

Adding a +1, would be ideal to have a 3rd model that combines the two SCD2 model types where I can specify the date (in this case a snapshot date or loaded at) and the columns to check for changes, when changes are detected the date column is used if not row is ignored

jonathan-ostrander commented 5 months ago

To take this one step farther, generating an SCD Type 2 table from a generic history would be helpful.