dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

`scd2` validity column values based on column in source data #1776

Open jorritsandbrink opened 3 weeks ago

jorritsandbrink commented 3 weeks ago

Feature description

There are currently two options for validity column (i.e. "valid from" and "valid to") values:

I would like to see a third option:

In some cases an updated_at-style column in the source data "naturally" provides a good value for the validity columns.

It then makes sense to detect changes based on this column as well, i.e. to base the "row hash" on the composite of the natural key and updated_at (which together form the surrogate key).
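To illustrate the idea, here is a minimal sketch in plain Python of a row hash built only from the natural key and the timestamp column. The function name `row_hash`, the separator, and the choice of SHA-256 are assumptions for illustration; this is not how dlt computes its row hash internally.

```python
import hashlib
from typing import Any, Dict


def row_hash(
    row: Dict[str, Any],
    natural_key: str = "customer_id",  # assumed column name, taken from the example in this issue
    ts_col: str = "updated_at",        # assumed column name, taken from the example in this issue
) -> str:
    """Hash only the natural key and the timestamp column (hypothetical helper).

    Two rows with the same natural key and the same timestamp produce the
    same hash, regardless of what the other columns contain.
    """
    payload = f"{row[natural_key]}|{row[ts_col]}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With a hash like this, a record counts as changed only when its `updated_at` value changes, so the approach relies on the source bumping that column on every update.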

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

User asked for it: https://github.com/dlt-hub/dlt/issues/1601#issuecomment-2264975587

Proposed solution

Extend the public API with natural_key and boundary_timestamp_column arguments:

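import dlt

# NOTE: "natural_key" and "boundary_timestamp_column" are the keys proposed
# in this issue; they are not part of the existing API.
@dlt.resource(
    write_disposition={
        "disposition": "merge",
        "strategy": "scd2",
        "natural_key": "customer_id",
        "boundary_timestamp_column": "updated_at",
    }
)
def customers():
    # placeholder record for illustration only
    yield {"customer_id": 1, "updated_at": "2024-01-01T00:00:00Z"}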
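As a rough sketch of the semantics I have in mind (plain Python, not dlt code): for each natural key, versions are ordered by the timestamp column, "valid from" takes the record's own timestamp, and "valid to" takes the timestamp of the next version, or stays empty for the active record. The helper name `derive_validity` is hypothetical, and the validity column names are passed in as parameters with assumed defaults.

```python
from collections import defaultdict
from typing import Any, Dict, List


def derive_validity(
    rows: List[Dict[str, Any]],
    natural_key: str = "customer_id",
    ts_col: str = "updated_at",
    valid_from: str = "_dlt_valid_from",  # assumed validity column name
    valid_to: str = "_dlt_valid_to",      # assumed validity column name
) -> List[Dict[str, Any]]:
    """Derive validity windows from a timestamp column in the source data."""
    # Group all versions of a record by its natural key.
    by_key: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        by_key[row[natural_key]].append(row)

    out: List[Dict[str, Any]] = []
    for versions in by_key.values():
        # Order versions chronologically; ISO 8601 strings sort correctly.
        ordered = sorted(versions, key=lambda r: r[ts_col])
        # Pair each version with its successor (None for the latest version).
        for current, nxt in zip(ordered, ordered[1:] + [None]):
            out.append(
                {
                    **current,
                    valid_from: current[ts_col],
                    valid_to: nxt[ts_col] if nxt is not None else None,
                }
            )
    return out


rows = [
    {"customer_id": 1, "name": "foo", "updated_at": "2024-01-01T00:00:00"},
    {"customer_id": 1, "name": "bar", "updated_at": "2024-02-01T00:00:00"},
    {"customer_id": 2, "name": "baz", "updated_at": "2024-01-15T00:00:00"},
]
history = derive_validity(rows)
```

Here the first version of customer 1 is closed at the timestamp of the second version, while the latest version of each customer keeps an empty "valid to".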

Related issues

https://github.com/dlt-hub/dlt/issues/1601#issuecomment-2248545639 (specifically option 1)