There are currently two options for validity column (i.e. "valid from" and "valid to") values:
based on "pipeline run timestamp" ➜ default
based on arbitrary value configured using the boundary_timestamp argument
I would like to see a third option:
based on source data column
In some cases an updated_at-style column in the source data "naturally" provides a good value for the validity columns.
It then makes sense to also detect changes based on this column, i.e. to base the "row hash" on a composite of the natural key and updated_at (i.e. the surrogate key).
Feature description
There are currently two options for validity column (i.e. "valid from" and "valid to") values:
boundary_timestamp
argumentI would like to see a third option:
In some cases an
updated_at
-style column in the source data "naturally" provides a good value for the validity columns.It then makes sense to also detect changes based on this column, i.e. to base the "row hash" on a composite of the natural key and
updated_at
(i.e. the surrogate key).Are you a dlt user?
Yes, I'm already a dlt user.
Use case
User asked for it: https://github.com/dlt-hub/dlt/issues/1601#issuecomment-2264975587
Proposed solution
Extend public API with
natural_key
andboundary_timestamp_column
arguments:Related issues
https://github.com/dlt-hub/dlt/issues/1601#issuecomment-2248545639 (specifically option 1)