dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[Feature] option to generate dbt_scd_id as an integer column instead of a string for performance improvements #10300

Open graciegoheen opened 2 weeks ago

graciegoheen commented 2 weeks ago

Is this your first time submitting a feature request?

Describe the feature

When dbt generates your snapshot, one of the meta-fields it creates is dbt_scd_id - a unique key generated for each snapshotted record, used internally by dbt.

Currently, dbt_scd_id is a string because the hashing function used to build it returns a string. For the timestamp strategy, dbt_scd_id is a hash of unique_key combined with updated_at, so it essentially acts as a surrogate key.

Some folks want dbt_scd_id to instead be an integer for better performance.

If we swapped to an integer, we'd have to use a hashing function that outputs an integer instead of a string.
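To make the difference concrete, here is a rough Python sketch of the two shapes of key. It is only an illustration: dbt builds dbt_scd_id in SQL, and the function names, the `|` separator, and the choice to truncate an MD5 digest to 64 bits are all assumptions made for this example, not dbt's implementation.

```python
import hashlib


def scd_id_string(unique_key: str, updated_at: str) -> str:
    """Hypothetical string key: 32-character hex MD5 of the combined fields."""
    payload = f"{unique_key}|{updated_at}".encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def scd_id_int64(unique_key: str, updated_at: str) -> int:
    """Hypothetical integer key: first 8 bytes of the same digest.

    Interpreted as a signed 64-bit value so it would fit a BIGINT column.
    Truncating the digest is what raises the collision risk.
    """
    payload = f"{unique_key}|{updated_at}".encode("utf-8")
    digest = hashlib.md5(payload).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


if __name__ == "__main__":
    print(scd_id_string("42", "2024-06-01 00:00:00"))
    print(scd_id_int64("42", "2024-06-01 00:00:00"))
```

The integer version keeps only half of the 128-bit digest, which is where the trade-off between key width and collision risk comes from.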

Because of the risk of collision, we wouldn't want to make this the default for all users.
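For a rough sense of that risk, the standard birthday approximation estimates the chance of at least one collision among n keys drawn uniformly from a b-bit space as 1 - exp(-n(n-1) / 2^(b+1)). The sketch below assumes the hash behaves like a uniform random function, which is an idealization; the function name is made up for this example.

```python
import math


def collision_probability(n_rows: int, bits: int) -> float:
    """Birthday-bound estimate of at least one collision among n_rows keys.

    Assumes keys are uniformly distributed over a space of 2**bits values.
    """
    exponent = n_rows * (n_rows - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    for bits in (32, 64, 128):
        p = collision_probability(10**9, bits)
        print(f"{bits}-bit key, 1e9 rows: p ~ {p:.3g}")
```

Under this estimate, a 64-bit key over a billion snapshot rows has a collision chance of a few percent, while a 128-bit hash stays negligible, which is consistent with keeping the integer option opt-in.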

Instead, what if we had a config that allowed you to control the hashing function used when generating dbt_scd_id?

Describe alternatives you've considered

Creating a custom materialization to override the outputs from generate_surrogate_key