ScalefreeCOM / datavault4dbt

Scalefree's dbt package for a Data Vault 2.0 implementation congruent to the original Data Vault 2.0 definition by Dan Linstedt including the Staging Area, DV2.0 main entities, PITs and Snapshot Tables.
https://www.scalefree.com/
Apache License 2.0
133 stars 26 forks source link

Value starting with space results in same hash key #211

Open OGrohmann opened 1 month ago

OGrohmann commented 1 month ago

In our staging model we receive duplicate hash keys during full extract because source system allows primary key with leading space. As datavault4dbt is trimming the values before generating the hash key, value 'ABC' results in same hash key as value ' ABC' (with leading space). What is the purpose of the trimming during hash key generation? In the example mentioned above, I need to have separate hash keys as attribute values are different for both entries. Is it possible to turn off the trimming in the configuration? Thanks!

tkirschke commented 1 month ago

Hi @OGrohmann ,

In your example, would "ABC" and " ABC" be two different business objects?

The TRIM function is used in our hash function to enable adding columns to existing satellites, without having to recalculate the hashdiff of existing records. Thanks to the TRIM, the NULL values for the new attributes lead to no changes in the hashdiff.

I agree that it should be made configurable for hashkeys. My first idea would be to make it configurable per hashkey/hashdiff, so that the hashed_columns inside a stage might look like this:

hashed_columns:
    hashkey_with_trim:
        - bk1
    hashkey_without_trim:
        - use_trim: false
        - columns:
            - bk1

But since this would have a huge impact on everything related to hashing, weneed some time to implement this.

A quickfix for you could be to create a derived column, which calls a REGEX_REPLACE, which is configured in a way, where leading whitespaces are replaced with special characters, like pipes (|).