Datavault-UK / automate-dv

A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0
514 stars 131 forks source link

[FEATURE] Allow hashed_columns to see derived columns to allow concatenated business keys #9

Closed balmasi closed 3 years ago

balmasi commented 4 years ago

I've come across a few instances where I need to hash a derived column in my staging area. Sometimes this is because I simply want to append a manual string to my BK, other times I want to rename something in my staging for use down the line, but I don't have access to it in the hashed_columns section so I have to write SQL or use a field name I'm renaming (or be forced to create an extra model to feed the stage macro)

for example, I have to do the following:

          hstg_sfdc__lead_core:
            vars:
              source_model:
                salesforce: Lead
              hashed_columns:
                HUB_FUNNEL_HK:
                  - "(SELECT 'sfdc.Lead' AS SCOPE_BK)" # SCOPE_BK
                  - Website # DOMAIN_BK        
              derived_columns:
                SCOPE_BK: '!sfdc.Lead'
                DOMAIN_BK: Website

What I'd like to do:

          hstg_sfdc__lead_core:
            vars:
              source_model:
                salesforce: Lead
              hashed_columns:
                HUB_FUNNEL_HK:
                  - SCOPE_BK
                  - DOMAIN_BK
              derived_columns:
                SCOPE_BK: '!sfdc.Lead'
                DOMAIN_BK: Website

Additionally, I'm finding an enterprise-wide unique natural key is hilariously difficult. For example imagine I have an HR system, Contractor system, and a partner company system and I want to represent a single concept of Employee.

image

As you can see the unique keys here are different in each case, so my Business Key has to be a concatenation of multiple keys in the source system. This is what is referred to as src_nk in your hub macro.

In the above example, if I had a 2-part key, passing both keys as src_nk to the hub macro would result in 2 columns SCOPE_BK and DOMAIN_BK, however, I'm only after a single concatenated column.

It would be amazing if dbtvault could support this concatenation as a first-class concept (using the same transformation built into the hash function)

In this case, I might simply have, in each of the staging configs that feed into the hub, something like:

          hstg_sfdc__lead_core:
            vars:
              source_model:
                salesforce: Lead
              hashed_columns:
                HUB_FUNNEL_HK: FUNNEL_BK
              derived_columns:
                HUB_FUNNEL_BK: 
                  - '!sfdc.Lead'
                  -  Website

This would generate a table that looks like this:

HUB_FUNNEL_HK HUB_FUNNEL_BK
98a7sd8a9s7da9fgag7a98a7sda sfdc.Lead||abc.com
98a7sd8a9gsdf3ggag7a98a7sda sfdc.Lead||def.com
98a7sd8a9s7da9fgag7a98a123a sfdc.Lead||efg.com

Currently, I don't even have access to the hash concat macro by itself cause it's a part of the hash macro.

Anyway, I know this is kinda multiple issues, but I wanted to put it out there to see if others felt similar pains.

Of course, as with everything else dbt, you can work around this by adding a separate model before your dbtvault one, but I felt I was doing this all the time and something that can be easily factored out.

DVAlexHiggs commented 4 years ago

Great ideas. Both of these features are quite and straight forward to implement. Derived columns bring accessible by hashed columns is something we already have in the backlog, and the derived column concatenation syntax is certainly very useful!

DVAlexHiggs commented 3 years ago

Added in v0.7.1 Close if working as expected

DVAlexHiggs commented 3 years ago

Closing as we have implemented the primary suggested feature. The concatenation of columns in the derived_columns configuration has been separated into a new issue (#20) and will be added to the next release (surprise: it's already been developed!)