Datavault-UK / automate-dv

A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0
511 stars 131 forks source link

[FEATURE] Multiple Hub-Records in a Source Over the Same Row #236

Open MartinHofpower opened 5 months ago

MartinHofpower commented 5 months ago

Basic Requirement I want to create a link that refers to the same hub twice. For the link itself this basically works well via different qualifiers for the Hub-Hashkey that is derived in the staging view. The problem is that the referenced Hashkeys need to be inserted in the Hub as well but as far as I see it this is not possible for different columns from one source at the same time.

Example Let's assume we have one source src_customer_connection with the following columns:

When we load a link CustomerToCustomer_L it is easy to derive hashkeys for both customers and get something like Customer_HK_ParentRecord and Customer_HK_ChildRecord in the link. But these hashkeys should also be present in the Hub Customer_H itself. Let's say we declare a hub Customer_H with just CustomerNumber as natural business key. Then we can derive the Customer_H_HK either from CustomerNumber_ParentRecord or from CustomerNumber_ChildRecord in the staging model for src_customer_connection but not from both. If additional data (from additional source files) for the parent and child customer records are not present at the moment the src_customer_connection is loaded we get referential problems since the CustomerHK*** in the link might reference to non existing hub-records.

Describe the solution you'd like The best solution would be to insert the necessary records from the link into the hub automatically in the background but I understand that this is not how automate_dv is designed. What would fit into the framework from my point of view would be the possibility to somehow declare that one row of a source file could contain multiple hub entries. I think the precise metadata-definition is subject to discussion but this option would solve the problem implicitly.

Describe alternatives you've considered A workaround that I see right now is the declaration of separate staging views for the same source. This gives the possibility to declare the exact names for the hub based on different source column names.

Additional context This is a problem that came up in one of my projects and I didn't find it in any other issue. We are using automate_dv on sql-server.

AB#5439