Datavault-UK / automate-dv

A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0
511 stars 131 forks source link

[FEATURE] Hub natural keys as strings #135

Open abettsmp opened 2 years ago

abettsmp commented 2 years ago

Is your feature request related to a problem? Please describe. The current release of dbtvault creates the natural keys in the hub based on the type exposed from staging view. If the type of this column is an integer (or other restrictive type) then the hub cannot be the target for sources with broader types for their representation of the natural key. As an example first system uses some id type column (integer) for company entities and then you wish to bring in SalesForce company type entities which represents the natural key as a guid (or similar).

Describe the solution you'd like Default hub macro to force all fields of the natural key of a hub to be a string type.

Describe alternatives you've considered If this is an issue for backward compatibility, expose the functionality via a flag/parameter.

Additional context taken from: Cuba, Patrick. The Data Vault Guru: a pragmatic guide on building a data vault (p. 99). Kindle Edition.

Cast to text Statement: all business keys in a data warehouse or a data lake must be stored as text fields.

There are additional, and I believe valid, reasons given in this section.

AB#5373

DVAlexHiggs commented 1 year ago

Hi. Apologies for the very late response on this. I thought I had replied.

Can I ask if you've tried derived columns? You can cast columns to whatever you want using derived columns.

Here's an example

Generally speaking you shouldn't be modifying the types on the way in to the Raw Vault, but I do understand there's sometimes a need.

I can see how this could be easily implemented as a config flag in the hub though, if that seems more sensible to you?

FrenkLoonen commented 1 year ago

Hi Alex, just to chime in here.

In my project, where we generate all dbtvault config files, we have indeed baked it in to the derived columns. But as you said, it would be nicer to have something like a config flag, as we must make sure that it is indeed applied to all Hub BKs.

Although, if you build it in to the Hub macro, it might not be really visible to the end user. You could perhaps also make it similar to the null_columns, maybe something like string_columns?

In any case, this feature gets my upvote :)

DVAlexHiggs commented 1 year ago

In that case, I'm thinking a typed_columns/cast_columnssection:

typed_columns:
  string_columns:
    - CUSTOMER_ID
  integer_columns:
    - STOCK_QUANTITY

etc.. We can make it string columns only for now and expand on it, but this could be a solution for all typing issues/rules long-term.

Perhaps uneccesary to do anything other than strings though? Not sure

DVAlexHiggs commented 1 year ago

Additionally, we could have a force_hub_bks_to_string global config as a way to do it across all hubs?

The reason I suggest a hub config option though is because these can be provided in YAML properties files and dbt_project.yml

All this config option would do is trigger a CAST on the BK insider the Hub SQL. So, in that regard it would be visible.

FrenkLoonen commented 1 year ago

Not sure if I can think of a need for integer columns right now. But maybe date columns? Based on a certain format. On my current project I have to deal with all types of date formatting that are being delivered as strings.

DD-MM-YYYY DD-MMM-YYYY YYYY.MM.DD And the list goes on and on..


typed_columns:
  string_columns:
    - CUSTOMER_ID
  date_columns:
    - SHIPPING_DATE: DD-MM-YYYY
    - ORDER_DATE: YYYY.MM.DD
FrenkLoonen commented 1 year ago

Additionally, we could have a force_hub_bks_to_string global config as a way to do it across all hubs?

The reason I suggest a hub config option though is because these can be provided in YAML properties files and dbt_project.yml

All this config option would do is trigger a CAST on the BK insider the Hub SQL. So, in that regard it would be visible.

This sounds perfect!