Datavault-UK / automate-dv

A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0
478 stars 114 forks source link

[FEATURE] Support String Value for Record Source and LDTS in HUBS/SATS/LINKS as we do for Staging #189

Open RonniePitts opened 1 year ago

RonniePitts commented 1 year ago

Currently the Staging method supports string values, e.g.

derived_columns: SOURCE_SYSTEM: '!{{ somevalue }}' INSTANCE: '!someinstance'

It would be nice to support the string value in Hubs/links/sats methods as well instead of just a column value.

Describe the solution you'd like Support the use of string values, e.g. ! for LDTS and RECORD SOURCE

AB#5363

RonniePitts commented 1 year ago

Probably some benefit instead of have unnecessary columns for LDTS and RECORD SOURCE in Staging layer.

DVAlexHiggs commented 1 year ago

Hi,

Welcome and thanks for your suggestion.

Whilst the idea makes sense on paper, I can't think of a use case where we wouldn't want the RECORD_SOURCE or LDTS in the staging layer.

There are a few reasons for the current implementation at this time:

  1. If we have it in the raw vault, hard coded then we have no traceability and no audit.

  2. RECORD_SOURCE and LDTS are often shared by multiple tables in the raw vault (units of work) and rather than repeating yourself everywhere, referencing the column is much cleaner, especially since that specific stage object will likely be loading multiple raw vault objects

  3. You may be loading your hubs and links from multiple staging objects, in which case hard coding the LDTS or RECORD_SOURCE is unwanted behavior, as it's likely those staging objects have multiple sources between them (each record could be from any of the sources)

For the above reasons I'm not sure having these columns in the stage is unnecessary. It is in fact quite standard.

Saying all this, we're not against adding this if it's seen to be useful by the wider community. Could you help us understand where this might be used?

Thanks!

RonniePitts commented 1 year ago

Hi Alex,

Actually, the value will be sourced from a macro and reference the dbt node name so the value will change depending on the model its referencing so technically it isn’t hardcoded as such.

We use a control table for our LDTS so depending on the model it will retrieve the control date from that source.

Understand typically its in the staging table but since its possible to support macros we can be a bit creative how we now populate this value downstream.

Thanks

DVAlexHiggs commented 1 year ago

Yeah this makes sense. Sometimes record sources are actually codes and it is good practice to have a lookup table which maps these to system names.

I presume this is what you mean? Sounds like you have a macro to do this kind of lookup

RonniePitts commented 1 year ago

Correct, I already use a macro for the Data Source aka Folder name retrieved from the node in STG.

It makes it more useable from a coding point than having columns for the sake of it.