Datavault-UK / automate-dv

A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0
513 stars 131 forks source link

[BUG] include_source_columns=false not working #181

Open klaus1978 opened 1 year ago

klaus1978 commented 1 year ago

Describe the bug After upgrading to 0.9.2 (from 0.8.3) I realized problems in the Stage macro when there are weird field names (eg with special characters or blanks) as the escaping is not done automatically anymore.

Environment

dbt version: 1.3.2 dbtvault version: 0.9.2 Database/Platform: Snowflake

To Reproduce if it is not me, but a general issue:

  1. Select a source table with unusual field names ( e.g. "/MY/COLUMN" )
  2. Create the stage model for the source table
  3. Compiling should already fail, even if you don 't use any of the "unusual" fields and disable include_source_columns: {{ dbtvault.stage(include_source_columns=false,

If you check the compiled file, you will see all fields from source in the CTE "source_data" are included ( I would have thought that would be prevented by include_source_columns=false) and not all of them are escaped by '" "' which is causing the problem with unusual field names...

Expected behavior Best would be the same behavior as in 0.8.3: all fields in CTE source_data were escaped...

see also https://dbtvault.slack.com/archives/CRLDUSDEF/p1673540131308529

Many thanks in advance Klaus

AB#5348

klaus1978 commented 1 year ago

Hi, just for clarification, it is not just the include_source_columns=false which is not working.

Also when I use the include_source_columns=true it creates a wrong CTE as the columns are not escaped anymore. Or, some are escaped and other are not...

Output would be eg:

-- Generated by dbtvault.

WITH source_data AS (

    SELECT

    MANDT,
    VBELN,
    POSNR,
    "MATNR",
    "MATWA",
    "PMATN",
    ...
    /BEV1/SRFUND,
    AUFPL_OLC,
    APLZL_OLC,
    /SIE/MED_QMNUM_S,
    /SIE/MED_CSENO_S,

    FROM ANALYTICSLAYER.AN_DVSCM_STAGE_D.RSTAGE_SV_P41_P_VBAP
),...
DVAlexHiggs commented 1 year ago

Yes understood. I was just making sure that the placeholder was removed from the title. Feel free to modify!

DVAlexHiggs commented 1 year ago

Hello, after some investigation this is an issue for a number of reasons.

In terms of a solution, the proposed solution of making it like 0.8.3 where all source columns are escaped won't work for the same reasons we removed it: Escaping by default causes problems in different platforms, and creates unintended errors (especially around casing in postgres) which are frustrating for the user at best and breaking at worst.

A potential way to fix this would be to have a new "escaped_column" config in the stage macro, which would then be used to escape columns in the source column selection, thus giving the user the control they need without causing problems for users who do not need to escape columns.

It would work as below:

{%- set yaml_metadata -%}
source_model:
    raw_source_name: source_table_name
escaped_columns:
   - "BOOKING DATE"
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}

{% set source_model = metadata_dict["source_model"] %}

{{ dbtvault.stage(include_source_columns=true,
                  source_model=source_model,
                  escaped_columns=escaped_columns) }}

Which would cause "BOOKING DATE" to be escaped in the source columns.

We're currently working out how this would interact with the existing escaping solution and will keep this issue updated.

@klaus1978 Please let us know what you think 😄

klaus1978 commented 1 year ago

Hi, hm this might work, as long as I can use the original field names afterwards? Or would I have to rename "BOOKING DATE" to eg. BOOKING_DATE?

KR Klaus

DVAlexHiggs commented 1 year ago

Hi, thanks for your response!

The idea here is that dbtvault would automatically ensure that the columns in escaped_columns would be referenced in their escaped form throughout the stage model to avoid syntax issues in that specific model, however this escaping would not be retained after the stage; i.e. in your hub you would need to refer to it as "BOOKING DATE" still, with your own escaping such as '"BOOKING DATE"'.

The best practise here would then be to create derived columns which 'fix' the problem columns (i.e. aliases "BOOKING DATE" as "BOOKING_DATE") and would mean you don't have to worry about this downstream. Saying this, this approach would still give users the option to either alias them or leave them as-is, depending on preference.

Does that make sense?

klaus1978 commented 1 year ago

Well, we have quite a lot of fields I'd have to rename. Not sure if this is a special case with our SAP environment. Generally, I would rather keep the original field names in the Raw Vault.

If you have to rename the fields anyway, then I guess there is another possibility where you (the dbtvault team) would not have to do anything: just rename it in a dbt model before you pass it to the stage macros?

KR Klaus