dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.59k stars 1.59k forks source link

[Feature] Support contracts in Redshift even when the first row has `NULL` values #10490

Open gekas93 opened 1 month ago

gekas93 commented 1 month ago

Is this a new bug in dbt-core?

Current Behavior

I'm having a strange issue with the contracts in my dbt model and working with redshift.

I have fixed some fields with a data_type = 'date' and they were working well but I have added a NULLIF condition like this:

cast(NULLIF(psl.plant_start_date, '9999-12-31') as date) as plant_start_date,
cast(NULLIF(psl.plant_end_date, '9999-12-31') as date) as plant_end_date,

Before of adding the NULLIF function, everything works as expected.

After adding it I am having this message: column_name definition_type contract_type mismatch_reason
plant_end_date TEXT DATE data type mismatch
plant_start_date TEXT DATE data type mismatch

I have tried to cast in each place (before, after, in the middle more than one time) but I always have the same issue. I tried to use a case instead of NULLIF but the same happens.

I have token a look in the macros and I have see that it's this macro the responsible of the behaviour:

{% macro assert_columns_equivalent(sql) %}

  {#-- First ensure the user has defined 'columns' in yaml specification --#}
  {%- set user_defined_columns = model['columns'] -%}
  {%- if not user_defined_columns -%}
      {{ exceptions.raise_contract_error([], []) }}
  {%- endif -%}

  {#-- Obtain the column schema provided by sql file. #}
  {%- set sql_file_provided_columns = get_column_schema_from_query(sql, config.get('sql_header', none)) -%}
  {#--Obtain the column schema provided by the schema file by generating an 'empty schema' query from the model's columns. #}
  {%- set schema_file_provided_columns = get_column_schema_from_query(get_empty_schema_sql(user_defined_columns)) -%}
    {{ log("sql_file_provided_columns: " ~ sql_file_provided_columns) }}
    {{ log("schema_file_provided_columns: " ~ schema_file_provided_columns) }}

I have added two logs at the end to confirm it. The sql_file_provided_columns is giving me the dates as a varchar.

Expected Behavior

DBT detects that the definition type is DATE for plant_start_date and plant_end_date

Steps To Reproduce

  1. Create a SQL with cast(NULLIF(<date_field>, '9999-12-31') as date) as <date_field_name>
  2. Define the contract and define your date name field with a data_type equal to date
    - name: <date_field_name>
        data_type: date
        description: ""
  3. Run your dbt run or dbt build command with the model

Relevant log output

18:00:42.359462 [debug] [Thread-1  ]: sql_file_provided_columns: [..., <Column plant_start_date (character varying(256))>, <Column plant_end_date (character varying(256))>]
18:00:42.360137 [debug] [Thread-1  ]: schema_file_provided_columns: [..., <Column plant_start_date (DATE)>, <Column plant_end_date (DATE)>]
18:00:42.370309 [debug] [Thread-1  ]: On model.project.model_name: Close
18:00:42.373215 [debug] [Thread-1  ]: Compilation Error in model model_name (models/marts/pipeline_folder/model_name.sql)
  This model has an enforced contract that failed.
  Please ensure the name, data_type, and number of columns in your contract match the columns in your model's definition.

  | column_name      | definition_type | contract_type | mismatch_reason    |
  | ---------------- | --------------- | ------------- | ------------------ |
  | plant_end_date   | TEXT            | DATE          | data type mismatch |
  | plant_start_date | TEXT            | DATE          | data type mismatch |

Environment

- OS: MAC 
- Python: 3.9.18
- dbt-redshift = 1.8.0
- dbt-core = 1.8.1

Which database adapter are you using with dbt?

redshift

Additional Context

No response

dbeatty10 commented 1 month ago

Thanks for reaching out @gekas93 !

This looks like https://github.com/dbt-labs/dbt-redshift/issues/659 where Redshift treats null values as VARCHAR/TEXT even when you say something like cast(null as date) or null::date.

https://github.com/dbt-labs/dbt-redshift/issues/659#issuecomment-1924345236 discusses our suggested workaround which is to add something like this to the beginning of your model to handle this Redshift-specific behavior:

select 
    '9999-12-31'::date as plant_start_date,
    '9999-12-31'::date as plant_end_date,
    # ... the rest of your columns here
where 1=0
union all

Could you try that out and see if it resolves the issue?

gekas93 commented 1 month ago

Thanks for your quick response @dbeatty10 !

Sorry, because I tried to find some topics related with mine but I didn't found anything.

Yes, with your approach it's working but it's a problem when you have a big number of columns. In my case, in this table I have like 50 columns, so I had to add more than 50 code lines and 50 manual casts.

There will be some approach in the future that will solve this?

Obviously, if I add the IFNULL operation in a previously model and it's materialized as a table, I don't have this issue. But if it's materialized as a view (in my case has to be like that), it is still there.