dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[Bug] Macro adapter.get_columns_in_relation() doesn't return any columns during unit testing #10139

Open DmytroSly opened 3 months ago

DmytroSly commented 3 months ago

Is this a new bug in dbt-core?

Current Behavior

The model I configured a unit test for relies on the macro adapter.get_columns_in_relation(), which I call on another model referenced with the ref() macro. However, during unit testing adapter.get_columns_in_relation() returns an empty list for that relation, which causes the SQL code to compile improperly and fail.

Expected Behavior

The macro adapter.get_columns_in_relation() should return a list of columns for a referenced relation during unit testing.

Steps To Reproduce

Create a SQL model that relies on the list of columns returned by the macro adapter.get_columns_in_relation(), then configure and run a unit test for this model.
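
For illustration, a minimal sketch of this pattern (the model and column names are hypothetical; dbeatty10's reprex further down shows a concrete case):

```sql
-- models/my_model.sql (hypothetical sketch of the pattern described above)
-- Build the select list from the columns of a referenced model.
{%- set columns = adapter.get_columns_in_relation(ref('my_upstream_model')) -%}

select
    {%- for column in columns %}
    {{ column.name }}{{ "," if not loop.last }}
    {%- endfor %}
from {{ ref('my_upstream_model') }}
```

During a unit test, `columns` comes back empty, so the generated select list is blank and the compiled SQL is invalid.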

Relevant log output

No response

Environment

- OS: Windows 10 Enterprise
- Python: 3.11.4
- dbt: 1.8.0

Which database adapter are you using with dbt?

snowflake

Additional Context

I hope it is possible to fix this, since dbt certainly knows the list of columns for every referenced model. Apart from the input columns provided in the unit test config, dbt injects all other columns with NULL values into the resulting SQL for such a unit test. The input tables for the model I am trying to unit test are really wide (70-80 columns).

DmytroSly commented 3 months ago

FYI: I tried using the override functionality to set the list of columns returned by the macro adapter.get_columns_in_relation(), but the problem in my case is that the macro is called twice, each time for a different model. A different list of columns should be returned for each model, but the override makes the macro return the same list for both.
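
For context, a unit-test macro override is declared once per test, which is why it cannot return a different column list per relation. A rough sketch of the attempted workaround (the test and model names, the override key, and the return value are all illustrative, not confirmed syntax):

```yaml
unit_tests:
  - name: test_my_model        # hypothetical test name
    model: my_model            # hypothetical model that calls the macro twice
    overrides:
      macros:
        # A single static override applies to every call site in the model,
        # so both referenced relations receive this same column list.
        adapter.get_columns_in_relation: ["col_a", "col_b"]
```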

dbeatty10 commented 3 months ago

I'm not sure if this is the same underlying issue or not, but I just opened this issue that looks similar:

dbeatty10 commented 3 months ago

Thanks for reaching out about this @DmytroSly !

Could you provide a simplified version of the models you are having trouble with?

e.g., something like this:

models/model_a.sql

select 1 as id

models/model_b.sql

{%- set columns = adapter.get_columns_in_relation(ref('model_a')) -%}

{% for column in columns %}
  {{ log("Column: " ~ column, info=true) }}
{% endfor %}

select 2 as guid
Elisabeth-Kury commented 3 months ago

I ran into the same problem when writing a unit test using the dbt.utils get_columns_in_relation function, hence why I'm adding to this issue; hope that's fine :) I then wrote a really simplified test model, shown below. Basically, in this example I want to concatenate all column values except those of the columns createdat and updatedat.

The model input seeds/test.csv is just a simple test file with one row:

id,name,mail,date,updatedat,createdat
1,John,john.example,2024-05-08,2024-05-01,2024-05-01

models/stg_test.sql

{%- set columns = get_columns_in_relation(ref('test')) -%}
{% set column_names = columns|map(attribute='name')|reject("in", ['createdat','updatedat'])|list %}

select
    *, {{ concat( column_names ) }} as row_hash_value
from {{ ref('test') }}

The dbt run is successful and the compiled code looks fine:

select
     *, id || name || mail || date as row_hash_value
from "my_dwh"."seeds"."stg_test"
    )

However, when I run a dbt unit test with the same input as a csv fixture, the test fails with the following error message:

09:04:39    Runtime Error in unit_test logic_unit_test_test (models\b_staging\stg_monta\_stg_monta__unit_tests.yml)
  An error occurred during execution of unit test 'logic_unit_test_test'. There may be an error in the unit test definition: check the data types.
   Database Error
    syntax error at or near "as"
    LINE 42:             *,  as row_hash_value
                             ^

Looking at the compiled code, it seems that the dbt unit test is not correctly evaluating the get_columns_in_relation call, or the function is returning nothing:

with __dbt__cte__test as (

-- Fixture for test
select 

    cast('1' as integer)
 as id, 

    cast('John' as text)
 as name, 

    cast('john.example' as text)
 as mail, 

    cast('2024-05-08' as date)
 as date, 

    cast('2024-05-01' as date)
 as updatedat, 

    cast('2024-05-01' as date)
 as createdat
) select
    *,  as row_hash_value
from __dbt__cte__test
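
For comparison, based on the compiled output of the successful dbt run above, the unit test would be expected to compile to roughly this shape (a sketch, not actual dbt output):

```sql
-- Expected shape (sketch): the column list from get_columns_in_relation survives,
-- mirroring the compiled output of the successful `dbt run` shown earlier.
with __dbt__cte__test as (
    select
        cast('1' as integer) as id,
        cast('John' as text) as name,
        cast('john.example' as text) as mail,
        cast('2024-05-08' as date) as date,
        cast('2024-05-01' as date) as updatedat,
        cast('2024-05-01' as date) as createdat
)
select
    *, id || name || mail || date as row_hash_value
from __dbt__cte__test
```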

Hope this helps. It would be great if this were resolved, or if anyone has an idea how to fix this behaviour.

dbeatty10 commented 3 months ago

That is very helpful @Elisabeth-Kury 🤩

Full details for a reproducible example, including the unit test definition, are below.

### Reprex

`seeds/test.csv`

```csv
id,name,mail,date,updatedat,createdat
1,John,john.example,2024-05-08,2024-05-01,2024-05-01
```

`models/stg_test.sql`

```sql
{%- set columns = get_columns_in_relation(ref('test')) -%}
{% set column_names = columns|map(attribute='name')|reject("in", ['createdat','updatedat'])|list %}

select
    *, {{ concat( column_names ) }} as row_hash_value
from {{ ref('test') }}
```

`models/_unit_tests.yml`

```yaml
unit_tests:
  - name: test__stg_test
    model: stg_test
    given:
      - input: ref('test')
        rows:
          - {id: 2, name: Ari, mail: ari@example.com, date: 2024-01-01, updatedat: 2024-01-01, createdat: 2024-01-01}
    expect:
      rows:
        - {id: 2, name: Ari, mail: ari@example.com, date: 2024-01-01, updatedat: 2024-01-01, createdat: 2024-01-01, row_hash_value: 2Ariari@example.com2024-01-01}
```

```shell
dbt seed -s test --full-refresh
dbt run -s stg_test
dbt build -s stg_test
```

jtcohen6 commented 3 months ago

I think this is another case where actually materializing the fixtures as database objects would work, because then each call to get_columns_in_relation would just get the columns in each relation, without any need to mock different args / return values:

I expect that will be significantly slower in the general case, but it would allow the framework to support more-complex models — so while it shouldn't replace the default, it should be something that users can opt into.

DmytroSly commented 2 months ago

@dbeatty10, @jtcohen6, does this mean the issue will be fixed in dbt version 1.9?

jtcohen6 commented 2 months ago

@DmytroSly We've ambitiously slated it as a potential candidate for v1.9, but it's not guaranteed for inclusion.