aws-samples / dbt-glue

This repository contains the dbt-glue adapter
Apache License 2.0
101 stars 69 forks source link

When using partitions, the function get_columns_in_relation returns unwanted rows #421

Closed Jeremynadal33 closed 2 months ago

Jeremynadal33 commented 3 months ago

Describe the bug

When using partitions, the function get_columns_in_relation returns additional, unwanted rows like : "Part 0", "Part 1", ...

This is a problem when using the merge_update_columns or merge_exclude_columns. The function is called by the spark__get_merge_update_columns macro and adds these irrelevant columns and then getting an error.

Steps To Reproduce

First, create minimal model using that use partitions (called test_partitions) :

{{
    config(
        materialized='table',
        partition_by=[ 'part_col1' ],
        file_format='delta',
    )
}}

with incoming_data as (
    select 1 as id, 'a' as part_col1
    union all
    select 2 as id, 'a' as part_col1
)

select * from incoming_data

Second, run it using :

dbt run -s test_partitions

Third, create a macro that logs the result of the function :

{% macro show_columns() %}
    {% set model_name = 'test_partitions' %}
    {% set _, target_relation = get_or_create_relation(
          database=none,
          schema=target.schema,
          identifier=model_name,
          type='table') -%}

    {% set coluns = adapter.get_columns_in_relation(target_relation) %}

    {{ log("Columns in relation " ~ target_relation.name ~" : " ~ coluns, info=true) }}
{% endmacro %}

Forth, run the macro using :

dbt run-operation show_columns

You should have something similar to the following as an output :

15:49:05  Columns in relation test_partitions : [GlueColumn(column='id', dtype='int', char_size=None, numeric_precision=None, numeric_scale=None), GlueColumn(column='part_col1', dtype='string', char_size=None, numeric_precision=None, numeric_scale=None), GlueColumn(column='Part 0', dtype='part_col1', char_size=None, numeric_precision=None, numeric_scale=None)]

Expected behavior

I should have only "actual" columns as an output. In the case described above, something similar to :

15:49:05  Columns in relation test_partitions : [GlueColumn(column='id', dtype='int', char_size=None, numeric_precision=None, numeric_scale=None), GlueColumn(column='part_col1', dtype='string', char_size=None, numeric_precision=None, numeric_scale=None)]

Screenshots and log output

See section above.

System information

The output of dbt --version:

Core:
  - installed: 1.8.3
  - latest:    1.8.4 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - glue:  1.8.1 - Up to date!
  - spark: 1.8.0 - Up to date!

The operating system you're using: MacBook Pro 2,3 GHz Intel Core i9 8 cœurs The output of python --version:

Python 3.12.4

Ideas to solve the issue

I think we should add a step similar to this one and get rid of this unwanted columns