aws-samples / dbt-glue

This repository contains the dbt-glue adapter
Apache License 2.0
101 stars 69 forks source link

Strange behavior of a table created using 'partition by' with iceberg #413

Open hsh8220 opened 3 months ago

hsh8220 commented 3 months ago

Describe the bug

I created an iceberg table with the 'partition_by' option, and when I use it to run unit tests or create docs, it misbehaves because it reads the wrong value like 'part 0' into the schema.

Steps To Reproduce

make source_table config for Iceberg table:

{{ config(
    materialized='incremental',
    incremental_strategy='append',
    file_format='iceberg',
    iceberg_expire_snapshots='False',
    table_properties={'format-version': '2'},
    partition_by=['timestamp']
) }}
...
}}

use source profile example

test:
  target: dev
  outputs:
    dev:
      type: glue
      query-comment: comment
      role_arn: arn:aws:iam::123456789101:role/role-name
      region: ap-northeast-2
      workers: 2
      worker_type: G.1X
      glue_version: "4.0"
      datalake_formats: iceberg
      schema: "schema_name"
      session_provisioning_timeout_in_seconds: 120
      location: "s3://bucket/location"
      default_arguments: >
        --enable-metrics=true,
        --enable-continuous-cloudwatch-log=true,
        --enable-continuous-log-filter=true,
        --enable-spark-ui=true,
        --spark-event-logs-path=s3://bucket/glue/spark-ui-log/dbt/,
        --hive.optimize.sort.dynamic.partition=true,
        --spark.sql.iceberg.handle-timestamp-without-timezone=true,
        --spark.sql.iceberg.use-timestamp-without-timezone-in-new-tables=true

test model (hello_world.sql)


{{ config(
    materialized='incremental',
    file_format='iceberg',
    iceberg_expire_snapshots='False',
    table_properties={'format-version': '2'},
) }}

WITH data as (
    SELECT app_name,
        device_id,
        row_number() OVER (
            PARTITION BY device_id
            ORDER BY app_name, device_id DESC
        ) as rn
    FROM {{ source('schema_name', 'source_table') }}
    GROUP BY
        app_name,
        device_id
    ORDER BY rn DESC
)

SELECT device_id as filter_ids 
FROM data
WHERE rn > 1
GROUP BY device_id

unit test profile

unit_tests:
  - name: test_hello_world

    # Always only one transformation to test
    model: hello_world

    # No inputs needed this time!
    # Most unit tests will have inputs -- see the "real world example" section below
    given:
      - input : source('schema_name', 'source_table')
        rows:
          - {device_id : test123, app_name : test1}
          - {device_id : test123, app_name : test1}
          - {device_id : test456, app_name : test2}

    # Expected output can have zero to many rows
    expect:
      rows:
        - {filter_ids: test123}
        - {filter_ids: test456}

Expected behavior

when 'dbt test --select hello_world --target dev run 'ok'

Screenshots and log output

error log

-- Fixture for source_table
select cast('test1' as string)
 as app_name, cast(null as string) as status, cast('test123' as string)
.... (other columns)

cast(null as timestamp) as part 0 
-----------------------------------------^^^
), data as (
    SELECT app_name,
        device_id,
        row_number() OVER (
            PARTITION BY device_id
            ORDER BY app_name, device_id DESC
        ) as rn
    FROM __dbt__cte__source_table
    GROUP BY
        app_name,
        device_id
    ORDER BY rn DESC
)

SELECT device_id as filter_ids
FROM data
WHERE rn > 1
GROUP BY device_id
    ) as __dbt_sbq
    where false
    limit 0

System information

The output of dbt --version:

Core:
  - installed: 1.8.4
  - latest:    1.8.4 - Up to date!

Plugins:
  - spark: 1.8.0 - Up to date!
  - glue:  1.8.1 - Up to date!

The operating system you're using:

The output of python --version:

Additional context

The problem occurs when doing unit tests, but also when importing schema from glue metadata via dbt docs generate, 'part 0' is generated unnecessarily. image