aws-samples / dbt-glue

This repository contains the dbt-glue adapter
Apache License 2.0

DBT Tests not working with Glue Iceberg Tables #411

Open Wellington-Costa91 opened 4 months ago

Describe the bug

I'm reaching out for assistance with running DBT tests using AWS Glue Iceberg tables. It appears that the test module does not support the glue_catalog prefix required for Iceberg Tables. I have attempted several workarounds without success.

Versions: Running with dbt=1.8.4 Registered adapter: glue=1.8.1

Steps To Reproduce

Create a dbt profile for Iceberg tables

Sample config for Iceberg table:

{{
  config(
    unique_key=["my_key"],
    partition_by=["date_ingestion"],
    materialized="incremental",
    incremental_strategy='merge',
    file_format='iceberg',
    table_properties={'format-version': '2'},
    iceberg_expire_snapshots='False',
    tags=["my_tag_name"],
    pre_hook="SET hive.default.fileformat=parquet",
  )
}}

Sample of the Glue Profile

glue_profile:
  outputs:
    silver_light:
      type: glue
      query-comment: Profile for  Silver Layer Dev
      role_arn: arn:aws:iam::123456789101:role/role-name
      region: sa-east-1
      glue_version: "4.0"
      workers: 2
      worker_type: G.1X
      schema: "schema_name"
      session_provisioning_timeout_in_seconds: 600
      idle_timeout: 10
      location: "s3://bucket-name/silver"
      datalake_formats: iceberg
      default_arguments: "--enable-auto-scaling=true, --enable-metrics=true, --enable-continuous-cloudwatch-log=true, --enable-continuous-log-filter=true, --enable-spark-ui=true, --spark-event-logs-path=s3://bucket-name-logs/dbt-spark-logs/"
      conf: --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.warehouse=s3://bucket-name-tmp/ --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager --conf spark.sql.catalog.glue_catalog.lock.table=tbl_glue_dbt_lock_table  --conf spark.sql.legacy.allowNonEmptyLocationInCTAS=true --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.dev=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED --conf  spark.kryoserializer.buffer.max=1GB --conf spark.sql.parquet.datetimeRebaseModeInWrite=CORRECTED --conf spark.sql.parquet.int96RebaseModeInWrite=CORRECTED
      threads: 2

The model schema sample:

version: 2

models: 
  - name: "my_tag_name"
    description: "Description"
    columns:
      - name: "identifier_column"
        description: "Description_column"
        data_tests:
          - unique
          - not_null
...

Command executed: dbt test --select tag:my_tag_name --target=silver_light

Expected behavior

The tests should run against the Iceberg table. Instead, they fail with the error below, because dbt tries to find the table without the glue_catalog prefix:

AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table my_tag_name. StorageDescriptor#InputFormat cannot be null for table: my_tag_name (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
18:43:01  2 of 2 ERROR unique_my_tag_name_identifier_column ............ [ERROR in 624.84s]

Screenshots and log output

[Screenshot: LogErrorDbtIceberg]

System information

The output of dbt --version:

Core:
  - installed: 1.8.4
  - latest:    1.8.4 - Up to date!

Plugins:
  - glue:  1.8.1 - Up to date!
  - spark: 1.8.0 - Up to date!

The operating system you're using: macOS Sonoma Version 14.3.1

The output of python --version: Python 3.9.6

Additional context

The AWS documentation says that, to access Iceberg tables in Glue with Spark, the database/table name must be prefixed with glue_catalog. (see https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/iceberg-spark.html).

Running the test query taken from the dbt logs as-is fails because the table cannot be found. If the glue_catalog prefix required for Iceberg tables is added to the same query, the data can be read.
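To make the difference concrete, here is a minimal sketch of the two forms of the query, using the schema, model, and column names from the samples above. The query shape is a simplified stand-in for a not_null check and is an assumption; the SQL that dbt actually compiles for the test may differ. The catalog name glue_catalog comes from the spark.sql.catalog.glue_catalog settings in the profile's conf.

```sql
-- Form that appears to be generated by the test (no catalog prefix).
-- The name is resolved through the default session catalog, which is
-- consistent with the "StorageDescriptor#InputFormat cannot be null" error.
select identifier_column
from schema_name.my_tag_name
where identifier_column is null;

-- Same query with the Iceberg catalog prefix.
-- The name is resolved through the Iceberg SparkCatalog configured
-- as glue_catalog, and the table can be read.
select identifier_column
from glue_catalog.schema_name.my_tag_name
where identifier_column is null;
```

The only difference between the two statements is the catalog qualifier in the from clause, which suggests the problem lies in how the relation name is rendered for tests rather than in the test logic itself.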