aws-samples / dbt-glue

This repository contains the dbt-glue adapter
Apache License 2.0

Support for Iceberg Partition Transforms #333

Open yaroslav-ost opened 5 months ago

yaroslav-ost commented 5 months ago

Hi guys,

It seems we are not able to use Iceberg Partition Transforms inside dbt models.

Table Config:

    {{ config(
        materialized='incremental',
        incremental_strategy='append',
        file_format='iceberg',
        partition_by=['months(my_date)']
    ) }}

Output:

    CREATE TABLE glue_catalog.test.test_table
    USING iceberg
    PARTITIONED BY (months(my_date))
    LOCATION 's3://sandbox-source/client/my/test/test_table'
    AS SELECT * FROM tmp_test_table
    ORDER BY (months(my_date))

Error: AnalysisException: Undefined function: months. This function is neither a built-in/temporary function, nor a persistent function that is qualified as spark_catalog.default.months.

I guess the ORDER BY is the root cause, since Iceberg partition-transform expressions like this aren't supported in an ORDER BY clause. https://github.com/aws-samples/dbt-glue/blob/a52fcc49f93fcb3405a60a0cd627ca8d91726749/dbt/adapters/glue/impl.py#L874-L882
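One possible workaround along these lines (a sketch of an assumption, not how the adapter currently behaves): keep the transform in PARTITIONED BY but reduce it to its source column when building the ORDER BY clause. The `order_by_expr` and `build_ddl` helpers below are hypothetical names for illustration.

```python
import re

# Iceberg partition transforms that wrap a source column, per the Spark DDL
# docs: years/months/days/hours take a column; bucket/truncate take a
# numeric argument first (e.g. bucket(16, user_id)).
TRANSFORM_RE = re.compile(
    r"^(years|months|days|hours|bucket|truncate)\((?:\d+\s*,\s*)?(\w+)\)$"
)

def order_by_expr(partition_expr: str) -> str:
    """Return an expression safe for ORDER BY: a partition transform such as
    months(my_date) is reduced to its source column, since the transform
    function is not resolvable outside PARTITIONED BY."""
    m = TRANSFORM_RE.match(partition_expr.strip())
    return m.group(2) if m else partition_expr

def build_ddl(table: str, location: str, partitions: list, source: str) -> str:
    """Assemble a CTAS statement with transforms only in PARTITIONED BY."""
    order_cols = ", ".join(order_by_expr(p) for p in partitions)
    return (
        f"CREATE TABLE {table} USING iceberg "
        f"PARTITIONED BY ({', '.join(partitions)}) "
        f"LOCATION '{location}' "
        f"AS SELECT * FROM {source} ORDER BY ({order_cols})"
    )
```

Sorting by the raw column still clusters rows consistently with a time-based transform, which is usually what the ORDER BY is there for.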

aiss93 commented 4 months ago

Hi, I think the function you want to use is "month" not "months".

yaroslav-ost commented 4 months ago

Hi @aiss93,

It must be "months". See https://iceberg.apache.org/docs/latest/spark-ddl/#partitioned-by:~:text=Supported%20transformations%20are%3A

aiss93 commented 4 months ago

These names were used in older versions of Iceberg, so check your Iceberg version. The error message you're getting says that the function you're using wasn't found in the catalog. I tried with "month" and got the error: Transform is not supported.

eshetben commented 1 month ago

from my experience, it seems the problem lies with pyspark -

  1. there's the hour() function, used for extracting the hour component - that one can be used within SQL queries, including the ORDER BY clause. https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.functions.hour.html
  2. then there's the hours() function, used solely as a partition transform - obviously it should only appear in PARTITIONED BY. https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.functions.hours.html

dbt-glue does not support this distinction, since it uses the same expression in both the ORDER BY and PARTITIONED BY clauses.
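The distinction above could be handled by mapping each plural transform to its singular SQL-legal counterpart when emitting the ORDER BY. A minimal sketch, assuming the plural/singular name pairs from the pyspark docs linked above; the helper name and the fallback behavior are this sketch's assumptions.

```python
import re

# Plural Iceberg partition transforms vs. the singular Spark SQL functions
# that are legal in ordinary queries (hour vs. hours, etc.).
SQL_EQUIVALENT = {"years": "year", "months": "month", "days": "day", "hours": "hour"}

def to_sql_expr(transform_expr: str) -> str:
    """Rewrite a transform like hours(ts) into the SQL-resolvable hour(ts).
    Transforms with no SQL counterpart (bucket, truncate) are returned
    unchanged - a fallback assumed here for illustration."""
    m = re.match(r"^(\w+)\((\w+)\)$", transform_expr.strip())
    if m and m.group(1) in SQL_EQUIVALENT:
        return f"{SQL_EQUIVALENT[m.group(1)]}({m.group(2)})"
    return transform_expr
```

Note the two are not semantically identical: month(col) extracts only the month component, while the months transform counts months since epoch, so the resulting sort orders differ across years. The sketch only shows which function names Spark SQL will actually resolve.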