aws-samples / dbt-glue

This repository contains the dbt-glue adapter
Apache License 2.0
97 stars 68 forks source link

Add missing features such as merge_exclude_columns and incremental_predicates #457

Open aiss93 opened 2 weeks ago

aiss93 commented 2 weeks ago

Describe the feature

The current implementation of the adapter generates PySpark code with the corresponding queries in the impl.py file. I believe it would be better to handle the compiled SQL generation at the macro level. Generating PySpark code in impl.py has the following drawbacks:

Additionally, we force users to use the glue_catalog namespace to query Iceberg tables. However, there is a solution to avoid this.

Finally, the following features are missing in the current catalog but can be easily implemented if we move SQL generation to the macro level:

Describe alternatives you've considered

By handling the SQL generation at the macro level, we can benefit from the following:

We can use the SparkSessionCatalog as an implementation of the Spark catalog. More information can be found in the following link Iceberg Catalog configuration. I tested the following configruation and it worked well :

--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.warehouse=s3://al-gdo-dev-ww-dl-0139-transfo/data
--conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

One thing to note about this configuration, it makes CTAS and RTAS operations non atomic

Are you interested in contributing this feature?

I have already implemented this on my side. I can create a PR and let you check it @moomindani if it's okey for you.

moomindani commented 1 week ago

Thank you for your suggestion. I like this idea. We will need to carefully test that we don't break existing workload. It will require multiple test cases in unit test and integration test. Could you ping me once PR is ready?

aiss93 commented 1 week ago

I believe most of the issues we're facing with the adapter stem from the differences between Hive and Iceberg tables, and how we handle them differently. In my opinion, we should minimize the amount of custom Iceberg code by leveraging its latest releases. The current version of AWS Glue includes Iceberg 1.0.0, but it might be better to provide users with the necessary documentation to use more advanced versions of Iceberg.

Using advanced versions of Iceberg offers the following advantages:

  1. We can use the SparkSessionCatalog implementation for the spark_catalog. This would allow us to eliminate the need for glue_catalog whenever we're dealing with Iceberg tables.
  2. The code will become much simpler, as we won’t have to manage as many Iceberg-specific cases as we do now.
  3. Users can use partitionning transform functions (year, month, day) which are not supported by the 1.0 iceberg version.

This will be a significant change in the adapter implementation and will break some existing workloads (particularly those using Iceberg). We should plan to release this as a major update.

moomindani commented 1 week ago

It's a separate discussion. Please create another Issue if you want to discuss that.

As you know, although each Glue version has corresponding built-in Iceberg from Glue version 3.0, customer can manage the version by introducing the Iceberg JAR through --extra-jars parameter instead of --datalake-formats. Even for supporting that use case, we still need to be extremely careful not to break compatibility, not to break existing workload.

aiss93 commented 1 week ago

Hi @moomindani The PR linked to this issu was tested for :

If it's good on your side, I can add some units tests as well as some documentation.

moomindani commented 1 week ago

Thanks. Yes let's add enough unit test cases and also integration test cases too to cover all major access patterns. The PR looks huge, it needs to be tested extremely carefully in order not to break existing customer workload.

aiss93 commented 1 week ago

I added some test cases and updated the changelog file.

moomindani commented 1 week ago

@aiss93 Thank you for updating the PR. I have added comments.

kyodjinn7 commented 1 day ago

I hope this feature will be release soon. Most of functionnality of dbt are currently unusable with dbt glue when you use Iceberg

aiss93 commented 12 hours ago

Hi @moomindani I added function tests for iceberg table format. Can you please check it ?