Add missing features such as merge_exclude_columns and incremental_predicates

aiss93 commented 1 month ago

Describe the feature

The current implementation of the adapter generates PySpark code with the corresponding queries in the impl.py file. I believe it would be better to handle the compiled SQL generation at the macro level. Generating PySpark code in impl.py has the following drawbacks:

We do not utilize dbt's built-in compilation layer.
The impl.py file becomes difficult to read and debug.
The code found in the dbt/target/ folder does not match what was actually executed in the Spark session.

Additionally, we force users to use the glue_catalog namespace to query Iceberg tables. However, there is a solution to avoid this.

Finally, the following features are missing from the current adapter but can be easily implemented if we move SQL generation to the macro level:

incremental_predicates
merge_exclude_columns
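
To make the intent of these two options concrete, here is a rough Python sketch of how they could shape a generated MERGE statement. It is only an illustration under stated assumptions, not the adapter's actual code: the function name `build_merge_sql` and its parameters are hypothetical, and in the proposal this logic would live in a Jinja macro rather than in Python. The `DBT_INTERNAL_SOURCE` / `DBT_INTERNAL_DEST` aliases follow the naming used by dbt's default merge macro.

```python
from typing import Optional, Sequence


def build_merge_sql(
    target: str,
    source: str,
    unique_key: str,
    columns: Sequence[str],
    merge_exclude_columns: Sequence[str] = (),
    incremental_predicates: Optional[Sequence[str]] = None,
) -> str:
    """Render a MERGE statement (hypothetical helper, for illustration only)."""
    # Columns listed in merge_exclude_columns are left untouched on update.
    excluded = {c.lower() for c in merge_exclude_columns}
    update_columns = [c for c in columns if c.lower() not in excluded]
    if not update_columns:
        raise ValueError("merge_exclude_columns removed every column from the update")

    # The join always matches on the unique key; incremental_predicates are
    # ANDed onto it so the engine can prune the target table scan.
    conditions = [
        f"DBT_INTERNAL_SOURCE.{unique_key} = DBT_INTERNAL_DEST.{unique_key}"
    ]
    conditions.extend(incremental_predicates or [])

    set_clause = ",\n    ".join(
        f"{c} = DBT_INTERNAL_SOURCE.{c}" for c in update_columns
    )
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"DBT_INTERNAL_SOURCE.{c}" for c in columns)

    return (
        f"MERGE INTO {target} AS DBT_INTERNAL_DEST\n"
        f"USING {source} AS DBT_INTERNAL_SOURCE\n"
        f"ON {' AND '.join(conditions)}\n"
        f"WHEN MATCHED THEN UPDATE SET\n    {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols})\n"
        f"VALUES ({insert_vals})"
    )


if __name__ == "__main__":
    # Table, column, and predicate values below are made up for the example.
    print(
        build_merge_sql(
            target="db.orders",
            source="orders_tmp",
            unique_key="id",
            columns=["id", "status", "created_at"],
            merge_exclude_columns=["created_at"],
            incremental_predicates=["DBT_INTERNAL_DEST.ds >= '2024-01-01'"],
        )
    )
```

With the statement rendered as text like this, what lands in the target/ folder would be the same SQL that is sent to the Spark session.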

Describe alternatives you've considered

By handling the SQL generation at the macro level, we can benefit from the following:

The compiled code in the dbt/target/ folder will match what was actually run in the Spark session.
SQL templating and generation can be fully managed at the macro level, making the impl.py file more readable and maintainable.

We can use the SparkSessionCatalog as the implementation of the Spark catalog. More information can be found in the Iceberg Catalog configuration documentation. I tested the following configuration and it worked well:

--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.warehouse=s3://al-gdo-dev-ww-dl-0139-transfo/data
--conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

One thing to note about this configuration: it makes CTAS and RTAS operations non-atomic.
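
For anyone who wants to reuse these settings with a different bucket, the small Python sketch below assembles the same five properties and renders them as `--conf` arguments. The helper names (`iceberg_session_catalog_conf`, `to_conf_args`) and the example warehouse path are hypothetical; only the property keys and class names come from the configuration above.

```python
from typing import Dict, List


def iceberg_session_catalog_conf(warehouse: str) -> Dict[str, str]:
    """Return the Spark properties from the configuration above.

    Hypothetical helper: only the warehouse location is parameterised.
    """
    prefix = "spark.sql.catalog.spark_catalog"
    return {
        # Replace the built-in session catalog with Iceberg's wrapper.
        prefix: "org.apache.iceberg.spark.SparkSessionCatalog",
        f"{prefix}.warehouse": warehouse,
        # Back the catalog with the Glue Data Catalog and S3 file IO.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
    }


def to_conf_args(conf: Dict[str, str]) -> List[str]:
    """Render each property as a '--conf key=value' argument."""
    return [f"--conf {key}={value}" for key, value in conf.items()]


if __name__ == "__main__":
    # The bucket name is a placeholder, not a real location.
    for arg in to_conf_args(iceberg_session_catalog_conf("s3://my-bucket/warehouse")):
        print(arg)
```

The same dictionary could also be applied key by key through `SparkSession.builder.config(key, value)` when the session is created in code instead of through job parameters.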

Are you interested in contributing this feature?

I have already implemented this on my side. I can create a PR for you to review, @moomindani, if that's OK with you.

moomindani commented 1 month ago

Thank you for your suggestion. I like this idea. We will need to carefully test that we don't break existing workloads. It will require multiple test cases in both unit and integration tests. Could you ping me once the PR is ready?

aiss93 commented 1 month ago

I believe most of the issues we're facing with the adapter stem from the differences between Hive and Iceberg tables, and how we handle them differently. In my opinion, we should minimize the amount of custom Iceberg code by leveraging its latest releases. The current version of AWS Glue includes Iceberg 1.0.0, but it might be better to provide users with the necessary documentation to use newer versions of Iceberg.

Using newer versions of Iceberg offers the following advantages:

We can use the SparkSessionCatalog implementation for the spark_catalog. This would allow us to eliminate the need for glue_catalog whenever we're dealing with Iceberg tables.
The code will become much simpler, as we won’t have to manage as many Iceberg-specific cases as we do now.
Users can use partitioning transform functions (year, month, day), which are not supported by Iceberg 1.0.

This will be a significant change in the adapter implementation and will break some existing workloads (particularly those using Iceberg). We should plan to release this as a major update.

moomindani commented 1 month ago

That's a separate discussion. Please create another issue if you want to discuss it.

As you know, although each Glue version from 3.0 onward ships with a corresponding built-in Iceberg version, customers can manage the version themselves by supplying the Iceberg JAR through the --extra-jars parameter instead of using --datalake-formats. Even to support that use case, we still need to be extremely careful not to break compatibility or existing workloads.

aiss93 commented 1 month ago

Hi @moomindani, the PR linked to this issue was tested on:

Glue 3.0 using the default Iceberg version.
Glue 4.0 using the default Iceberg version and a custom Iceberg version.

If it looks good on your side, I can add some unit tests as well as some documentation.

moomindani commented 1 month ago

Thanks. Yes, let's add enough unit test cases, and integration test cases too, to cover all major access patterns. The PR looks huge, so it needs to be tested extremely carefully in order not to break existing customer workloads.

aiss93 commented 1 month ago

I added some test cases and updated the changelog file.

moomindani commented 1 month ago

@aiss93 Thank you for updating the PR. I have added comments.

The coverage of the additional test cases is not enough to cover the changes added in the PR.
We are missing integration tests for the PR. We need to run the full coverage test with and without the changes, and verify that the PR does not change the behavior for Iceberg access patterns through dbt-glue.

kyodjinn7 commented 1 month ago

I hope this feature will be released soon. Most dbt functionality is currently unusable with dbt-glue when you use Iceberg.

aiss93 commented 1 month ago

Hi @moomindani, I added functional tests for the Iceberg table format. Could you please check them?
