Thank you for your suggestion. I like this idea. We will need to test carefully that we don't break existing workloads, which will require multiple unit test and integration test cases. Could you ping me once the PR is ready?
I believe most of the issues we're facing with the adapter stem from the differences between Hive and Iceberg tables, and from how we handle them differently. In my opinion, we should minimize the amount of custom Iceberg code by leveraging Iceberg's latest releases. The current version of AWS Glue includes Iceberg 1.0.0, but it might be better to give users the documentation they need to run more recent Iceberg versions.
Using a more recent Iceberg version offers the following advantages:
This will be a significant change in the adapter implementation and will break some existing workloads (particularly those using Iceberg). We should plan to release this as a major update.
It's a separate discussion. Please create another Issue if you want to discuss that.
As you know, although each Glue version since Glue 3.0 ships a corresponding built-in Iceberg version, customers can manage the version themselves by providing the Iceberg JAR through the --extra-jars parameter instead of --datalake-formats.
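For illustration, a minimal sketch of how a job could pin its own Iceberg release with boto3; the job name, role, script location, and JAR path/version below are placeholders, not recommendations:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Sketch: pin a specific Iceberg release by shipping the runtime JAR ourselves
# via --extra-jars instead of relying on the Glue built-in (--datalake-formats).
# All names, paths, and versions below are placeholders.
glue.create_job(
    Name="dbt-glue-iceberg-job",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder role
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/job.py"},
    GlueVersion="4.0",
    DefaultArguments={
        "--extra-jars": "s3://my-bucket/jars/iceberg-spark-runtime-3.3_2.12-1.4.3.jar",
        # Note: --datalake-formats is intentionally NOT set here, so the
        # built-in Iceberg JARs are not loaded alongside the custom one.
    },
)
```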
Even to support that use case, we still need to be extremely careful not to break compatibility or existing workloads.
Hi @moomindani The PR linked to this issue was tested for:
If it's good on your side, I can add some unit tests as well as some documentation.
Thanks. Yes, let's add enough unit test cases and integration test cases to cover all major access patterns. The PR looks huge; it needs to be tested extremely carefully so as not to break existing customer workloads.
I added some test cases and updated the changelog file.
@aiss93 Thank you for updating the PR. I have added comments.
I hope this feature will be released soon. Most of dbt's functionality is currently unusable with dbt-glue when you use Iceberg.
Hi @moomindani I added functional tests for the Iceberg table format. Can you please check them?
Describe the feature
The current implementation of the adapter generates PySpark code with the corresponding queries in the impl.py file. I believe it would be better to handle the compiled SQL generation at the macro level. Generating PySpark code in impl.py has the following drawbacks:
Additionally, we force users to use the glue_catalog namespace to query Iceberg tables. However, there is a solution to avoid this.
Finally, the following features are missing in the current catalog but can be easily implemented if we move SQL generation to the macro level:
Describe alternatives you've considered
By handling the SQL generation at the macro level, we can benefit from the following:
We can use the SparkSessionCatalog as the implementation of the Spark catalog. More information can be found at the following link: Iceberg Catalog configuration. I tested the following configuration and it worked well:
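For illustration, a minimal sketch of such a SparkSessionCatalog setup based on the Iceberg catalog configuration docs; the settings here are assumptions drawn from those docs, not necessarily the exact configuration tested above:

```python
from pyspark.sql import SparkSession

# Sketch of a SparkSessionCatalog setup, following the Iceberg catalog docs.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Replace Spark's built-in session catalog so it can resolve both Iceberg
    # and non-Iceberg (Hive/Glue) tables under the same namespace.
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .getOrCreate()
)
```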
One thing to note about this configuration: it makes CTAS and RTAS operations non-atomic.
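With spark_catalog overridden this way, Iceberg tables can in principle be referenced without the glue_catalog prefix; a small hypothetical usage example (table names are placeholders):

```python
# An Iceberg table registered in the Glue Data Catalog can be queried through
# the default namespace, with no glue_catalog prefix.
spark.sql("SELECT * FROM my_db.my_table").show()

# The same session can still resolve non-Iceberg (Hive-style) tables, since
# SparkSessionCatalog delegates those to Spark's built-in catalog.
spark.sql("SELECT COUNT(*) FROM my_db.legacy_hive_table").show()
```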
Are you interested in contributing this feature?
I have already implemented this on my side. I can create a PR and let you check it, @moomindani, if that's okay with you.