aws-samples / dbt-glue

This repository contains the dbt-glue adapter
Apache License 2.0
101 stars 69 forks source link

remove update_iceberg_ts from iceberg table when using merge incremental strategy #453 #454

Closed aiss93 closed 4 weeks ago

aiss93 commented 1 month ago

resolves #452

Description

In the current implementation, when using the merge incremental strategy with the Iceberg file format, the adapter automatically adds a column update_iceberg_ts, which is not utilized in the merge process. This causes issues such as breaking schema comparisons between tables in certain scenarios, making the hard-coding of this column restrictive.

This PR removes the hard-coded update_iceberg_ts column from the adapter. If users require this column for specific use cases, they can now add it directly within their model configuration.

Checklist

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

moomindani commented 1 month ago

Hi, thanks for your contribution.

Since this is breaking change, in case users are already using update_iceberg_ts column, dbt-glue version upgrade impacts their workload. Is it possible to think backward-compatible way? I usually prefer safer approach to add config to remove that column.

aiss93 commented 1 month ago

Hi @moomindani, I added a configuration parameter to control this behavior. By default, it will add the update_iceberg_ts column to ensure compatibility with existing workloads and avoid any disruptions.

Additionally, I discovered some hardcoded Spark configurations, such as: {% call statement() %} set spark.sql.autoBroadcastJoinThreshold=-1 {% endcall %} . This configuration will disable broadcasted joins by default, which can negatively impact performance in some cases.

I believe this should be removed, as such Spark configurations should be left for the user to set according to their specific use case and environment.

moomindani commented 1 month ago

Yes, disabling spark.sql.autoBroadcastJoinThreshold can introduce performance bottleneck. It will be better to override the config when spark.sql.autoBroadcastJoinThreshold is set in --conf parameter, or remove that option for future Glue version. But again, we need to be careful not to impact existing workload. And this needs to be discussed in a separate PR/issue.

moomindani commented 1 month ago

Can we add test case for this PR?