Visorgood opened 1 month ago
I was investigating this more and tried an approach suggested on this StackOverflow page. In a nutshell, it suggests using a CTE or a temp view over the target table to limit deletions to the needed scope. When I tried this:
```sql
WITH t AS (
    SELECT *
    FROM iceberg_table
    WHERE part_col_1 = 'some_part_value_a'
      AND part_col_2 = 'some_part_value_b'
      AND part_col_3 >= '2024-09-25 00:00:00'
      AND part_col_3 <  '2024-09-25 04:00:00'
)
MERGE INTO t USING new_data_to_upsert s ON ... -- {the rest of the query is the same}
```
I received an error:
```
Error occurred during query planning:
MERGE INTO TABLE is not supported temporarily.
```
cc @aokolnychyi
Any update?
Apache Iceberg version: 1.6.0
Query engine: Spark
Please describe the bug 🐞
The MERGE INTO command is doing a full scan of the Iceberg table, even though the table is well partitioned. The idea behind running this command is idempotent ingestion/re-ingestion of a "4-hour block" of new data into a (hidden) daily partition.
I’m using Spark 3.5.2 and Iceberg 1.6.0.
Table schema is as such:
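(For reference, a simplified sketch rather than the real DDL: the non-partition columns below are placeholders, and the partition spec shown is identity on `part_col_1` and `part_col_2` plus hidden daily partitioning on `part_col_3`, matching the columns used in the predicates in this report.)

```sql
-- Illustrative only: columns other than part_col_1/2/3 are placeholders.
CREATE TABLE iceberg_table (
    id          BIGINT,
    payload     STRING,
    part_col_1  STRING,
    part_col_2  STRING,
    part_col_3  TIMESTAMP
)
USING iceberg
PARTITIONED BY (part_col_1, part_col_2, days(part_col_3));
```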
MERGE INTO command I’m running is as follows:
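(A sketch of the shape of the statement, not the literal query: the match key `id` and the column handling are placeholders, and the partition predicates shown are one of the variants I've tried.)

```sql
-- Sketch of the statement shape; `id` is a placeholder match key.
MERGE INTO iceberg_table t
USING new_data_to_upsert s
ON  t.id = s.id
    AND t.part_col_1 = 'some_part_value_a'
    AND t.part_col_2 = 'some_part_value_b'
    AND t.part_col_3 >= '2024-09-25 00:00:00'
    AND t.part_col_3 <  '2024-09-25 04:00:00'
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
WHEN NOT MATCHED BY SOURCE
    AND t.part_col_1 = 'some_part_value_a'
    AND t.part_col_2 = 'some_part_value_b'
    AND t.part_col_3 >= '2024-09-25 00:00:00'
    AND t.part_col_3 <  '2024-09-25 04:00:00'
THEN DELETE
```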
View `new_data_to_upsert` has the same columns as the table, and all its records have `part_col_1` equal to `some_part_value_a`, `part_col_2` equal to `some_part_value_b`, and `part_col_3` in the range `[2024-09-25 00:00:00; 2024-09-25 04:00:00)`. I also `.persist()` it before calling `.createOrReplaceTempView("new_data_to_upsert")`.
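(For context, a rough SQL equivalent of how the source view is scoped; in reality it is built with the DataFrame API, and the upstream source name below is a placeholder.)

```sql
-- Rough SQL equivalent of the DataFrame-API preparation described above.
CREATE OR REPLACE TEMPORARY VIEW new_data_to_upsert AS
SELECT *
FROM some_staging_source          -- placeholder for the actual upstream data
WHERE part_col_1 = 'some_part_value_a'
  AND part_col_2 = 'some_part_value_b'
  AND part_col_3 >= '2024-09-25 00:00:00'
  AND part_col_3 <  '2024-09-25 04:00:00';

-- Equivalent of .persist(): materialize the view before running the MERGE.
CACHE TABLE new_data_to_upsert;
```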
Query plan shows the following for the `iceberg_table` scan:

As far as I understand, `filters=true` is the problem, i.e. there is no predicate push-down. When I try a regular SELECT or DELETE, it shows something like:

If I remove the `WHEN NOT MATCHED BY SOURCE … THEN DELETE` clause from the query, the plan is different, and the BatchScan steps on the Iceberg table do have `filters` specified. No matter what I've tried with the predicate in this clause, it never resulted in predicate push-down.

Maybe this is related to this issue: https://github.com/apache/iceberg/issues/10108
Willingness to contribute