Delta Lake Partition pruning not working in merge command

delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

https://delta.io

Apache License 2.0


Delta Lake Partition pruning not working in merge command #668

Open fanaticjo opened 3 years ago

fanaticjo commented 3 years ago

Partition pruning is not working for the Delta Lake MERGE command. In the explain plan I don't see any partition filters being applied; the partition predicate shows up as an ordinary filter instead.

But when I do a simple Delta read, I can see the partition filters.

I'm using Delta Lake version 8 with Spark 3 on EMR 6.2.0.
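For anyone comparing the two cases, here is a minimal sketch of one way to check a plan for partition filters programmatically. It assumes a PySpark 3.x DataFrame whose `explain()` prints to stdout and accepts a `mode` keyword. The helper names `capture_explain` and `partition_filters` are hypothetical, as is `table_path` in the usage comment.

```python
import contextlib
import io
import re


def capture_explain(df, mode="formatted"):
    """Return the text that df.explain() prints.

    Assumption: `df` is a PySpark 3.x DataFrame, whose explain()
    prints the plan to stdout and accepts a `mode` keyword.
    """
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain(mode=mode)
    return buf.getvalue()


def partition_filters(plan_text):
    """Return every non-empty PartitionFilters list found in a plan string."""
    found = re.findall(r"PartitionFilters: \[([^\]]*)\]", plan_text)
    return [f.strip() for f in found if f.strip()]


# Hypothetical usage against a Delta table at `table_path`:
# df = (spark.read.format("delta").load(table_path)
#           .where("part_key = '2020-12-01'"))
# print(partition_filters(capture_explain(df)))
```

An empty result means the scan reported no partition filters, which is the symptom described here. This only inspects a DataFrame read; it does not capture the plan of a MERGE.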

Please help

dennyglee commented 3 years ago

This most likely has to do with the specifics of your query vs. MERGE. Can you provide additional information so it's possible to recreate this?

tdas commented 3 years ago

Could you share your MERGE query so that we can see what the merge condition is?

fanaticjo commented 3 years ago

The table is partitioned by date. S3 listing:

    s3://bucketname/delta/tablename/part_key=2020-12-01
    s3://bucketname/delta/tablename/part_key=2020-12-02

Query for merge

Case 1

    deltaTable.alias("old") \
        .merge(
            df.alias("new"),
            f"old.part_key in ('2020-12-01') and old.id = new.id") \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll().execute()

Didn't work.

Case 2

    spark.sql(f"""
        merge into source using new
        on source in ('2020-12-01') and source.id = new.id
        when matched then update set *
        when not matched then insert *
    """)

Didn't work.

Case 3

    spark.sql(f"""
        merge into source using new
        on source in (to_date('2020-12-01','yyyy-MM-dd')) and source.id = new.id
        when matched then update set *
        when not matched then insert *
    """)
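To make the cases above easier to vary while testing, here is a small sketch that assembles a Case 1 style condition from a list of partition values, for example the distinct dates present in the incoming batch. The function name `build_merge_condition` and its parameters are hypothetical. Nothing in this thread establishes that generating the condition this way changes the pruning behavior on the versions mentioned.

```python
def build_merge_condition(partition_col, partition_values, key_cols,
                          target="old", source="new"):
    """Build a merge condition with literal partition values.

    Hypothetical helper: it only formats a string in the same shape
    as the Case 1 condition; it does not talk to Spark or Delta.
    """
    if not partition_values:
        raise ValueError("at least one partition value is required")
    if not key_cols:
        raise ValueError("at least one key column is required")

    # De-duplicate and sort so the generated condition is deterministic.
    values = sorted({str(v) for v in partition_values})
    # Escape single quotes so each value stays a valid SQL string literal.
    literals = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
    keys = " and ".join(f"{target}.{k} = {source}.{k}" for k in key_cols)
    return f"{target}.{partition_col} in ({literals}) and {keys}"


print(build_merge_condition("part_key", ["2020-12-01"], ["id"]))
```

The returned string can be passed as the condition argument of `deltaTable.alias("old").merge(df.alias("new"), ...)` in place of the hardcoded f-string.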

fanaticjo commented 3 years ago

> This most likely has to do with the specifics of your query vs. MERGE. Can you provide additional information so it's possible to recreate this?

Any updates?

dennyglee commented 3 years ago

Thanks for your patience @fanaticjo - looking at the above queries, there isn't anything obviously wrong. Could you provide a dataset so we can recreate this? Could you also test with Spark 3.1.x / Delta Lake 1.0.x to see whether the behavior is any different?
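A reproduction along the lines requested could look like the sketch below. It is built on assumptions: the column names `id` and `value`, the synthetic rows, and the helpers `make_rows` and `run_repro` are all hypothetical, and `spark` and `path` must be supplied by the caller (a SparkSession with Delta Lake configured and a scratch location). Only `part_key` and the two dates come from the thread.

```python
import datetime


def make_rows(days, ids_per_day, tag):
    """Build (id, value, part_key) tuples, `ids_per_day` rows per date."""
    return [
        (i, f"{tag}-{i}", day)
        for day in days
        for i in range(ids_per_day)
    ]


def run_repro(spark, path):
    """Write a small partitioned Delta table, then MERGE into one partition.

    Assumptions: `spark` is a SparkSession with Delta Lake configured and
    `path` is a scratch location that may be overwritten.
    """
    # Imported lazily so the data helper above works without Delta installed.
    from delta.tables import DeltaTable

    schema = "id int, value string, part_key date"
    d1 = datetime.date(2020, 12, 1)
    d2 = datetime.date(2020, 12, 2)

    # Target table: two date partitions, five rows each.
    base = spark.createDataFrame(make_rows([d1, d2], 5, "base"), schema)
    (base.write.format("delta")
         .partitionBy("part_key")
         .mode("overwrite")
         .save(path))

    # Source batch: rows for the first partition only (5 updates, 3 inserts).
    updates = spark.createDataFrame(make_rows([d1], 8, "new"), schema)
    (DeltaTable.forPath(spark, path).alias("old")
        .merge(updates.alias("new"),
               "old.part_key in ('2020-12-01') and old.id = new.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
```

Running `run_repro` on both version combinations and comparing the scans recorded for the MERGE in the Spark UI would show whether the second partition is read.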