delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Rethink exposing partition_filters as part of the public facing API #1894

Open MrPowers opened 10 months ago

MrPowers commented 10 months ago

Description

We're currently exposing partition_filters as part of the public-facing API for some methods.

For example, compact() has an optional partition_filters argument.

Let's compare this with the PySpark API:

deltaTable.optimize().where("date='2021-11-18'").executeCompaction()

I think the PySpark API is a lot better from a usability perspective because the user doesn't need to know about the underlying partitioning of the data.

I think the user should be able to specify what data they would like to be compacted. Delta Lake should be smart enough to determine if that means compacting the files in a given partition or running a filtering query and determining the files that need compaction.
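
To make that decision concrete, here is a minimal sketch of the routing logic. It is not delta-rs code. It assumes the predicate has already been parsed into simple (column, operator, value) conditions joined by AND, and `plan_compaction`, its strategy labels, and its return shape are hypothetical names made up for illustration.

```python
from typing import Any, List, Tuple

# One condition of an already-parsed predicate, e.g. ("date", "=", "2021-11-18").
Condition = Tuple[str, str, Any]


def plan_compaction(
    conditions: List[Condition], partition_columns: List[str]
) -> Tuple[str, List[Condition], List[Condition]]:
    """Hypothetical helper: decide how a user predicate selects files.

    Returns (strategy, partition_conditions, data_conditions).
    """
    partition_cols = set(partition_columns)
    on_partitions = [c for c in conditions if c[0] in partition_cols]
    on_data = [c for c in conditions if c[0] not in partition_cols]

    if not on_data:
        # Every condition targets a partition column, so partition
        # pruning alone identifies the files to compact.
        strategy = "partition_pruning"
    else:
        # At least one condition targets a regular column, so file
        # selection needs file statistics or a filtering query.
        strategy = "filter_files"
    return strategy, on_partitions, on_data


# Same predicate, two table layouts; the caller never mentions partitioning.
predicate = [("date", "=", "2021-11-18")]
print(plan_compaction(predicate, partition_columns=["date"]))
print(plan_compaction(predicate, partition_columns=["country"]))
```

The point of the sketch is that the caller writes the same predicate in both cases, and only the library needs to know which columns are partition columns.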

ion-elgreco commented 10 months ago

Agreed, users should not have to think about the partitioning structure. I think the partition filter for the pyarrow writer was mainly there because pyarrow was used. With MERGE we use DataFusion, and there we pass predicates properly.

Also, I think it's more Pythonic to have an optional parameter called predicate instead of another method. We also do that in TableMerger. In the new Rust engine binding I am also exposing a predicate parameter, but only as a string input.
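
As a rough illustration of the API shape I mean, here is a toy stand-in. `ToyOptimizer` is a hypothetical class, not the deltalake API, and it only returns a description of what it would do.

```python
from typing import Optional


class ToyOptimizer:
    """Toy stand-in for illustration only; not the deltalake API."""

    def compact(self, predicate: Optional[str] = None) -> str:
        # With no predicate the whole table is in scope; with a string
        # predicate only the matching data is in scope.
        target = "all files" if predicate is None else f"files matching {predicate}"
        return f"compact {target}"


optimizer = ToyOptimizer()
print(optimizer.compact())
print(optimizer.compact(predicate="date = '2021-11-18'"))
```

Compared with a builder-style where(...) call, the optional keyword argument keeps the unfiltered case a plain compact() call.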

@MrPowers, I do wonder: does the optimize operation work if you pass a predicate that is not based on the partitioning structure?