delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
1.98k stars 365 forks source link

Large Types breaks merge predicate pruning #2632

Open echai58 opened 5 days ago

echai58 commented 5 days ago

Environment

Delta-rs version: 0.18.1

Binding: python


Bug

What happened: The default behavior of large_dtypes=True on merge() causes partition pruning to not work correctly for strings. This is my understanding of why this happens:

large_dtypes=True causes the source table strings to be casted to LargeUTF8. Thus, when datafusion optimizer runs type_coercion step on the physical plan, it goes from:

TableScan: t, partial_filters=[LargeUtf8("a") = p]

to

TableScan: t, partial_filters=[LargeUtf8("a") = CAST(p AS LargeUtf8)]

Then, datafusions pruning does not support casts of non numeric types: https://github.com/apache/datafusion/blob/35.0.0/datafusion/core/src/physical_optimizer/pruning.rs#L813-L832

This means that the pruning predicate is not actually pruning any files. This is evident via the ParquetExec list of files, which shows all files, regardless of if they match the partition or not.

If you set large_dtypes=False, you see the following after the type_coercion optimization step:

TableScan: t, partial_filters=[CAST(Utf8("a") AS Dictionary(UInt16, Utf8)) = p]

Because there is no cast on the right-hand-side, the partition pruning works, and the ParquetExec only has the files from the a partition.

This also seems to be inadvertent an side-effect of the following change in #2326 cc @Blajda https://github.com/delta-io/delta-rs/pull/2326/files#diff-12f59fe3c4440b7ae4ee1a5ac810b42c1d7357c246aae7b5770e840e52d3ec52L1036-R1039

Before, the extra Filter means that even though the ParquetExec had all files, we could still filter out row groups based on the metadata. However, without this, we have to actually load in all the data, which is the symptom I experienced - this renders https://github.com/delta-io/delta-rs/pull/1958 ineffective.

The correct resolution seems to be to add support in DataFusion's pruning for comparing between strings and large strings.

For my use case, setting large_dtypes=False seems to be a workaround.