delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Compaction is not idempotent as claimed #2576

Open echai58 opened 3 weeks ago

echai58 commented 3 weeks ago

Environment

Delta-rs version: 0.17.3

Binding: python


Bug

What happened: In the docs, compact is described as idempotent:

This operation is idempotent; if run twice on the same table (assuming it has not been updated) it will do nothing the second time.

In one of my delta tables, I noticed this is not true. Looking at the optimize algorithm, it's pretty simple: it groups files into bins based on their recorded sizes, packing each bin up to the target_size.

However, the file actually written out for a bin will often be smaller than bin.total_file_size(), because of parquet compression when the merged data is rewritten.
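
To make the gap concrete, here is a minimal sketch of the bin-packing idea (illustrative only, not the delta-rs source; the first-fit-decreasing strategy and the pack_bins name are my assumptions). The point is that the planner only ever sees the recorded sizes of the input files, never the size the merged output will compress down to:

```python
def pack_bins(file_sizes, target_size=104_857_600):
    """Greedy first-fit-decreasing bin packing over *recorded* file sizes.

    Illustrative sketch: each bin is later rewritten as one parquet file,
    and that output is usually smaller than the bin's total because the
    merged data compresses better. The planner never sees those sizes.
    """
    bins = []  # each bin is a list of input file sizes
    for size in sorted(file_sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= target_size:  # fits in an existing bin
                b.append(size)
                break
        else:
            bins.append([size])  # no bin had room: open a new one
    return bins
```

Feeding the first pass's outputs back into a planner like this still yields multi-file bins, which is exactly why a second compaction has work to do.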

I ran into a scenario where the sizes of the bins were the following (with the default target size of 104857600 bytes, i.e. 100 MiB):

[104856888, 104687238, 104754998, 104857489, 104679957, 4207383]

But the actual sizes were:

[61364358, 60037383, 58517127, 56870681, 53391180, 3111870]

which means that you can actually compact it again.
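
For instance, the non-idempotence can be observed by just running compact twice with the Python binding (the table path here is illustrative, and the metrics keys shown are my understanding of the returned dict):

```python
from deltalake import DeltaTable

path = "path/to/table"  # illustrative path to a table like the one above

dt = DeltaTable(path)
first = dt.optimize.compact()   # first pass: packs files into ~target_size bins

dt = DeltaTable(path)           # reload to pick up the new table version
second = dt.optimize.compact()  # should be a no-op if compaction were idempotent

# With idempotent compaction, the second run would report 0 files added/removed.
print(first["numFilesAdded"], first["numFilesRemoved"])
print(second["numFilesAdded"], second["numFilesRemoved"])
```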

What you expected to happen: I don't think this is a trivial problem to solve (you would need to write temporary parquet files to learn the actual output sizes, and then run optimize in a loop based on those actual sizes?), so I would be satisfied with just removing the idempotency claim from the docs.

sherlockbeard commented 23 hours ago

Removing the idempotency claim from the docs would be good. I tried the #2591 example in Spark Delta Lake and got the same result: 4 files after one OPTIMIZE, 2 files after a second OPTIMIZE.
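
For reference, a sketch of that check in PySpark (assuming the delta-spark package is installed; the path is illustrative and the #2591 table contents are not reproduced here):

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("optimize-idempotency-check")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta-table"  # illustrative path to the example table

for run in (1, 2):
    spark.sql(f"OPTIMIZE delta.`{path}`")
    n = spark.sql(f"DESCRIBE DETAIL delta.`{path}`").collect()[0]["numFiles"]
    print(f"after OPTIMIZE #{run}: {n} files")  # reported above: 4, then 2
```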