delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Compaction is not idempotent as claimed #2576

Open echai58 opened 3 weeks ago

echai58 commented 3 weeks ago

Environment

Delta-rs version: 0.17.3

Binding: python


Bug

What happened: In the docs, compact is described as idempotent:

This operation is idempotent; if run twice on the same table (assuming it has not been updated) it will do nothing the second time.

In one of my delta tables, I noticed this is not true. Looking at the optimize algorithm, it's pretty simple: it groups files into bins based on their recorded sizes, packing each bin up to the target_size.

However, the file actually written out for a bin will often be smaller than bin.total_file_size(), because of parquet compression when the merged data is rewritten.
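
To make the gap concrete, here is a minimal sketch of the bin-packing idea (illustrative only, not the delta-rs source; the first-fit-decreasing strategy and the pack_bins name are my assumptions). The point is that the planner only ever sees the recorded sizes of the input files, never the size the merged output will compress down to:

```python
def pack_bins(file_sizes, target_size=104_857_600):
    """Greedy first-fit-decreasing bin packing over *recorded* file sizes.

    Illustrative sketch: each bin is later rewritten as one parquet file,
    and that output is usually smaller than the bin's total because the
    merged data compresses better. The planner never sees those sizes.
    """
    bins = []  # each bin is a list of input file sizes
    for size in sorted(file_sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= target_size:  # fits in an existing bin
                b.append(size)
                break
        else:
            bins.append([size])  # no bin had room: open a new one
    return bins
```

Feeding the first pass's outputs back into a planner like this still yields multi-file bins, which is exactly why a second compaction has work to do.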

I ran into a scenario where the sizes of the bins were the following (with the default target size of 104857600 bytes, i.e. 100 MiB):

[104856888, 104687238, 104754998, 104857489, 104679957, 4207383]

But the actual sizes were:

[61364358, 60037383, 58517127, 56870681, 53391180, 3111870]

which means that you can actually compact it again.
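
For instance, the non-idempotence can be observed by just running compact twice with the Python binding (the table path here is illustrative, and the metrics keys shown are my understanding of the returned dict):

```python
from deltalake import DeltaTable

path = "path/to/table"  # illustrative path to a table like the one above

dt = DeltaTable(path)
first = dt.optimize.compact()   # first pass: packs files into ~target_size bins

dt = DeltaTable(path)           # reload to pick up the new table version
second = dt.optimize.compact()  # should be a no-op if compaction were idempotent

# With idempotent compaction, the second run would report 0 files added/removed.
print(first["numFilesAdded"], first["numFilesRemoved"])
print(second["numFilesAdded"], second["numFilesRemoved"])
```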

What you expected to happen: I don't think this is a trivial problem to solve (you would need to write temporary parquet files to learn the actual output sizes, and then run optimize in a loop based on those actual sizes?), so I would be satisfied with just removing the idempotency claim from the docs.

sherlockbeard commented 23 hours ago

Removing the idempotency claim from the docs would be good. I tried the #2591 example in Spark Delta Lake and got the same result: 4 files after one OPTIMIZE, 2 files after a second OPTIMIZE.
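
For reference, a sketch of that check in PySpark (assuming the delta-spark package is installed; the path is illustrative and the #2591 table contents are not reproduced here):

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("optimize-idempotency-check")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta-table"  # illustrative path to the example table

for run in (1, 2):
    spark.sql(f"OPTIMIZE delta.`{path}`")
    n = spark.sql(f"DESCRIBE DETAIL delta.`{path}`").collect()[0]["numFiles"]
    print(f"after OPTIMIZE #{run}: {n} files")  # reported above: 4, then 2
```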