What happened:
In the docs, compact is described as idempotent:

"This operation is idempotent; if run twice on the same table (assuming it has not been updated) it will do nothing the second time."
In one of my Delta tables, I noticed this is not true. The optimize algorithm is fairly simple: it groups files into bins based on their recorded sizes, up to the target_size.
However, the file actually written out for a bin is often smaller than bin.total_file_size(), because of the Parquet compression applied when the data is written.
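To illustrate why that breaks idempotency, here is a hypothetical sketch (not the actual delta-rs code) of greedy size-based binning: after one pass, compression shrinks the rewritten files below their bins' logical sizes, so a second pass can pack them into fewer bins. The 70% compression ratio is an assumption for the illustration.

```python
TARGET_SIZE = 104857600  # default target_size

def pack_bins(file_sizes, target=TARGET_SIZE):
    """Group files into bins whose recorded sizes sum to at most `target`."""
    bins, current, total = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if total + size > target and current:
            bins.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        bins.append(current)
    return bins

# First pass: four files pack into three bins.
sizes = [60_000_000, 50_000_000, 45_000_000, 40_000_000]
first_pass = pack_bins(sizes)

# Suppose each written file comes out at ~70% of its bin's logical size;
# the rewritten files now fit into fewer bins, so a second compact
# would not be a no-op.
rewritten = [int(sum(b) * 0.7) for b in first_pass]
second_pass = pack_bins(rewritten)
```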
I ran into a scenario where the bin sizes were the following (where the target size is the default 104857600):

But the actual sizes of the written files were:

which means that you can actually compact the table again.
What you expected to happen:
I don't think this is a trivial problem to solve (you would need to write temporary Parquet files to learn the actual sizes, and run optimize in a loop based on those actual file sizes?), so I would be satisfied with just removing the idempotency claim from the docs.
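As a stopgap on the caller's side, one could loop compaction until a pass rewrites nothing, keying off the metrics the operation returns. A rough sketch of the pattern (the `compact` callable stands in for something like deltalake's optimize; the "numFilesAdded" key name is an assumption):

```python
def compact_until_stable(compact, max_passes=5):
    """Repeatedly run `compact` until a pass adds no new files.

    `compact` is any callable returning a metrics dict containing a
    "numFilesAdded" count; `max_passes` bounds the loop defensively.
    """
    metrics = {"numFilesAdded": -1}
    for _ in range(max_passes):
        metrics = compact()
        if metrics["numFilesAdded"] == 0:
            break
    return metrics

# Stub standing in for a table that stabilizes after two real passes.
passes = iter([{"numFilesAdded": 2}, {"numFilesAdded": 1}, {"numFilesAdded": 0}])
result = compact_until_stable(lambda: next(passes))
```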
Environment
Delta-rs version: 0.17.3
Binding: python