delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Compaction can rewrite files without reducing file count #2591

Open gfredericks opened 2 weeks ago

Environment

Delta-rs version: 0.15.3

Binding: Python

Environment: python 3.9.16


Bug

What happened:

I compacted a table, and it replaced a set of files with a new identically-sized set of files.

What you expected to happen:

I expect compaction to do no work when it would not reduce the file count. Combined with https://github.com/delta-io/delta-rs/issues/2576, this means there is nothing I can do to get a table into a state where I am sure that compaction is a no-op. (Incidentally, the example below also reproduces https://github.com/delta-io/delta-rs/issues/2576: it shows that it can take more than one compaction to get a table into a minimal state.)

How to reproduce it:

import deltalake
import pyarrow as pa

for z in range(10):
    deltalake.write_deltalake(
        './storageloop-table',
        pa.Table.from_pydict(
            {
                "x": pa.array([x % 207 for x in range(1000000)]),
                "y": pa.array([x % 3008 for x in range(1000000)]),
                "z": pa.array([z for _ in range(1000000)]),
            }
        ),
        mode='append',
    )

for _ in range(5):
    dt = deltalake.DeltaTable('./storageloop-table')
    print(f"Table has {len(dt.files())} files pre-compaction")
    # use a small target_size for this toy example so we can
    # reproduce it with smaller data
    stats = dt.optimize.compact(target_size=2**21)
    print(f"Compaction added {stats['numFilesAdded']} files and removed {stats['numFilesRemoved']} files")

Outputs:

Table has 10 files pre-compaction
Compaction added 3 files and removed 9 files
Table has 4 files pre-compaction
Compaction added 2 files and removed 4 files
Table has 2 files pre-compaction
Compaction added 2 files and removed 2 files
Table has 2 files pre-compaction
Compaction added 2 files and removed 2 files
Table has 2 files pre-compaction
Compaction added 2 files and removed 2 files
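
To make the expectation concrete, here is a minimal sketch that flags the runs that rewrote files without shrinking the table. The helper name `reduces_file_count` is hypothetical (it is not part of deltalake); the only things taken from the reproduction are the `numFilesAdded` and `numFilesRemoved` keys and the five pairs of numbers printed above.

```python
def reduces_file_count(stats: dict) -> bool:
    """Return True if a compaction run left the table with fewer files.

    Hypothetical helper: `stats` is assumed to be a dict shaped like the
    metrics used in the reproduction script.
    """
    return stats["numFilesAdded"] < stats["numFilesRemoved"]


# The five runs from the output above, in order.
runs = [
    {"numFilesAdded": 3, "numFilesRemoved": 9},
    {"numFilesAdded": 2, "numFilesRemoved": 4},
    {"numFilesAdded": 2, "numFilesRemoved": 2},
    {"numFilesAdded": 2, "numFilesRemoved": 2},
    {"numFilesAdded": 2, "numFilesRemoved": 2},
]

# 1-based indices of runs that rewrote data without reducing the file count.
wasted_runs = [i for i, s in enumerate(runs, start=1) if not reduces_file_count(s)]
print(f"Runs that did not reduce the file count: {wasted_runs}")
```

By this check, runs 3 through 5 are the ones I would expect compaction to skip entirely.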

More details:

I have not looked at the implementation, but my guess is that the compaction algorithm decides it can merge the two files and issues a write of a single file to the table, and that some lower-level mechanism then splits the output back into two files.
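
The guess can be illustrated with a toy model. This is not delta-rs code, and nothing in it is taken from the implementation: the function names, the greedy planner, and especially the assumption that the writer rolls over at a smaller limit than the planner's target are all hypothetical, chosen only to show how "plan one merged file, write two" could reach a steady state.

```python
from typing import List, Tuple


def plan_bin(sizes: List[int], target_size: int) -> List[int]:
    """Hypothetical planner: greedily pick files whose combined size fits target_size."""
    selected, total = [], 0
    for size in sizes:
        if total + size <= target_size:
            selected.append(size)
            total += size
    return selected


def write_with_rollover(total_bytes: int, writer_limit: int) -> List[int]:
    """Hypothetical writer: start a new file each time writer_limit bytes are written."""
    written = []
    while total_bytes > 0:
        chunk = min(total_bytes, writer_limit)
        written.append(chunk)
        total_bytes -= chunk
    return written


def compact_once(
    sizes: List[int], target_size: int, writer_limit: int
) -> Tuple[List[int], int, int]:
    """Run one modeled compaction; return (new file sizes, files added, files removed)."""
    selected = plan_bin(sizes, target_size)
    if len(selected) < 2:
        # Nothing to merge, so the model does no work.
        return list(sizes), 0, 0
    remaining = list(sizes)
    for size in selected:
        remaining.remove(size)
    written = write_with_rollover(sum(selected), writer_limit)
    return remaining + written, len(written), len(selected)


# Assumed numbers: two 900-byte files, planner target 2000, writer limit 1000.
files = [900, 900]
for _ in range(3):
    files, added, removed = compact_once(files, target_size=2000, writer_limit=1000)
    print(f"Compaction added {added} files and removed {removed} files")
```

In this model the planner always believes the two files fit into one output, the writer always produces two, and every round rewrites the whole table. If something like this is happening, a planner that accounted for the writer's splitting could skip the run.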