I compacted a table, and it replaced a set of files with a new identically-sized set of files.
What you expected to happen:
I expect compaction to do no work if it is not going to reduce the file count. Combining this issue with https://github.com/delta-io/delta-rs/issues/2576 means that there is nothing I can do to get a table into a state where I am sure that compaction is a NOOP (incidentally the example below is also a reproduction of https://github.com/delta-io/delta-rs/issues/2576, showing that it can take more than one compaction to get a table into a minimal state).
How to reproduce it:
import deltalake
import pyarrow as pa
for z in range(10):
deltalake.write_deltalake(
'./storageloop-table',
pa.Table.from_pydict(
{
"x": pa.array([x % 207 for x in range(1000000)]),
"y": pa.array([x % 3008 for x in range(1000000)]),
"z": pa.array([z for _ in range(1000000)]),
}
),
mode='append',
)
for _ in range(5):
dt = deltalake.DeltaTable('./storageloop-table')
print(f"Table has {len(dt.files())} files pre-compaction")
# use a small target_size for this toy example so we can
# reproduce it with smaller data
stats = dt.optimize.compact(target_size=2**21)
print(f"Compaction added {stats['numFilesAdded']} files and removed {stats['numFilesRemoved']} files")
Outputs:
Table has 10 files pre-compaction
Compaction added 3 files and removed 9 files
Table has 4 files pre-compaction
Compaction added 2 files and removed 4 files
Table has 2 files pre-compaction
Compaction added 2 files and removed 2 files
Table has 2 files pre-compaction
Compaction added 2 files and removed 2 files
Table has 2 files pre-compaction
Compaction added 2 files and removed 2 files
More details:
Without having looked at the implementation, my guess is that the compaction algorithm decides it can merge the two files, and issues a write of a single file to the table, and some lower-level mechanism splits it back up into two files.
Environment
Delta-rs version:
0.15.3
Binding:
Environment: python
3.9.16
Bug
What happened:
I compacted a table, and it replaced a set of files with a new identically-sized set of files.
What you expected to happen:
I expect compaction to do no work if it is not going to reduce the file count. Combining this issue with https://github.com/delta-io/delta-rs/issues/2576 means that there is nothing I can do to get a table into a state where I am sure that compaction is a NOOP (incidentally the example below is also a reproduction of https://github.com/delta-io/delta-rs/issues/2576, showing that it can take more than one compaction to get a table into a minimal state).
How to reproduce it:
Outputs:
More details:
Without having looked at the implementation, my guess is that the compaction algorithm decides it can merge the two files, and issues a write of a single file to the table, and some lower-level mechanism splits it back up into two files.