Closed MartinKolbAtWork closed 1 week ago
This is tracked here: https://github.com/delta-io/delta-rs/issues/2776. Currently we read the logs once to fetch the metadata and protocol actions and then a second time for the add actions, and no caching is done yet.
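To illustrate what caching could look like here, below is a minimal read-through cache sketch. It is not how delta-rs is implemented, and the fix tracked in the linked issue may look quite different. `CommitReader`, `CachedReader`, and `fetches_for_two_passes` are hypothetical names invented for this example, and the "storage" is simulated with a counter instead of a real object store.

```rust
use std::cell::{Cell, RefCell};
use std::collections::HashMap;

/// Hypothetical stand-in for remote storage; counts how often it is hit.
struct CommitReader {
    fetches: Cell<usize>,
}

impl CommitReader {
    fn fetch(&self, path: &str) -> String {
        self.fetches.set(self.fetches.get() + 1);
        format!("contents of {}", path)
    }
}

/// Hypothetical read-through cache: each path is fetched at most once.
struct CachedReader {
    inner: CommitReader,
    cache: RefCell<HashMap<String, String>>,
}

impl CachedReader {
    fn get(&self, path: &str) -> String {
        // Serve from the cache when the file was already read.
        if let Some(hit) = self.cache.borrow().get(path) {
            return hit.clone();
        }
        // Otherwise fetch once and remember the result.
        let data = self.inner.fetch(path);
        self.cache.borrow_mut().insert(path.to_string(), data.clone());
        data
    }
}

/// Simulates two passes over the same commit files (one for protocol and
/// metadata, one for add actions) and returns the number of real fetches.
fn fetches_for_two_passes() -> usize {
    let reader = CachedReader {
        inner: CommitReader { fetches: Cell::new(0) },
        cache: RefCell::new(HashMap::new()),
    };
    let commits = [
        "_delta_log/00000000000000000001.json",
        "_delta_log/00000000000000000002.json",
    ];
    for c in commits.iter() {
        let _ = reader.get(c); // first pass: cache misses
    }
    for c in commits.iter() {
        let _ = reader.get(c); // second pass: served from the cache
    }
    reader.inner.fetches.get()
}
```

With the cache in place, the second pass does not touch the underlying storage, so two passes over two commit files cost two fetches instead of four.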
Environment
Delta-rs version: latest main branch
Bug
What happened: When reading a Delta table via URI (e.g. `DeltaTableBuilder::from_uri`), all JSON files in the `_delta_log` directory that come after the current checkpoint are read twice.

What you expected to happen: When reading a Delta table, all JSON files in the `_delta_log` directory that come after the current checkpoint should be read only once. Especially when the files are accessed remotely in object store buckets, reading them twice is a problem in terms of both performance and cost.

How to reproduce it: Run the unit test `test_load_table_read_delta_log` from my fork: https://github.com/MartinKolbAtWork/delta-rs/commit/e946422488ca37a3d716962f3c49cca3c1e87c2c

The test uses an adapted ObjectStore implementation that logs all file access to stdout. It reads the standard test table from `test/tests/data/simple_table`, and the output shows that the respective JSON files are read twice.

In my analysis, I found that the two reads are triggered by two consecutive steps in `EagerSnapshot::try_new_with_visitor`.

The call to `Snapshot::try_new` triggers the first sequence of reads. https://github.com/delta-io/delta-rs/blob/d68633653f18abf8b60f4dcf03faf3a4663cd541/crates/core/src/kernel/snapshot/mod.rs#L373

The call to `snapshot.files` triggers the second read cascade. https://github.com/delta-io/delta-rs/blob/d68633653f18abf8b60f4dcf03faf3a4663cd541/crates/core/src/kernel/snapshot/mod.rs#L376

In my commit containing the test, I added `println` calls at the respective lines so that these calls appear as reference points in the output.
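The idea behind the instrumented store can be sketched in a few lines. This is a simplified, standard-library-only illustration and not the code from the linked commit: `LogStore`, `MemoryStore`, `CountingStore`, and `simulate_two_pass_load` are hypothetical names, and `LogStore` is a toy trait, not the `ObjectStore` trait from the `object_store` crate. The wrapper counts reads per path instead of printing them, and the two loops stand in for the two read sequences described above.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;

/// Hypothetical, minimal storage interface (NOT the `object_store` API).
trait LogStore {
    fn get(&self, path: &str) -> Option<String>;
}

/// In-memory store holding file contents keyed by path.
struct MemoryStore {
    files: BTreeMap<String, String>,
}

impl LogStore for MemoryStore {
    fn get(&self, path: &str) -> Option<String> {
        self.files.get(path).cloned()
    }
}

/// Wrapper that records how often each path is requested.
struct CountingStore<S: LogStore> {
    inner: S,
    reads: RefCell<BTreeMap<String, usize>>,
}

impl<S: LogStore> CountingStore<S> {
    fn new(inner: S) -> Self {
        CountingStore { inner, reads: RefCell::new(BTreeMap::new()) }
    }

    /// Paths that were read more than once, with their read counts.
    fn duplicate_reads(&self) -> Vec<(String, usize)> {
        self.reads
            .borrow()
            .iter()
            .filter(|(_, n)| **n > 1)
            .map(|(p, n)| (p.clone(), *n))
            .collect()
    }
}

impl<S: LogStore> LogStore for CountingStore<S> {
    fn get(&self, path: &str) -> Option<String> {
        // Record the access, then delegate to the wrapped store.
        *self.reads.borrow_mut().entry(path.to_string()).or_insert(0) += 1;
        self.inner.get(path)
    }
}

/// Simulates a load that walks the commit files twice and reports
/// which files were read more than once.
fn simulate_two_pass_load() -> Vec<(String, usize)> {
    let commits = [
        "_delta_log/00000000000000000001.json",
        "_delta_log/00000000000000000002.json",
    ];
    let mut files = BTreeMap::new();
    for c in commits.iter() {
        files.insert(c.to_string(), "{}".to_string());
    }
    let store = CountingStore::new(MemoryStore { files });
    for c in commits.iter() {
        let _ = store.get(c); // first pass (protocol and metadata)
    }
    for c in commits.iter() {
        let _ = store.get(c); // second pass (add actions)
    }
    store.duplicate_reads()
}
```

In this simulation, `simulate_two_pass_load()` reports both commit files with a read count of 2, which mirrors the doubled entries seen in the test output.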