delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
2.27k stars 403 forks source link

Double-read of json files from _delta_log #2936

Closed MartinKolbAtWork closed 1 week ago

MartinKolbAtWork commented 2 weeks ago

Environment

Delta-rs version: latest main branch


Bug

What happened: When reading a delta table via URI (e.g. DeltaTableBuilder::from_uri), all json files in the _delta_log directory, which are after the current checkpoint are read twice.

What you expected to happen: When reading a delta table, all json files in the _delta_log directory, which are after the current checkpoint should only be read once. Especially when the file access is remotely and accesses object store buckets, reading things twice is an issue both in terms of performance and costs.

How to reproduce it: Start the unit test test_load_table_read_delta_log from my fork: https://github.com/MartinKolbAtWork/delta-rs/commit/e946422488ca37a3d716962f3c49cca3c1e87c2c

The test uses an adapted ObjectStore implementation, which logs all file access to stdout. It reads the standard test table from test/tests/data/simple_table and the output shows that the respective json files are read twice. In my analysis, I could find out that the two reads are triggered from two subsequent steps in EagerSnapshot::try_new_with_visitor. The call to Snapshot::try_new triggers the first sequence of reads. https://github.com/delta-io/delta-rs/blob/d68633653f18abf8b60f4dcf03faf3a4663cd541/crates/core/src/kernel/snapshot/mod.rs#L373 The call to snapshot.files triggers the second read cascade. https://github.com/delta-io/delta-rs/blob/d68633653f18abf8b60f4dcf03faf3a4663cd541/crates/core/src/kernel/snapshot/mod.rs#L376

In my commit containing the test, I augmented the respective lines with println to have these calls as reference in the output.

---- test_load_table_read_delta_log stdout ----
Starting Snapshot::try_new
PrintStore::get: location: Path { raw: "_delta_log/_last_checkpoint" }
PrintStore::list_with_offset
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000004.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000003.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000002.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000001.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000000.json" }
Starting snapshot.files
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000004.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000003.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000002.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000001.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000000.json" }
Finished
ion-elgreco commented 1 week ago

This is tracked here: https://github.com/delta-io/delta-rs/issues/2776. Currently we separately read the logs to fetch the metadata and protocol actions, and separately for the add actions, and there is no caching done yet