Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.05k stars, 139 forks

[Catalogs] [Delta Lake] Add support for reading tables with deletion vectors #1954

Open clarkzinzow opened 6 months ago

clarkzinzow commented 6 months ago

Delta Lake can make row deletions more efficient with deletion vectors. A deletion vector is a "has this row been deleted?" bitmap that avoids rewriting Parquet files whenever a row is deleted. Our Delta Lake reader should support reading Delta Lake tables that contain deletion vectors. This would involve carrying the bitmap into the Parquet scan as a mask and pruning the deleted rows from the scan before they are decoded and materialized.
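To make the masking idea concrete, here is a minimal, stdlib-only sketch. It is not Daft's reader or the Delta Lake on-disk format: the function name `apply_deletion_vector` is hypothetical, a plain Python `set` of file-level row positions stands in for the bitmap, and lists stand in for Parquet row batches. It also filters rows after they are materialized, whereas the goal described above is to prune them before decoding.

```python
from typing import Iterable, Iterator, List, Sequence, Set


def apply_deletion_vector(
    batches: Iterable[Sequence[object]],
    deleted_rows: Set[int],
) -> Iterator[List[object]]:
    """Yield each batch with the deleted rows removed.

    `deleted_rows` holds 0-based row positions within the whole file
    (an assumption for this sketch; it stands in for the real bitmap).
    """
    offset = 0  # file-level position of the first row in the current batch
    for batch in batches:
        # Keep a row only if its file-level position is not marked as deleted.
        kept = [row for i, row in enumerate(batch) if offset + i not in deleted_rows]
        offset += len(batch)
        yield kept


# Hypothetical data: one file read as two batches, with rows 1 and 3 deleted.
batches = [["a", "b", "c"], ["d", "e"]]
deleted = {1, 3}
result = list(apply_deletion_vector(batches, deleted))
print(result)  # [['a', 'c'], ['e']]
```

The running `offset` matters because the bitmap indexes rows by position in the file, not by position in a batch, so the mask has to be translated per batch as the scan advances.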

Resources

kevinzwang commented 2 months ago

This functionality is not currently supported in delta-rs, but it is supported in delta-kernel-rs. Our current plan is to develop a new, all-Rust code path for read_deltalake that uses delta-kernel-rs instead. It will be disabled by default initially, but can be enabled with a parameter, and it will support reading deletion vectors.