Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

[Catalogs] [Delta Lake] Add support for reading tables with deletion vectors #1954

Open clarkzinzow opened 8 months ago

clarkzinzow commented 8 months ago

Delta Lake can make row deletions more efficient with deletion vectors: a deletion vector is a "has this row been deleted" bitmap that avoids rewriting Parquet files whenever a row is deleted. Our Delta Lake reader should support reading Delta Lake tables that contain deletion vectors. This would involve carrying the bitmap into the Parquet scan as a mask and pruning the deleted rows from the scan before they are decoded and materialized.
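To make the masking idea concrete, here is a minimal sketch in plain Python. It is not Daft's or Delta Lake's implementation: the function names are hypothetical, the bitmap is a simple one-bit-per-row `bytearray` rather than the serialized bitmap format Delta Lake actually stores, and the rows are already in memory, so it only illustrates the filtering step, not pruning before Parquet decoding.

```python
from typing import Any, Iterable, Iterator, Sequence


def build_deletion_bitmap(deleted_rows: Iterable[int], num_rows: int) -> bytearray:
    """Pack deleted row positions into a bitmap with one bit per row (hypothetical helper)."""
    bitmap = bytearray((num_rows + 7) // 8)
    for row in deleted_rows:
        if not 0 <= row < num_rows:
            raise IndexError(f"row position {row} is outside the file ({num_rows} rows)")
        bitmap[row // 8] |= 1 << (row % 8)
    return bitmap


def is_deleted(bitmap: bytearray, row: int) -> bool:
    """Return True if the bit for this row position is set."""
    return bool(bitmap[row // 8] & (1 << (row % 8)))


def scan_with_deletion_vector(rows: Sequence[Any], bitmap: bytearray) -> Iterator[Any]:
    """Yield only the rows whose position is not marked as deleted."""
    for position, row in enumerate(rows):
        if not is_deleted(bitmap, position):
            yield row


# Example: a five-row file where rows 1 and 3 were deleted without rewriting the file.
rows = ["a", "b", "c", "d", "e"]
bitmap = build_deletion_bitmap([1, 3], num_rows=len(rows))
surviving = list(scan_with_deletion_vector(rows, bitmap))
```

In a real reader, the mask would be applied inside the Parquet scan so that deleted rows are skipped before decoding, rather than filtered afterwards as in this sketch.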

Resources

kevinzwang commented 4 months ago

This functionality is not currently supported in delta-rs, but it is supported in delta-kernel-rs. Our current plan is to develop a new, all-Rust code path for read_deltalake that uses delta-kernel-rs instead. It will be disabled by default initially, but can be enabled with a parameter, and it will support reading deletion vectors.

datanikkthegreek commented 1 month ago

@kevinzwang Do you guys have an update on the delta-kernel-rs implementation? :) I think the kernel is the most sustainable approach. We are currently evaluating Daft. This would be an important requirement for us because most of our data is stored in Delta tables with deletion vectors enabled.

jaychia commented 1 month ago

We need delta-kernel-rs to implement writes! Otherwise we can't migrate :(

datanikkthegreek commented 1 month ago

@jaychia Thanks for the quick response. Writes are not important for me right now; what matters is reading Delta tables with existing deletion vectors. Based on the comments above, that already seems to be available in Delta Kernel :)

jaychia commented 1 month ago

@kevinzwang any thoughts on a partial migration to delta-kernel-rs for just the reads?

kevinzwang commented 1 month ago

From what I've seen, delta-rs has not made much progress on deletion vector support so far. I'll explore using delta-kernel-rs for reads.

kevinzwang commented 1 month ago

I think we are waiting for delta-kernel-rs to reach a more mature state before we make the engineering effort to switch to it. As a result, Delta reads with deletion vectors won't be in our near-term plans, but I'll check in on this issue again in the future.

In the meantime @datanikkthegreek (or others) if you would be interested in contributing this functionality, I'd be happy to help!

datanikkthegreek commented 3 weeks ago

@kevinzwang Thanks :)

I am not really familiar with Rust, so unfortunately I can't contribute. I am happy to help test the feature once it is in pre-release, though, and give feedback if needed.

kevinzwang commented 3 weeks ago

No worries @datanikkthegreek. We'll let you know once we support this!