Open clarkzinzow opened 8 months ago
This functionality is not currently supported in delta-rs, but it is supported in delta-kernel-rs. Our current plan is to develop a new, all-Rust code path for read_deltalake
that uses delta-kernel-rs instead. It will be disabled by default initially, but can be enabled with a parameter, and it will support reading deletion vectors.
@kevinzwang Do you guys have an update on the delta-kernel-rs implementation? :) I think the kernel is the most sustainable approach. We are currently evaluating Daft, and this is an important requirement for us because most of our data is stored in Delta tables with deletion vectors enabled.
We need delta-kernel-rs to implement writes! Otherwise we can't migrate :(
@jaychia Thanks for the quick response. Writes are not important for me right now. What I need is to read Delta tables that already have deletion vectors. Based on the above, that seems to be available in Delta Kernel already :)
@kevinzwang any thoughts on a partial migration to delta-kernel-rs for just the reads?
From what I've seen, delta-rs has not made much progress on deletion vector support so far. I'll explore using delta-kernel-rs for reads.
I think we are waiting for delta-kernel-rs to reach a more mature state before we make the engineering effort to switch to them. As a result, delta reads with deletion vectors won't be in our near term plans, but I'll check in on this issue again in the future.
In the meantime @datanikkthegreek (or others) if you would be interested in contributing this functionality, I'd be happy to help!
@kevinzwang Thanks :)
I am not really familiar with Rust, so unfortunately I can't help with the implementation. I am happy to test the feature once it is in pre-release, though, and give feedback if needed.
No worries @datanikkthegreek. We'll let you know once we support this!
Delta Lake can improve the efficiency of row deletions with deletion vectors. A deletion vector is a "has this row been deleted" bitmap that avoids rewriting Parquet files whenever a row is deleted. Our Delta Lake reader should support reading Delta Lake tables that contain deletion vectors. This would involve carrying the bitmap into the Parquet scan as a mask and pruning the deleted rows from the scan before they are decoded and materialized.
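To make the idea concrete, here is a minimal sketch of the mask-based pruning described above. It is only an illustration under stated assumptions: a plain Python set of row positions stands in for the deletion vector bitmap, a list stands in for the encoded rows of one Parquet file, and `build_keep_mask` and `scan_with_mask` are hypothetical helper names, not part of Daft, delta-rs, or delta-kernel-rs.

```python
from typing import Callable, Iterable, List, Sequence


def build_keep_mask(num_rows: int, deleted_rows: Iterable[int]) -> List[bool]:
    """Turn deleted row positions into a per-row keep mask (True = keep)."""
    mask = [True] * num_rows
    for idx in deleted_rows:
        if not 0 <= idx < num_rows:
            raise IndexError(f"deleted row {idx} is outside 0..{num_rows - 1}")
        mask[idx] = False
    return mask


def scan_with_mask(
    encoded_rows: Sequence[str],
    keep_mask: Sequence[bool],
    decode: Callable[[str], str],
) -> List[str]:
    """Decode only the rows whose mask entry is True.

    Deleted rows are skipped before `decode` is called, so they are
    never materialized.
    """
    if len(encoded_rows) != len(keep_mask):
        raise ValueError("mask length must match the number of rows")
    return [decode(row) for row, keep in zip(encoded_rows, keep_mask) if keep]


# Hypothetical file with five rows; rows 1 and 3 were deleted.
encoded = ["a", "b", "c", "d", "e"]
mask = build_keep_mask(len(encoded), {1, 3})
print(scan_with_mask(encoded, mask, str.upper))  # prints ['A', 'C', 'E']
```

The point of applying the mask inside the scan, rather than filtering afterwards, is that deleted rows never pay the decode cost. A real reader would presumably apply the mask per row group or page instead of per row.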
Resources