Open wjones127 opened 1 year ago
Question: how do we decide to rewrite vs use delete vector?
This looks like a tradeoff between faster reads and faster writes that needs to be decided case by case? If so, it might be better to just let the user decide based on the expected workload pattern.
+1 to supporting user-owned tradeoff decision. I'm investigating this feature internally and update patterns in individual tables likely dictate the right decision.
For instance, in many dimension tables, edits may be spread randomly through existing data, and merge-on-read will be more efficient. For fact tables with a mostly-append pattern (but occasional fact updates), judicious partitioning plus copy-on-write may be superior.
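To make the "let the user decide" idea concrete, here is a minimal sketch of what a user-owned setting could look like. Every name in it (`DeleteStrategy`, `DeletePolicy`, `choose_strategy`) is hypothetical and not part of delta-rs; the threshold heuristic is only one possible default, not something the protocol prescribes.

```rust
/// How a delete or update is materialized. Hypothetical names, not a delta-rs API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeleteStrategy {
    /// Rewrite every touched data file without the deleted rows.
    CopyOnWrite,
    /// Keep the data files and record deleted row positions in a deletion vector.
    MergeOnRead,
}

/// Hypothetical per-table setting: pin a strategy, or opt into a simple heuristic.
#[derive(Debug, Clone, Copy)]
pub enum DeletePolicy {
    /// Always use the given strategy.
    Fixed(DeleteStrategy),
    /// Rewrite a file only when at least this fraction of its rows is deleted.
    RewriteAbove(f64),
}

/// Picks the strategy for one data file given how many of its rows are deleted.
pub fn choose_strategy(
    policy: DeletePolicy,
    rows_deleted: u64,
    rows_in_file: u64,
) -> DeleteStrategy {
    match policy {
        DeletePolicy::Fixed(strategy) => strategy,
        DeletePolicy::RewriteAbove(threshold) => {
            // An empty file has nothing to rewrite, so treat its fraction as zero.
            let fraction = if rows_in_file == 0 {
                0.0
            } else {
                rows_deleted as f64 / rows_in_file as f64
            };
            if fraction >= threshold {
                DeleteStrategy::CopyOnWrite
            } else {
                DeleteStrategy::MergeOnRead
            }
        }
    }
}
```

A dimension table with scattered edits might use `DeletePolicy::Fixed(DeleteStrategy::MergeOnRead)`, while a well-partitioned fact table might pin `CopyOnWrite`.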
Don't know if this helps, just tried to read a deletion vector file, and this seems to be working with the roaring crate:
use std::fs::File;
use std::io::{ErrorKind, Read};

use roaring::RoaringTreemap;

fn get_deletion_vectors(
    filename: &str,
) -> Result<Vec<RoaringTreemap>, Box<dyn std::error::Error + Send + Sync>> {
    let mut file = File::open(filename)?;

    // The file starts with a single format-version byte, which must be 1.
    let mut version = [0u8; 1];
    file.read_exact(&mut version)?;
    assert_eq!(version[0], 1);

    let mut vectors = Vec::new();
    loop {
        // Each vector starts with a 4-byte big-endian size.
        // Hitting end-of-file here means there are no more vectors.
        let mut size_buf = [0u8; 4];
        match file.read_exact(&mut size_buf) {
            Ok(()) => {}
            Err(e) if e.kind() == ErrorKind::UnexpectedEof => return Ok(vectors),
            Err(e) => return Err(e.into()),
        }
        let datasize = u32::from_be_bytes(size_buf) as usize;

        // The data begins with a 4-byte little-endian magic number.
        let mut magic_buf = [0u8; 4];
        file.read_exact(&mut magic_buf)?;
        assert_eq!(i32::from_le_bytes(magic_buf), 1681511377);

        // The remaining `datasize - 4` bytes hold the serialized 64-bit roaring bitmap.
        let payload_len = datasize
            .checked_sub(4)
            .ok_or("deletion vector is smaller than its magic number")?;
        let mut data = vec![0u8; payload_len];
        file.read_exact(&mut data)?;
        vectors.push(RoaringTreemap::deserialize_from(&data[..])?);

        // A 4-byte checksum follows the data; it is skipped here, not verified.
        let mut checksum_buf = [0u8; 4];
        file.read_exact(&mut checksum_buf)?;
    }
}
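For anyone who wants to inspect the file framing without pulling in the roaring crate, the following is a rough standard-library-only sketch. It assumes the layout described in the Delta protocol's deletion vector file format: one version byte, then for each vector a 4-byte big-endian size, the data (which starts with the little-endian magic number used in the snippet above), and a 4-byte checksum. The function name `split_deletion_vectors` is made up for illustration, and the checksum is only skipped, not verified.

```rust
/// Magic number that starts every serialized deletion vector.
const DV_MAGIC: i32 = 1681511377;

/// Reads 4 bytes at `pos`, or fails if the input is too short.
fn take4(bytes: &[u8], pos: usize) -> Result<[u8; 4], String> {
    let s = bytes
        .get(pos..pos + 4)
        .ok_or_else(|| format!("truncated input at offset {pos}"))?;
    Ok([s[0], s[1], s[2], s[3]])
}

/// Splits the bytes of a deletion vector file into one payload per vector.
/// Each payload is the serialized bitmap without the magic number or checksum.
pub fn split_deletion_vectors(bytes: &[u8]) -> Result<Vec<&[u8]>, String> {
    // Byte 0 is the format version.
    match bytes.first() {
        Some(1) => {}
        Some(v) => return Err(format!("unsupported version {v}")),
        None => return Err("empty input".to_string()),
    }

    let mut pos = 1;
    let mut payloads = Vec::new();
    while pos < bytes.len() {
        // Size of the data (magic number included, checksum excluded), big-endian.
        let size = u32::from_be_bytes(take4(bytes, pos)?) as usize;
        if size < 4 {
            return Err(format!("vector at offset {pos} is smaller than its magic number"));
        }
        let magic = i32::from_le_bytes(take4(bytes, pos + 4)?);
        if magic != DV_MAGIC {
            return Err(format!("bad magic number at offset {}", pos + 4));
        }

        let data_end = pos + 4 + size;
        let payload = bytes
            .get(pos + 8..data_end)
            .ok_or_else(|| format!("truncated vector at offset {pos}"))?;
        // The 4-byte checksum must be present; its value is not checked here.
        take4(bytes, data_end)?;

        payloads.push(payload);
        pos = data_end + 4;
    }
    Ok(payloads)
}
```

Each returned slice could then be handed to a bitmap deserializer such as `RoaringTreemap::deserialize_from`.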
Would you accept a PR that adds the required metadata as a first step?
Hi @aersam - first of all, thanks for the code snippet; it actually saved me a bit of time working on this elsewhere.
In principle we always welcome contributions. In this case we do as well, but there is one caveat: elsewhere we are currently working hard on getting delta-kernel for Rust released, which will hopefully boost our protocol support significantly.
The more complex part is that, in order to support deletion vectors, we have to either support reader v3 and writer v7 (i.e. table features) or support a whole bunch of other Delta features as well.
The good news is that we are actively working on it, but since this involves some larger blocks of work, it's likely going to be a few weeks before this can fully land...
With all that said, if you would benefit from having some intermediate partial support, I'd be happy to review PRs :)
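As a rough illustration of why table features matter here: with reader v3, a table lists the features a reader must understand (`readerFeatures`), and a reader has to refuse the table if any of them is unknown to it. The sketch below mirrors the protocol action's fields, but the struct and the `check_readable` function are hypothetical and not the delta-rs implementation.

```rust
/// Subset of a table's protocol action (minReaderVersion, minWriterVersion,
/// readerFeatures, writerFeatures). Hypothetical struct for illustration.
pub struct Protocol {
    pub min_reader_version: i32,
    pub min_writer_version: i32,
    pub reader_features: Vec<String>,
    pub writer_features: Vec<String>,
}

/// Decides whether a reader may read the table.
/// `max_legacy_reader_version` is the highest pre-table-features reader
/// version this reader implements; `supported` lists the reader features it knows.
pub fn check_readable(
    protocol: &Protocol,
    max_legacy_reader_version: i32,
    supported: &[&str],
) -> Result<(), String> {
    match protocol.min_reader_version {
        // Legacy versions: everything up to the implemented version is readable.
        v if v < 3 => {
            if v <= max_legacy_reader_version {
                Ok(())
            } else {
                Err(format!("reader version {v} is not supported"))
            }
        }
        // Table features: every listed reader feature must be understood.
        3 => {
            let missing: Vec<&str> = protocol
                .reader_features
                .iter()
                .map(String::as_str)
                .filter(|feature| !supported.contains(feature))
                .collect();
            if missing.is_empty() {
                Ok(())
            } else {
                Err(format!("unsupported reader features: {}", missing.join(", ")))
            }
        }
        v => Err(format!("unknown reader version {v}")),
    }
}
```

A table written with deletion vectors enabled would carry `deletionVectors` in `reader_features`, so a reader that does not list it in `supported` has to reject the table rather than silently return deleted rows.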
Well, if it's a matter of weeks, I can wait. I know that column mapping would actually come first; I just thought this can't be that hard ;)
I did not know about delta-kernel for Rust, and I'm really glad to hear about it! To be honest, I was a bit disappointed, as I thought it would be in Java. Nothing against Java, but I much prefer Rust, especially for embedding. Where can I find the code for delta-kernel/rust? Just to follow it a bit.
Btw, I also corrected the snippet; it had a bug when there are multiple vectors within a file.
@roeap where can one follow the Delta kernel initiatives? I saw https://github.com/delta-io/delta/issues/1783 but that's not Rust-specific, right? Will it happen in this repo or will there be a delta-kernel-rs?
Trying to get the metadata running here: https://github.com/bmsuisse/delta-rs/tree/deletion_vector_meta
Once you have the metadata, you could use it, for example, together with DuckDB's read_parquet([parquets...], file_row_number=True) to read tables with deletion vectors.
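The filtering step itself is small: given each row's position within its Parquet file, drop the rows whose position is marked as deleted. Here is a minimal sketch, using a `BTreeSet<u64>` as a stand-in for the deserialized bitmap so that it has no dependencies; `apply_deletion_vector` is a made-up name.

```rust
use std::collections::BTreeSet;

/// Keeps only the rows whose position in the file is not marked as deleted.
/// `rows` pairs each row with its file row number (for example, the column
/// produced by DuckDB's `file_row_number` option); `deleted` stands in for the
/// positions stored in the file's deletion vector.
pub fn apply_deletion_vector<T>(rows: Vec<(u64, T)>, deleted: &BTreeSet<u64>) -> Vec<T> {
    rows.into_iter()
        .filter(|(row_number, _)| !deleted.contains(row_number))
        .map(|(_, row)| row)
        .collect()
}
```

With a real bitmap the `contains` check stays the same; only the set type changes.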
FWIW, Fabric Data Warehouse just added support for deletion vectors, and suddenly the Delta tables it produces are no longer compatible with delta-rs :(
Is this feature still on the roadmap? Tables produced by recent Databricks runtimes include deletion vectors by default, so it seems to me that reading them through Rust-based solutions like polars is not currently possible natively.
Running into the same issue: the latest Databricks runtime has deletion vectors enabled by default and our admin won't turn it off. This breaks our Python code that reads with DeltaTable or polars.
As a temporary workaround, DuckDB does support reading Delta tables with deletion vectors through its delta extension, which is based on delta-kernel rather than delta-rs.
Description
For protocol version 3, we will want to support deletion vectors.
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vectors
Question: how do we decide to rewrite vs use delete vector?
Use Case
This enables much faster deletes.
Related Issue(s)
Prerequisites:
#930
#832