delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Support deletion vector #1094

Open wjones127 opened 1 year ago

wjones127 commented 1 year ago

Description

For protocol version 3, we will want to support deletion vectors.

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vectors

Question: how do we decide whether to rewrite a file or use a deletion vector?

Use Case

This enables much faster deletes.

Related Issue(s)

Prerequisites:

houqp commented 1 year ago

Question: how do we decide whether to rewrite a file or use a deletion vector?

This looks like a tradeoff between faster reads and faster writes that needs to be decided case by case. If so, it might be better to just let the user decide based on the expected workload pattern.

guyrt commented 1 year ago

+1 to supporting a user-owned tradeoff decision. I'm investigating this feature internally, and update patterns in individual tables likely dictate the right decision.

For instance, in many dimension tables, edits may be spread randomly through existing data, and merge on read will be more efficient. For fact tables with a mostly append pattern (but occasional fact updates), judicious partitioning plus copy on write may be superior.
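
To make the tradeoff concrete, here is a minimal sketch of what a user-selected delete mode could look like. Every name in it (`DeleteMode`, `FileAction`, `plan_file`) is hypothetical and is not part of the delta-rs API. It also assumes that `deleted_rows` holds unique row indexes for a single data file.

```rust
/// How a delete should be materialized. Hypothetical names, not delta-rs API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeleteMode {
    /// Rewrite every touched data file without the deleted rows.
    CopyOnWrite,
    /// Keep data files and record deleted row indexes in a deletion vector.
    MergeOnRead,
}

/// What a writer would do for one data file touched by a delete.
#[derive(Debug, PartialEq, Eq)]
pub enum FileAction {
    /// No row matched: leave the file alone.
    Keep,
    /// Every row matched: drop the file from the table.
    Remove,
    /// Rewrite the file with only the surviving rows.
    Rewrite { surviving_rows: u64 },
    /// Attach a deletion vector marking these row indexes as deleted.
    DeletionVector { deleted_rows: Vec<u64> },
}

/// Picks the action for one file. Assumes `deleted_rows` contains unique indexes.
pub fn plan_file(mode: DeleteMode, total_rows: u64, deleted_rows: Vec<u64>) -> FileAction {
    let deleted = deleted_rows.len() as u64;
    if deleted == 0 {
        FileAction::Keep
    } else if deleted >= total_rows {
        FileAction::Remove
    } else {
        match mode {
            DeleteMode::CopyOnWrite => FileAction::Rewrite {
                surviving_rows: total_rows - deleted,
            },
            DeleteMode::MergeOnRead => FileAction::DeletionVector { deleted_rows },
        }
    }
}
```

With this shape, the mode is an input chosen by the caller per table or per operation, so the library does not have to guess the workload pattern.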

aersam commented 1 year ago

Don't know if this helps, but I just tried to read a deletion vector file, and it seems to work with the roaring crate:


use std::fs::File;
use std::io::{BufReader, ErrorKind, Read};

use roaring::RoaringTreemap;

/// Reads every deletion vector stored in one deletion vector file.
///
/// Layout per the Delta protocol: a 1-byte format version, then for each
/// vector a 4-byte big-endian size, the bitmap data, and a 4-byte checksum.
fn get_deletion_vectors(
    filename: &str,
) -> Result<Vec<RoaringTreemap>, Box<dyn std::error::Error + Send + Sync>> {
    let mut file = BufReader::new(File::open(filename)?);

    // Format version, currently always 1.
    let mut version = [0u8; 1];
    file.read_exact(&mut version)?;
    assert_eq!(version[0], 1);

    let mut vectors = Vec::new();
    loop {
        // Size of the serialized bitmap (magic number included), big endian.
        let mut size_buf = [0u8; 4];
        match file.read_exact(&mut size_buf) {
            Ok(()) => {}
            // End of file: no more vectors (a truncated size field ends the loop too).
            Err(e) if e.kind() == ErrorKind::UnexpectedEof => return Ok(vectors),
            Err(e) => return Err(e.into()),
        }
        let datasize = u32::from_be_bytes(size_buf) as usize;

        // Read the whole bitmap up front so the stream stays aligned with the
        // checksum, even if the deserializer does not consume every byte.
        let mut data = vec![0u8; datasize];
        file.read_exact(&mut data)?;

        // CRC-32 of the bitmap data; read here but not verified.
        let mut checksum = [0u8; 4];
        file.read_exact(&mut checksum)?;

        // Skip entries too short to hold the magic number.
        if data.len() < 4 {
            continue;
        }

        // Magic number of the serialized bitmap array, little endian.
        let magic = i32::from_le_bytes(data[..4].try_into()?);
        assert_eq!(magic, 1681511377);

        vectors.push(RoaringTreemap::deserialize_from(&data[4..])?);
    }
}
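
The snippet above reads the 4-byte checksum that follows each vector but never checks it. Below is a minimal, dependency-free sketch of that check. It assumes the checksum is a standard CRC-32 (the IEEE polynomial, as in zlib) computed over the full bitmap data including the magic number, and stored big endian like the size field. The helper names `crc32` and `checksum_matches` are made up for this illustration.

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            // Shift right and XOR in the polynomial if the low bit was set.
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Compares the stored checksum with the CRC-32 of the bitmap data.
/// Assumption: the stored checksum is big endian.
pub fn checksum_matches(bitmap_data: &[u8], stored: [u8; 4]) -> bool {
    crc32(bitmap_data) == u32::from_be_bytes(stored)
}
```

A table-driven CRC or an existing crate would be faster; the bitwise loop is only meant to keep the sketch self-contained.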

aersam commented 1 year ago

Would you accept a PR that adds the required metadata as a first step?

roeap commented 1 year ago

Hi @aersam - first of all, thanks for the code snippet; it actually saved me a bit of time working on this elsewhere.

In principle we always welcome contributions. In this case we do too, but there is one caveat. Elsewhere we are currently working hard on getting delta-kernel for Rust released, which will hopefully significantly boost our protocol support.

The more complex part is that, in order to support deletion vectors, we have to either support reader v3 and writer v7 (i.e. table features) or support a whole bunch of other Delta features as well.

The good news is we are actively working on it, but since this involves some larger blocks of work, it's likely going to be a few weeks before this can fully land...

With all that said, if you would benefit from having some intermediate partial support, I'd be happy to review PRs :)
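
As a rough illustration of what "table features" means for a reader: at reader version 3 the `protocol` action lists the required reader features by name, and a client has to support all of them before it can read the table. The sketch below uses hypothetical types (`Protocol`, `unsupported_reader_features`), not the delta-rs or delta-kernel API, and it ignores the legacy rules for reader versions below 3.

```rust
use std::collections::HashSet;

/// Subset of the Delta `protocol` action that matters to a reader (hypothetical type).
pub struct Protocol {
    pub min_reader_version: i32,
    /// Feature names such as "deletionVectors"; expected at reader version 3.
    pub reader_features: Option<Vec<String>>,
}

/// Returns the required reader features this client does not support.
/// An empty result means table features do not block reading.
/// Versions below 3 imply fixed feature sets, which this sketch does not model.
pub fn unsupported_reader_features(protocol: &Protocol, supported: &HashSet<&str>) -> Vec<String> {
    match &protocol.reader_features {
        Some(features) if protocol.min_reader_version >= 3 => features
            .iter()
            .filter(|f| !supported.contains(f.as_str()))
            .cloned()
            .collect(),
        _ => Vec::new(),
    }
}
```

For a table that requires `deletionVectors` and `columnMapping`, a client that only supports `columnMapping` would get `["deletionVectors"]` back and should refuse to read.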

aersam commented 1 year ago

Well, if it's a matter of weeks, I can wait. I know that column mapping would actually come first; I just thought this couldn't be that hard ;)

I did not know about delta-kernel for Rust, and I'm really glad to hear about it! To be honest, I was a bit disappointed, as I thought it would be in Java. Nothing against Java, but I much prefer Rust, especially for embedding. Where do I find the code for delta-kernel/rust? Just to observe it a bit.

Btw, I also corrected the snippet; it had a bug when there are multiple vectors within a file.

alippai commented 1 year ago

@roeap where can one follow the Delta kernel initiatives? I saw https://github.com/delta-io/delta/issues/1783 but that's not Rust-specific, right? Will it happen in this repo, or will there be a delta-kernel-rs?

aersam commented 1 year ago

Trying to get the metadata running here: https://github.com/bmsuisse/delta-rs/tree/deletion_vector_meta
Once you have the metadata, you could use it, for example, together with duckdb's read_parquet([parquets...],file_row_number=True) to read tables with deletion vectors.
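
The idea of combining row numbers with a deletion vector can be sketched as follows: keep only the rows whose position in the file is not in the vector. To stay dependency-free, the sketch uses a `BTreeSet<u64>` as a stand-in for the deserialized `RoaringTreemap` (whose `contains` method would play the same role). It assumes row numbers are 0-based positions within a single data file. The function name `apply_deletion_vector` is made up for this example.

```rust
use std::collections::BTreeSet;

/// Keeps only the rows whose position in the file is not marked as deleted.
/// `deleted` is a stand-in for a deserialized deletion vector.
pub fn apply_deletion_vector<T>(rows: Vec<T>, deleted: &BTreeSet<u64>) -> Vec<T> {
    rows.into_iter()
        .enumerate()
        // `row_number` is the 0-based position of the row within the file.
        .filter(|(row_number, _)| !deleted.contains(&(*row_number as u64)))
        .map(|(_, row)| row)
        .collect()
}
```

In a query engine the same filter would be a predicate on the row-number column rather than a pass over materialized rows.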

djouallah commented 12 months ago

FWIW, Fabric Data Warehouse just added support for deletion vectors, and suddenly the Delta tables it produces are no longer compatible with delta-rs :(

boccileonardo commented 2 months ago

Is this feature still on the roadmap? Tables produced by recent Databricks runtimes include deletion vectors by default, so it seems to me that reading them natively through Rust-based solutions like polars is not currently possible.

dylan-lee94 commented 2 months ago

Running into the same issue: the latest Databricks runtime has deletion vectors enabled by default, and our admin won't turn it off. This breaks our Python code that reads with DeltaTable or polars.

djouallah commented 2 months ago

Running into the same issue: the latest Databricks runtime has deletion vectors enabled by default, and our admin won't turn it off. This breaks our Python code that reads with DeltaTable or polars.

As a temporary workaround, duckdb does support reading Delta tables with deletion vectors, using its delta extension, which is based on delta kernel rather than delta-rs.