Cydhra / vers

Succinct data structures using very efficient rank and select
Apache License 2.0
62 stars 4 forks source link

WaveletMatrix #4

Closed somethingelseentirely closed 3 months ago

somethingelseentirely commented 6 months ago

It would be nice if there was a WaveletMatrix implementation, as it would enable the most common succinct-datastructure applications in text-processing and databases.

Cydhra commented 6 months ago

I never used or worked with those, but from skimming the paper, it seems a lot less painful than my attempts at implementing succinct trees. Adding a feature that actually has use cases and is easier to implement definitely sounds more fun. I am pretty tied up right now, but when I eventually have some free time again, I'll give it a shot. Do sucds and cseq implement Wavelet Trees or Matrices?

somethingelseentirely commented 6 months ago

Yeah, they are both implementing WaveletMatrices. I think the main insight for wavelet matrices comes from the fact that the total number of bits in each level stays constant for a balances wavelet tree (this invariant doesn't hold for compressed versions like huffman trees), which makes them also a lot easier to implement.

I'd be happy to also give an implementation a shot, but I'd probably aready try to build it with zerocopy deserializability in mind (#5) if that's ok with you. Because I need this for an on-disk db storage format.

Cydhra commented 6 months ago

Yeah feel free to. Zero copy deserializability is a feature that I'd like, so I don't think there is a reason to keep the two issues disjoint.

Cydhra commented 4 months ago

Implementing it with zero-copy deserializability in mind is obviously possible, but not really necessary. While it is true, that the data structures are immutable, they do not freely lend themselves to zero-copy (de)serialization. Zero-copy deserialization has to mind the endianess of the backing system. That is, each zero copy data structure needs to have multiple versions which handle either endianness, with different numeric types. Mapping, for example, a big-endian encoded file into a little-endian machine works, but the data types must be handled differently because of the inverted byte order. In contrast, converting the big-endian structure into little-endian on the fly is also possible, but destroys the zero-copy property. We cannot just write a data structure that is already the archived version, which rkyv usually provides, because of that.

I started implementing a WaveletMatrix on the branch dev_wavelet. The data structure, just like the other data structures in this library is agnostic to zero-copy deserialization. If zero-copy serialization were to be implemented for it, more data types that represent either endianness (among other things like #[repr(C)]) need to be added. The only thing that is possible to already do here, is to add the functionality of the WaveletMatrix via a trait, to later reuse code for the zero copy versions.

Zero copy versions of other data structures, as outlined in #5, need these traits too, inducing potentially breaking changes.

somethingelseentirely commented 4 months ago

Thanks for pushing this! I carry the wavelet matrix paper around with me in my backpack, but there were other fires that popped up that need to be put out first. 😓

I've also recently encountered the endianness issue while working with candle and model weight storage formats like SafeTensor. The approach of all Rust libraries and storage formats I've encountered seems to be to just assume that all of the relevant platforms are/support little-endian these days.

I would still want to make sure that that is the case and add something like:

#[cfg(not(target_endian = "little"))]
compile_error!("Zero copy is only supported on LE architectures.)

But after some philosophical wrangling with my inner perfectionist, I've concluded that this is a valid approach 😅. There are few BE-only systems supported by Rust, and they are all embedded systems that don't have SIMD and maybe don't even have a file system let alone the ability to mmap files.

Edit: To expand a bit on how this should make zero copy pretty easy to implement. Given that the data-structures are already immutable, they would only need to be repr(C) and derive(FromBytes, FromZeroes, AsBytes) from the zerocopy crate. Due to native number types already being the correct endian thanks to the above check, they don't need to use the byteorder types, and can just use rusts native number types.

Cydhra commented 4 months ago

To expand a bit on how this should make zero copy pretty easy to implement. Given that the data-structures are already immutable, they would only need to be repr(C) and derive(FromBytes, FromZeroes, AsBytes) from the zerocopy crate.

Maybe it is easier when using zerocopy rather than rkyv. However, there is unsafe code in vers that may or may not play nicely with the safety assumptions of zerocopy.

I still strongly prefer to have this issue decoupled from zero-copying (despite what I said two months ago), but I'd be happy to see a proof of concept with zerocopy.

somethingelseentirely commented 4 months ago

Definitely fair. With zerocopy there should be virtually no considerations to make while implementing this, and the struct can be made zero-copy later on. I think the only limitation is that once zero copy is implemented the struct's can't be changed anymore/need to be versioned, but that's not really relevant for implementing the wavelet matrix itself.

Cydhra commented 4 months ago

That is unfortunately already true because of serde (I think). Changes that need to be made are collected in #9