Cydhra / vers

Succinct data structures using very efficient rank and select
Apache License 2.0

Zero-Copy Serialization/Deserialization #5

Open somethingelseentirely opened 6 months ago

somethingelseentirely commented 6 months ago

The pointer-free nature of succinct data-structures makes them very amenable to (de)serialization by simply casting their memory to/from a bunch of bytes.

Not only would this remove most (de)serialization costs, it could also enable very fast and simple on-disk storage when combined with mmap.

One might want to implement this via rkyv, but simply providing a safe transmute to and from bytes::Bytes (with a potential rkyv implementation on top of that) might be the simpler, more agnostic solution.

Cydhra commented 6 months ago

rkyv looks good, and adding that as an optional dependency (because I am quite keen on keeping it zero-dependencies) might be an option. I'll look further into it, because it seems like a nice addition. Doing it myself by just casting raw data sounds painful though, because while rkyv offloads endianness somewhere into the frontend (i.e. the user has to decide what to do, if I read it correctly), I'd have to handle that if I implement it myself (even if I do the same, I still have to keep it in mind). Maybe I am overthinking that though, I don't know.

Cydhra commented 6 months ago

When adding a new serialization framework, it's worth thinking about the serialization-breaking change of reducing stack size by changing Vec into Box in immutable data structures.

somethingelseentirely commented 6 months ago

Endianness is a good point. But I think rkyv handles it somewhat ungracefully, by using feature flags, and I'm not sure what happens when you use two libraries that transitively use both archive_be and archive_le.

Cydhra commented 6 months ago

No, using features is actually convenient for me, because I just disable all features, and let the downstream crate decide.

somethingelseentirely commented 6 months ago

I think the same late binding could be achieved with generic parameters though? Without the conflict problem where different pieces of code want different endianness. To give an example, my use case requires the ability to reproduce these archives bit-perfectly, so that they can be hashed/checksummed. I could imagine a scenario where most of the system actually wants to go with native endianness for performance, but goes with BE for those particular data structures.

Cydhra commented 6 months ago

I mean, as long as you don't import serialized data from systems with opposite endianness, using only native endianness shouldn't create any issues, no?

somethingelseentirely commented 6 months ago

Well, if you're using the (de)serialization as a way to create a data-exchange/file format (think .jpg, not application-instance-specific .dat), then that format will want to decide on some endianness. In my case it's a file format for knowledge graphs, with the added bonus that you can query it without having to build any indexes first, just mmap and go. So it's always going to be BE.

The Stable Cross-Platform Database File section in the SQLite documentation is probably the best description of that use case. Avoiding breaking changes caused by the way rkyv stores things is also an argument for rolling our own framework agnostic data layout.

Edit: Btw it's also completely fine if such a use-case doesn't align with the project goals πŸ˜„

Cydhra commented 6 months ago

Okay, I see where you are coming from, but exporting into pre-defined file formats using zero-copy serialization seems difficult.

For example, let's pretend you write a Wavelet Matrix that looks like the database file and can be directly serialized and deserialized to and from that format. If zero-copy deserialization is a goal, the rank/select support structures still need to be written to the file, which interleaves the data format with junk data. To solve that, the helper data structures would need to be excluded from the Archive format, but then the deserialization process needs to recalculate them, which negates any speedup gained from the zero-copy process.

somethingelseentirely commented 6 months ago

I'm not sure I'm following. Do you mean undefined data caused by padding / uninitialised memory? Rkyv for example zeroes these to get determinism and also to avoid accidentally leaking memory contents.

Rkyv has an unsafe way to deserialize without any checks btw, but the default is a validation/sanity-check step on read. So it's not just transmute-and-go. I also think that it would be ok to buy cheap deserialization with a more expensive serialization step that does sanitisation and cleanup (like zeroing memory), as I think that most use cases for something like this are write-once-read-many.

On a more philosophical level, and ignoring the more difficult mutable/growable succinct data structures for now, I feel that the static/write-once-ness of most succinct data structures makes them an interesting special case. I'm not certain myself where I would pinpoint "serialization" in their lifecycle: is it at the point where they are constructed from their mutable "builder" precursor, or at the point where one actually calls serialize?

Similarly, having sanitisation performed only at serialization might be worthwhile, so that people who don't care about serialization don't have to pay for it, but on the other hand it might actually be cheaper to initialize the data structure in a sanitised state, e.g. by calling alloc_zeroed.

Cydhra commented 6 months ago

Do you mean undefined data caused by padding / uninitialised memory?

No, for example RsVec, the bit vector that supports rank and select (which enables Wavelet Trees), looks like this:

pub struct RsVec {
    data: Vec<u64>,
    len: usize,
    blocks: Vec<BlockDescriptor>,
    super_blocks: Vec<SuperBlockDescriptor>,
    select_blocks: Vec<SelectSuperBlockDescriptor>,
    rank0: usize,
    rank1: usize,
}

And I assume your data format only wants the data bits, but does not take the super_blocks and select_blocks into account (because why would the data format specify those). So if you serialize the WaveletTree into your data format, the actual bits in the vector (data) will sit next to all the bits of the supporting structures (blocks, super_blocks, select_blocks), which don't belong in your target format.

There are actually even more problems with this approach:

So all in all, this is unfortunately not an easy issue: a lot of code needs to be written to support rkyv with a big-endian serializer, and the operations need to be implemented manually in a way that lets you call them on both RsVec and ArchivedRsVec.

somethingelseentirely commented 6 months ago

I would question that assumption and ask why shouldn't they be included? They are an essential component of the data structure and enable the nice time complexities. So not storing them would be a bit like storing only the leaves of a B-Tree. More importantly, since succinct data structures have a space complexity of $Z + o(Z)$, that support data is "asymptotically free", i.e. $\lim \frac{\text{support}}{\text{data}} = 0$, and it costs "nothing" to store it.

The only pitfall is that the support data should be deterministic.

Your points are related to what I tried to express with my previous musings, wondering at which moment the serialization happens. I don't think that a separation between ArchivedRsVec and RsVec, as Rkyv suggests, is useful as they both have the same immutable-write-once property. In my mind RsVec is more analogous to an ArchivedBitVec, since a writable BitVec is serialized/archived into a read-only RsVec, with an analogous deserialization from RsVec to BitVec.

With that line of thought, I would make RsVec (but not BitVec) parametric over the endianness <E>, always store E integers in the RsVec's memory/support structures, and have every integer access go through the appropriate conversion.
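
A rough sketch of what I mean (hugely simplified; the trait and its methods are made up, not existing vers API):

use std::marker::PhantomData;

/// How stored words are converted on access.
pub trait Endianness {
    fn read(raw: u64) -> u64;
    fn write(value: u64) -> u64;
}

pub struct BigEndian;
pub struct NativeEndian;

impl Endianness for BigEndian {
    fn read(raw: u64) -> u64 { u64::from_be(raw) }
    fn write(value: u64) -> u64 { value.to_be() }
}

impl Endianness for NativeEndian {
    fn read(raw: u64) -> u64 { raw }
    fn write(value: u64) -> u64 { value }
}

/// All words (data and support structures) are stored in E's byte order;
/// every access goes through the conversion, which compiles to a no-op for NativeEndian.
pub struct RsVec<E: Endianness> {
    data: Vec<u64>,
    _endianness: PhantomData<E>,
}

impl<E: Endianness> RsVec<E> {
    fn word(&self, index: usize) -> u64 {
        E::read(self.data[index])
    }
}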

I suspect this isn't as expensive as it sounds:

And yes, this would probably be a major rewrite of RsVec, or create a new version next to it.

Cydhra commented 6 months ago

I would question that assumption and ask why shouldn't they be included?

Perhaps I still don't understand what you are doing, but it sounded like you want the data structure to serialize into an existing format (you mentioned jpg as an example for a specced format, and I assume you mean a database index format).

Obviously, the support data structures are proprietary and thus do not adhere to any existing format specification, hence my concerns.

But now it seems you don't want to do that.

I don't think that a separation between ArchivedRsVec and RsVec, as Rkyv suggests, is useful as they both have the same immutable-write-once property.

This is not the reason why rkyv suggests this pattern. Zero-copy can only be achieved if the data structure is immutable, contains no pointers, and contains no allocating data structures (like Vec).

RsVec is immutable, but it does contain pointers (and currently it also contains Vecs).

Finally, the largest hurdle that ArchivedRsVec tries to overcome is, again, endianness:

If you want to store the vector as big endian and then mmap it into memory, you cannot recreate a RsVec with zero-copy deserialization from it, because RsVec must use the native endianness (which commonly is little endian). So to avoid a copy, you have to keep the data in ArchivedRsVec and handle the endianness by duplicating the functionality of RsVec for that struct.

Edit: I am not saying it is impossible btw, I am just saying it involves major refactoring of the entire code base, and a lot of effort to keep the efficiency (because having more indirections would suck)

somethingelseentirely commented 6 months ago

Perhaps I still don't understand what you are doing, but it sounded like you want the data structure to serialize into an existing format (you mentioned jpg as an example for a specced format, and I assume you mean a database index format).

No I'm just trying to (de)serialize a bunch of wavelet matrices, but for my own to-be-spec-ed database index format similar to HDT. The JPG analogy was meant to clarify that it is meant as a standardized interchange format, and not an ad-hoc format that can change with every release, but it's custom nevertheless.

Obviously, the support data structures are proprietary and thus do not adhere to any existing format specification, hence my concerns.

Fair point, but I feel like I would have that problem with any implementation, even if I managed to find the most textbook DArray implementation out there. The alternative would be to build my own, but then I'd be having the same problem on top of reinventing the wheel. So it's easier to just go with something that works and then write that into a spec.

I figured, given that I'll have this problem regardless, that I'm just gonna go with the library made in Germanyℒ️ (I'm from Bremen), then it'll at least look nice if we ever co-author a paper, and might give Rust more street-cred in Germany. 🀣

This is not the reason why rkyv suggests this pattern. Zero-copy can only be achieved if the data structure is immutable, contains no pointers, and contains no allocating data structures (like Vec).

My point was that from an API/user perspective RsVec behaves like an archived version of BitVec, regardless of its current internal implementation (which would require an additional archival step to get rid of the absolute pointers and to coalesce the allocations).

A Vec cannot be constructed by zero-copy deserialization

Yeah when I said "munged together into a single contiguous allocation" I meant that the reworked implementation would get rid of the Vecs and manage the memory itself.

Finally, the largest hurdle that ArchivedRsVec tries to overcome is, again, endianness:

I was assuming (based on a quick glance and the way other implementations work) that the implementation uses SIMD only to popcnt the bytes of the lowest block/chunk level, which shouldn't be affected by endianness? In which case doing the conversion on the fly for the other levels should be feasible, but I should probably dig into the code to see where problems might arise.
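
(Quick sanity check that popcount indeed doesn't care about byte order:)

fn main() {
    // swapping the bytes of a word permutes the set bits but doesn't change how many there are
    let word: u64 = 0x0123_4567_89ab_cdef;
    assert_eq!(word.count_ones(), word.swap_bytes().count_ones());
}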

Edit: I am not saying it is impossible btw, I am just saying it involves major refactoring of the entire code base, and a lot of effort to keep the efficiency (because having more indirections would suck)

Sure sure, my hunch is that it might be a wash: some perf improvements from removing vector bounds checks, some slowdown from having to do on-the-fly endianness conversion. But I'll probably know more once I've properly read through the entire codebase. I'm aware that it would take some major rework. I'm also definitely not asking you to do it, I'd do it myself as part of the wavelet matrix. πŸ˜„

Cydhra commented 6 months ago

I pushed some changes to a new branch dev_zero_copy.

All functionality of RsVec is now moved to traits that abstract over the data layout. This way it should be possible to generate an ArchivedRsVec (gated behind a crate-feature) that also implements the trait.

As it stands, this constitutes a breaking change, because you now have to import the trait to access methods on RsVec. This could be fixed by renaming the trait methods and then providing convenience functions in RsVec.
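
To illustrate the shape (a simplified sketch, not the exact traits from the branch):

/// Read access to the underlying words and support structures, regardless of
/// whether they live in a Vec, a Box, or an archived/mapped byte buffer.
pub trait RsVecLayout {
    fn len(&self) -> usize;
    fn word(&self, index: usize) -> u64;
    // accessors for blocks / super-blocks / select-blocks omitted here
}

/// The query logic, written once against the layout abstraction, so both
/// RsVec and a future ArchivedRsVec can provide it.
pub trait RankSelect: RsVecLayout {
    /// number of 0-bits among the first `pos` bits
    fn rank0(&self, pos: usize) -> usize;
    /// number of 1-bits among the first `pos` bits
    fn rank1(&self, pos: usize) -> usize;
    /// position of the `rank`-th 0-bit
    fn select0(&self, rank: usize) -> usize;
    /// position of the `rank`-th 1-bit
    fn select1(&self, rank: usize) -> usize;
}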

somethingelseentirely commented 6 months ago

Awesome, I'll check it out asap!

somethingelseentirely commented 2 months ago

I just started some work on a minimalist approach to this, using a combination of techniques from minibytes to allow for extensible Bytes definitions (to allow for byte sources like mmaped files), and zerocopy to be able to use Vec<T> as a zerocopy byte source when T itself conforms to the FromBytes/AsBytes traits from zerocopy.

This would still require some form of archive format (i.e. a header with length and layer count), but most of the data (data/blocks/...) can just be sliced from the binary data source.

The main reason I'm writing this comment though is that I've noticed that

pub struct WaveletMatrix {
    data: Box<[RsVec]>,
    bits_per_element: u16,
}

appears to hold the information about the number of layers redundantly. Isn't bits_per_element == data.len(), in which case it would be sufficient to store only the data smart pointer?

Is this an atavism from a version in which data was only a single RsVec like most other implementations that I've seen? In which case my guess would be that the split is because of performance as it allows you to remove some counting ops in the new layers?

Really awesome implementation btw, I'm super impressed. It smells like you were able to justify working on it as part of your PhD πŸ˜†

Cydhra commented 2 months ago

Oh damn, didn't even realize that. I was too occupied not putting num_elements into the struct to notice that bits_per_element is also not required. But it actually cancels out, because I also forgot to add serde support to the WaveletMatrix, so I can still change it without breaking anything.

data was only a single RsVec like most other implementations that I've seen?

Really? I thought about doing that, but it seemed counterproductive to me, as it would complicate the rank-query patterns with no real benefit. The layers are still so far apart that cache efficiency isn't increased. And the small memory gains due to packed RS-support structs are negligible.

It smells like you were able to justify working on it as part of your PhD

No, I handed in my master thesis, and therefore had some free time. This ends Monday, though, so let's see if I can find that justification eventually.

somethingelseentirely commented 2 months ago

as it would complicate the rank-query patterns with no real benefit

You're right, I just checked again and I remembered wrong πŸ˜…; it's actually only the terminusdb implementation that does it like that.

I think the reason they do it is because it makes loading from memory a bit more straightforward, but I agree that that is probably not a worthy tradeoff.

I handed in my master thesis

Congratulations! βœ¨πŸŽ‰ πŸ»πŸŽ‰βœ¨

somethingelseentirely commented 2 months ago

Whoops, sorry for fat-finger closing the issue πŸ™‡β€β™‚οΈ. I've created a pull request that removes bits_per_element.

somethingelseentirely commented 2 months ago

I've come to the conclusion that APIs should in general return usize*

*but use specific sizes in the structs πŸ˜†

Could we pin the SuperBlockDescriptor and SelectSuperBlockDescriptor fields down from usize to concrete integer types?

SuperBlockDescriptor zeroes should also fit into a u16 with the current super-block size?

const SUPER_BLOCK_SIZE: usize = 1 << 13;

The SelectSuperBlockDescriptor is a bit trickier: $2^{32+13}$ bits β‰ˆ 4.4 terabytes, which is right on the edge of being large enough to cover all real-world consecutive runs of 0s or 1s, but also small enough that one could construct a theoretical counterexample on a supercomputer or in an mmaped file. So it might be better to err on the side of caution and use a u64?

The SelectSuperBlockDescriptor is also the only thing that's giving me a bit of a headache right now in terms of serialisation because it's the only thing where the size is dependent on the data distribution and not only the length of the bitvector. πŸ˜…

Lastly the fields len, rank0, rank1 of RsVec should all be u64.

Btw this should read

/// The indices do not point into the bit-vector, but into the ~~select~~super-block vector.

right?

Cydhra commented 2 months ago

SuperBlockDescriptor zeroes should also fit into a u16 with the current super-block size?

No, the counter isn't reset between super-blocks; otherwise you'd need to generate a prefix sum over all super-blocks before the query.

So it might be better to err on the side of caution and use a u64?

Lastly the fields len, rank0, rank1 of RsVec should all be u64.

Yes. I'll add it to #9.

Btw this should read...

Yes

The SelectSuperBlockDescriptor is also the only thing that's giving me a bit of a headache right now in terms of serialisation because it's the only thing where the size is dependent on the data distribution and not only the length of the bitvector.

Can't you just add a value to the file header that tells the loading code how big the slice is?

somethingelseentirely commented 2 months ago

No, the counter isn't reset between super-blocks; otherwise you'd need to generate a prefix sum over all super-blocks before the query.

Ah gotcha, sorry u64 it is then πŸ˜„

Yes

I'll create a PR

Can't you just add a value to the file header that tells the loading code how big the slice is?

True, it's just something that needs to be stored per layer, so Header | Blocks becomes something like Header | PerLevelHeader | Blocks, not really a big deal though πŸ˜….

Btw which memory layout do you prefer? Storing the Block/SuperBlock/...s so that each level is stored consecutively, [([B], [SB], [SSB])], or by block type so that each type is stored consecutively regardless of level origin, ([B], [SB], [SSB])? My hunch is that there is more locality level-wise for small wavelet matrices? But if you had something like slab allocation planned for the future, it might make sense to have things with the same size in the same allocation?

Cydhra commented 2 months ago

When you load a WaveletMatrix struct, the Box and Vec members contain their sizes. How do the zerocopy crates load a vec? More specifically, how is the vec pointer recovered? Or is the archived vec contiguous with its header, without indirection?

somethingelseentirely commented 2 months ago

Maybe this requires some preliminaries ^^'

I can only speak for the way I would do it with the anybytes library I adapted from minibytes. I'm personally not a fan of pointers in zero-copy data (rkyv for example stores relative pointers, from what I understand), because they need to be recursively validated, so validation is potentially $O(n)$. They are also generally expensive.

What I do try to do is store the data as flat and with as little "dynamic" data as possible, using zerocopy datatypes that don't have invalid bit-patterns. You end up with a format that is not generic like rkyv or serde, but imho much cleaner/nicer. Structs that implement zerocopy::FromBytes/zerocopy::AsBytes are really nice in this regard. You only need a correctly aligned, correctly sized byte slice and parsing becomes a (safe) pointer transmute. You might get a junk value if you read junk, but it will never be an invalid or unsafe value. Similarly zerocopy allows you to convert &[u8] to &[T] based on the length of the byte slice.
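
For illustration, the pattern with zerocopy 0.7's derive feature (Block here is a made-up record, not vers' BlockDescriptor):

use zerocopy::{AsBytes, FromBytes, FromZeroes};

/// A fixed-layout record without padding or invalid bit patterns, so it can be
/// viewed as bytes and back without copying.
#[derive(FromZeroes, FromBytes, AsBytes, Clone, Copy, Debug, PartialEq)]
#[repr(C)]
struct Block {
    zeros: u32,
    ones: u32,
}

fn main() {
    let blocks = vec![Block { zeros: 3, ones: 5 }, Block { zeros: 8, ones: 0 }];

    // view the whole Vec as raw bytes, e.g. to hand them to a byte owner or write them to disk
    let raw: &[u8] = blocks.as_slice().as_bytes();

    // ...and reinterpret a correctly sized and aligned byte slice as &[Block] again
    let parsed: &[Block] = Block::slice_from(raw).expect("aligned and correctly sized");
    assert_eq!(parsed, blocks.as_slice());
}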

Note: What the anybytes, minibytes and ownedbytes crates all do is store an "owner" of the byte source in an Arc, and then keep a slice ref/pointer next to that owner in a smart-pointer-like struct. The source needs to have a stable address that it derefs to, but this way you get something that has all the advantages of an (immutable) Arc, with the performance of a reference.

What I did there with the PackedSlice is simply wrap one of these Bytes objects, annotated with some zerocopy type via PhantomData, to allow for convenient and safe conversion to a reference of that type. So you can safely go from a vector of the thing (interpreted as a byte owner), to the bytes, and back.
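
In (heavily simplified, std-only) code the idea is roughly this; the real PackedSlice sits on top of anybytes::Bytes instead of a bare Arc<[u8]>, so it can also be backed by an mmap:

use std::marker::PhantomData;
use std::ops::Deref;
use std::sync::Arc;

use zerocopy::FromBytes;

pub struct PackedSlice<T> {
    bytes: Arc<[u8]>,      // shared, immutable byte source
    _element: PhantomData<T>,
}

impl<T: FromBytes> PackedSlice<T> {
    /// Validate size and alignment once, so later accesses can rely on it.
    pub fn new(bytes: Arc<[u8]>) -> Option<Self> {
        T::slice_from(&bytes)?;
        Some(PackedSlice { bytes, _element: PhantomData })
    }
}

impl<T: FromBytes> Deref for PackedSlice<T> {
    type Target = [T];

    fn deref(&self) -> &[T] {
        // invariant established in `new`
        T::slice_from(&self.bytes).unwrap()
    }
}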

So: in the case of the wavelet matrix I would store the number of bits and the number of layers in a header, followed by a slice of select-super-block counts (one per layer), and then the blocks in one of the above orders. When parsing, one would calculate the number of bytes each slice should take up from the previously read header information.
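
Sketched out, such a header could look like this (field names are placeholders, not a spec; again zerocopy 0.7-style API):

use zerocopy::{AsBytes, FromBytes, FromZeroes};

/// Fixed-size file header; everything after it is sliced from the byte source
/// using lengths derived from these fields.
#[derive(FromZeroes, FromBytes, AsBytes, Clone, Copy, Debug)]
#[repr(C)]
struct WaveletMatrixHeader {
    /// number of bits in each layer's bit vector
    bits_per_layer: u64,
    /// number of layers (= bits per element)
    layers: u64,
}

fn read_header(bytes: &[u8]) -> Option<WaveletMatrixHeader> {
    // copies only the 16 header bytes; the bulk data behind it stays zero-copy
    WaveletMatrixHeader::read_from_prefix(bytes)
}

// the header would be followed by `layers` select-super-block counts and then
// the per-layer data/blocks slices in one of the orders discussed above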

Edit: The crux is, that the serialised and deserialised types differ. Just like you can cheaply convert a String to a &str, you can cheaply convert a Vec<T> to a PackedSlice<T>, but while the former are mutable, the latter is always immutable (because the source might not be a Vec, but a mmap after you've written the bytes to disk and read them back).

Cydhra commented 2 months ago

I just misread your question πŸ˜….

I don't plan on adding slab allocation or anything, because this isn't a dynamic data structure, and thus it shouldn't be worth messing with allocation here.

The only thing you should keep in mind is that I do plan on merging blocks and super_blocks into a single structure and vector (#9), since you only need to access the blocks inside one super-block, and thus you can increase locality that way.

I also don't think changing the layout here gets any improvements in locality, because for that, multiple block-lists would have to share cache lines, which shouldn't happen for non-trivial vectors. So for simplicity, [([B], [SB], [SSB])] sounds better.

somethingelseentirely commented 2 months ago

Gotcha πŸ˜„!

I do see how that's potentially nice for rank, but wouldn't it significantly reduce the locality of select operations, though, because there you have to binary-search and then linearly search through the super_blocks?

Cydhra commented 2 months ago

Good point, that needs to be evaluated during implementation

somethingelseentirely commented 2 months ago

I did some thinking. Maybe it's not a good idea to try to provide the zero-copy serialization/deserialization as a pre-packaged, one-size-fits-all solution. Someone designing an on-disk file format will have their own thoughts and insights about the layout. For example, if you store multiple wavelet matrices with the same alphabet but different orderings, you can share that information between them, whereas an implementation provided by us would have to replicate it into every serialised instance.

So I think it might be better to make the internals part of the public interface, and expose constructors that can take something like the generic read-only PackedSlice<...>s.

I mean, the cool thing about succinct vectors is that they are somewhat simple data structures, and somewhat canonical by construction (modulo hyperparameters like super-block ranges).

Cydhra commented 2 months ago

So you suggest just a from-raw-parts layer, maybe some conversions with common libs, and then let downstream crates handle it?

somethingelseentirely commented 2 months ago

Yeah exactly. Focusing on making the raw parts stable and zerocopy seems to be the better use of time, compared to trying to create a full collection type that's just gonna be part of some larger thing in a downstream crate anyways. I mean, low-level stuff like succinct data structures doesn't really exist outside of some specialised application that composes it into something useful, I guess.
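
Concretely, I'm imagining something along these lines (simplified to a single support slice; names are placeholders, not a concrete API proposal):

/// Stand-in for the real descriptor types.
pub struct BlockDescriptor {
    pub zeros: u16,
}

pub struct RsVec {
    data: Box<[u64]>,
    len: usize,
    blocks: Box<[BlockDescriptor]>,
}

impl RsVec {
    /// Reassemble a vector from parts that a downstream format stored itself.
    /// The caller is responsible for passing parts produced by a matching
    /// version of the construction code.
    pub fn from_raw_parts(data: Box<[u64]>, len: usize, blocks: Box<[BlockDescriptor]>) -> Self {
        RsVec { data, len, blocks }
    }

    /// Hand the parts back out so the caller can lay them out in its own format.
    pub fn into_raw_parts(self) -> (Box<[u64]>, usize, Box<[BlockDescriptor]>) {
        (self.data, self.len, self.blocks)
    }
}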

Cydhra commented 2 months ago

Yeah, that actually sounds reasonable, and gets rid of a lot of inelegant decision-making. And the raw-parts are pseudo-stable anyway, since I don't want to break serde compatibility between versions.

This also means that it's probably possible to implement the necessary parts of this issue without a major version bump; they'll just break alongside everything else when I do one.

somethingelseentirely commented 2 months ago

Maybe. It requires changing the collections that own the bytes holding the data/blocks from Vec to something with shared ownership of the (read-only) bytes. So:

pub struct RsVec {
    data: Vec<u64>,
    len: usize,
    blocks: Vec<BlockDescriptor>,
    super_blocks: Vec<SuperBlockDescriptor>,
    select_blocks: Vec<SelectSuperBlockDescriptor>,
    rank0: usize,
    rank1: usize,
}

Becomes

pub struct RsVec {
    len: usize,
    pub(crate) rank0: usize,
    pub(crate) rank1: usize,
    data: PackedSlice<u64>,
    blocks: PackedSlice<BlockDescriptor>,
    super_blocks: PackedSlice<SuperBlockDescriptor>,
    select_blocks: PackedSlice<SelectSuperBlockDescriptor>,
}

Or any other ByteOwningThing instead of PackedSlice that you prefer (I'm willing to write the missing docs and bump the anybytes crate to 1.0.0 if you only want non-pre-release dependencies).

I think it should serialise similarly with Serde, but it would probably require a custom (de)serializer implementation that creates a byte owning Vec and casts that to a PackedSlice.

somethingelseentirely commented 2 months ago

It also just occurred to me that zerocopy and serde could also just be mutually exclusive features (if we don't want to write custom (de)serializers, or it turns out that it's not possible for whatever reason).

Cydhra commented 2 months ago

I think it would be best if zerocopy/anybytes was an optional feature that replaces the struct. This way everything stays normal and optimized for everyone who doesn't need this feature, and we only lose compatibility between the serde version and the packed version.

somethingelseentirely commented 2 months ago

Yeah I think that's fair.

I did some preliminary benchmarks, and saw no impact on performance (the PackedSlice is essentially just a regular fat pointer, so not any different from a Vec and the safe transmute is also zero cost*), but the #[repr(C)] on blocks alone might make a difference in optimisability, so I can't guarantee that it won't have an impact in every case.

One interesting side effect, wrt. performance, of using the PackedSlices is that cloning RsVecs actually becomes super duper cheap, but I'm not sure if that has any merit πŸ˜†

I also just realized that there is a third option: me implementing serde for PackedSlice. I think serde doesn't distinguish between sequence types in the serialized output, so such an implementation might automatically be backwards compatible with the current Vec-based approach.
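
Roughly like this, building on the PackedSlice sketch from earlier (from_vec and as_slice are hypothetical helpers, and whether this really round-trips against the current Vec-based format would still need testing):

use serde::{Deserialize, Deserializer, Serialize, Serializer};
use zerocopy::{AsBytes, FromBytes};

impl<T> Serialize for PackedSlice<T>
where
    T: FromBytes + Serialize,
{
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        // emit a plain serde sequence, just like Vec<T> does
        serializer.collect_seq(self.as_slice().iter())
    }
}

impl<'de, T> Deserialize<'de> for PackedSlice<T>
where
    T: FromBytes + AsBytes + Deserialize<'de>,
{
    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        // read into an owning Vec first, then hand it to the byte-owning wrapper
        let vec: Vec<T> = Vec::deserialize(deserializer)?;
        Ok(PackedSlice::from_vec(vec))
    }
}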

I've created a draft pull request to make discussions about the code easier at #14. Feel free to take it for a benchmark spin. I currently don't have access to a large x86, only an aarch64.

ARM SIMD support is also on my wishlist but that's a different issue πŸ˜‰

Cydhra commented 2 months ago

Annoyingly I check that invariant of the PackedSlice at creation time so subsequent dereferences could just use an unsafe path relying on that invariant

Wouldn't that induce !Unpin?

somethingelseentirely commented 2 months ago

Yes, definitely, but the way anybytes::Bytes works is that it takes ownership of whatever implements unsafe trait ByteOwner in an Arc. It then calls ByteOwner::as_bytes() -> &[u8] on the thing, with the required invariant that the returned address must be stable (Pin in a sense, but requiring that as a bound would only serve as documentation, since [u8] itself is trivially Unpin; it's the reason why the trait is unsafe, though).

Since PackedSlice is immutable, because Bytes is immutable, we know that, for example, an encapsulated Vec<T> where T: AsBytes + FromBytes will never grow, and thus never invalidate its reference.
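
In (simplified) code, the contract looks roughly like this:

/// Simplified version of the trait described above (the real one lives in anybytes):
/// implementors promise that the returned slice stays at a stable address for as
/// long as the owner is alive, which is why the trait is unsafe to implement.
unsafe trait ByteOwner {
    fn as_bytes(&self) -> &[u8];
}

// A Vec<u8> qualifies because Bytes keeps it behind an immutable Arc, so its
// heap buffer can never be grown or otherwise reallocated after wrapping.
unsafe impl ByteOwner for Vec<u8> {
    fn as_bytes(&self) -> &[u8] {
        self.as_slice()
    }
}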

somethingelseentirely commented 2 months ago

I just stumbled upon sux by chance, and their epserde approach is pretty much what we've also decided to use, so there's some prior art there.

I've looked at their code and it's less flexible and more cumbersome than what we have imho, so it's also nice that we can make some improvements in this space.

From Friday on I'll be in Sicily for two weeks to harvest some olives, so I might find the time to push this a bit 😁

somethingelseentirely commented 2 months ago

Wouldn't that induce !Unpin?

Good thing we had this convo! It caused me to take a closer look at the code I inherited from minibytes, and I actually discovered a use-after-free scenario where you could get a mutable pointer to an owning Vec and grow it so that the slice no longer points to valid memory. This is fixed in the new 0.6.0 release, which I just pushed (it also includes a much cleaner downcast and downgrade/upgrade API).