driveyard / driveyard

Tools for data-oriented programming
Apache License 2.0
58 stars 1 forks source link

Use a different format for columnar storage #4

Open pickfire opened 3 years ago

pickfire commented 3 years ago

Say, if there is a bool in struct. When columnar storage is used wouldn't it be more useful that we can make use of every single bit rather than using a byte for a bool? Rust compiler optimizes struct itself by rearranging the fields around and use niche where possible but in this case if we changed it to something like Vec rust won't be able to recognize this. Maybe we can have like

#[derive(Columns)]
struct player {
    #[columns(storage = BitField)]
    alive: bool,
}

So we can customize how it is being stored internally, rather than Vec<bool> (or something similar), we have BitField which is much more smaller and each item occupies a single bit.

Other than that, it also allows optimization like comparing the two bitfields and make merge or do stuff there, I think probably pandas or arrow uses this sort of optimization.

By the way, it may be cool to be able to combine rkyv https://github.com/djkoloski/rkyv @djkoloski with this as well but I wonder how that can be done.

rpjohnst commented 3 years ago

Thinking about this a bit, at the RawTable level, this should be possible with a minimal change: separate the element type (F in fn ptr<F>(&mut self, field: Field<T, F>) -> *mut F) from the capacity calculations. This should let ptr return, for example, a *mut u64 to capacity / 64 elements, rather than allocating space for capacity of every field's "actual" storage type.

I can imagine doing this in the derive macro (e.g. #[soak::elements(64)] storage: u64,), or alternatively with some kind of wrapper type (e.g. storage: Elements<64>(u64)) that with_capacity could pick up on. (Though, perhaps that would require specialization?)

A design note for anyone looking to implement this: I would prefer to avoid concepts like bitfields in this layer, but rather describe the memory layout used to implement bitfields (and potentially other formats) directly. So on that note, it might be worth collecting more formats to consider before picking an approach.

pickfire commented 3 years ago

This should let ptr return, for example, a *mut u64 to capacity / 64 elements, rather than allocating space for capacity of every field's "actual" storage type.

It should be usize ideally rather than specific size.

But I don't think wrapper type would be nice since that will remove the original structure and won't be usable without the wrapper type, so with the derive macro it can still work with a single type.

Other than bitfields, we could have one for String, rather than keeping the pointer and length. We have a pool of string, and then for each string we just keep the index for the pool and have an additional separate length. Example, v means index and ~ means total length so we can keep minimal stuff. So with this type of storage we can remove the deletion feature but makes it very good for storage and querying. Maybe?

v --- ptr
| aaoeuu | bueue | cauuo |
^        ^       ^       ~

Other than these two storages, I haven't thought about others.