droundy / tinyset

Compact sets in rust
Apache License 2.0
53 stars 8 forks source link

Support sets of 8-bit and 16-bit integers #19

Open kytans opened 1 year ago

kytans commented 1 year ago

Support sets of 8-bit and 16-bit integers

droundy commented 1 year ago

These are already supported with Set64<u8> and Set64<u16>. If you are hoping for better optimized versions than that, it might be worth benchmarking the size and speed of what we've already got.

kytans commented 1 year ago

At least in the heap case, not having SetU16 and SetU8 seems like it would result in 2x-4x overhead, which is obviously not acceptable.

It seems like it should be possible to use a macro to generate all SetU#, or maybe even have a generic version for anything integer-like.

droundy commented 1 year ago

No, there wouldn't be overhead, tinyset is optimized for size and for small values. I could do better for u8, e.g. always store a 32-byte dense bitmap when forced onto the heap, but probably not a factor of two better in size.

A Set64<u8> that is forced to the heap would either hold a dense bitmap when an extra 16 bytes storing the maximum value and the number of elements, or would store a few u64 holding bitmaps in their most significant bits, and an offset in the least significant bits, plus a 16 bytes to hold the number of elements and the maximum value. Which it holds would depend on how many elements there are relative to the maximum value in the set. I'm not sure off the top of my head how big the result would be, which is why I suggested starting with a benchmark.

You could also do better when not on the heap (for u8 since Set64 can only store 7 values without going to the heap, but you could do a little better if you knew no values exceeded 255.

For u16 it's very hard to imagine improving matters with a custom implementation, depending on the pattern of numbers stored. To do better than Set64<u16> you'd need to store numbers that are spaced by more than 60 or so, so that the bitmaps mostly hold only a single element. One assumption on Set64 is that values will tend to be clumped. If they aren't clumped then we will indeed take 64 bits per element.

A macro to generate similar implementations for each integer type wouldn't be practical because you aren't going to be using your SetU8 on an 8-bit computer, so you'd still want 64-but bitmaps. SetU32 is primarily implemented to enable SetUsize to work efficiently on 32-bit computers.