Chriscbr / thinset

A data structure for sparse sets of unsigned integers.
Apache License 2.0

Improve SparseMap/SparseSet space usage when keys are u32 or smaller #20

Open Chriscbr opened 7 months ago

Chriscbr commented 7 months ago

If you know the elements in your set will be "small" -- say, always between 0 and 2^32 -- the current implementation of SparseMap/SparseSet is space-inefficient, since the sparse array stores a Vec<usize> when a Vec<u32> would be sufficient.

One might even argue that 2^32 keys is more than sufficient for most use cases (like storing sets of IPv4 addresses), so this should be the default. The main downside is that once you commit to a smaller keyspace, like u32, that's a hard bound on the map's maximum capacity.

It's also possible you might want to use even smaller keyspaces of 2^16 or 2^8 (to store u16 or u8 values).
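To put rough numbers on the space difference, here is a minimal sketch (not part of the crate; `sparse_bytes` is a hypothetical helper): on a 64-bit target, a sparse array of usize entries takes twice the memory of a u32 one, and u16/u8 shrink it further.

```rust
use std::mem::size_of;

// Bytes used by a sparse array covering `universe` possible keys,
// for a given index type T.
fn sparse_bytes<T>(universe: usize) -> usize {
    universe * size_of::<T>()
}

fn main() {
    let universe = 1 << 20; // keys assumed to lie in 0..2^20
    // On a 64-bit target: 8 MiB for usize vs 4 MiB for u32.
    println!("usize sparse: {} bytes", sparse_bytes::<usize>(universe));
    println!("u32 sparse:   {} bytes", sparse_bytes::<u32>(universe));
    println!("u16 sparse:   {} bytes", sparse_bytes::<u16>(universe));
}
```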

What's the best way to support this?

thass0 commented 7 months ago

Are you talking about the number of elements in the set or the size of the set's elements?

Using smaller indices than usize if the size of the set's elements is also smaller than usize

In this case, using key-sized indices in sparse seems like a good idea. All keys are unique, so, if I'm not mistaken, it's impossible to store more than $2^{32}$ key-value pairs in a map with u32 keys (or keys of any other size smaller than usize). Thus, storing u32 indices in sparse is guaranteed to be sufficient to store an entry for each key. If this is the case you meant, I think locking in to a smaller keyspace is also not too bad, since that bound already exists due to the key-uniqueness guarantee.
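The argument can be sketched in code. This is a hypothetical, stripped-down set (not thinset's actual API): because u32 keys are unique, `dense` can never hold more than 2^32 entries, so u32 indices in `sparse` always suffice.

```rust
// Hypothetical sketch, not the crate's implementation.
struct SparseSetU32 {
    sparse: Vec<u32>, // sparse[key] = candidate index into `dense`
    dense: Vec<u32>,  // the keys currently in the set
}

impl SparseSetU32 {
    fn with_capacity(universe: usize) -> Self {
        SparseSetU32 { sparse: vec![0; universe], dense: Vec::new() }
    }

    fn insert(&mut self, key: u32) {
        if !self.contains(key) {
            // dense.len() <= number of distinct u32 keys <= 2^32,
            // so the index always fits in a u32.
            self.sparse[key as usize] = self.dense.len() as u32;
            self.dense.push(key);
        }
    }

    fn contains(&self, key: u32) -> bool {
        let i = self.sparse[key as usize] as usize;
        i < self.dense.len() && self.dense[i] == key
    }
}

fn main() {
    let mut s = SparseSetU32::with_capacity(16);
    s.insert(3);
    s.insert(7);
    assert!(s.contains(3) && s.contains(7) && !s.contains(5));
}
```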

Using smaller indices than usize if the number of elements in the set is expected to be small

Using indices in sparse that are even smaller than the key size would also be an interesting idea. However, it might create pretty severe lock-in later on when inserting more entries: in the worst case, the entire sparse array would need to be copied to use indices of a bigger size. For example, if sparse starts out as a Vec<u16>, it's conceivable that it would need to be resized to a Vec<u32> later on. I imagine doing so without changing the type of the set whose sparse is resized would be very difficult. I'm not sure this increase in complexity is worth it, given that we already resize sparse automatically depending on the number of entries.
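One way to picture that cost is an enum over index widths. This is purely illustrative (not a proposal beyond what's described above): every lookup pays a branch, and widening copies the whole array.

```rust
// Illustrative only: a sparse array that can widen its index type.
enum Sparse {
    U16(Vec<u16>),
    U32(Vec<u32>),
}

impl Sparse {
    // Every access pays a branch on the current width.
    fn get(&self, key: usize) -> usize {
        match self {
            Sparse::U16(v) => v[key] as usize,
            Sparse::U32(v) => v[key] as usize,
        }
    }

    // Would be called once dense grows past u16::MAX entries:
    // an O(n) copy of the entire sparse array.
    fn grow(self) -> Sparse {
        match self {
            Sparse::U16(v) => Sparse::U32(v.into_iter().map(u32::from).collect()),
            s @ Sparse::U32(_) => s,
        }
    }
}

fn main() {
    let s = Sparse::U16(vec![0, 5, 9]);
    let s = s.grow(); // widen u16 -> u32; contents are preserved
    assert_eq!(s.get(2), 9);
}
```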

Chriscbr commented 7 months ago

Thanks for clarifying. Yeah -- the first case (using smaller indices than usize if the size of the set's elements is also smaller than usize) was the one I had in mind, basically relying on the key uniqueness as you described quite well.

An option that keeps things simpler (without generics, etc.), at the sacrifice of some flexibility, could be to have only u32 and u64 implementations -- they could be named SparseSet and BigSparseSet for example, with corresponding map types. SparseSet's capacity would be restricted to u32::MAX, and BigSparseSet's capacity would be restricted to u64::MAX. Macros could be used to reduce code duplication.
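The macro approach might look roughly like this (a sketch using the names suggested above; the macro and its internals are hypothetical, not thinset's code):

```rust
// Generates a sparse-set type whose sparse and dense arrays both use
// the given index/key type. Hypothetical sketch only.
macro_rules! define_sparse_set {
    ($name:ident, $idx:ty) => {
        pub struct $name {
            sparse: Vec<$idx>,
            dense: Vec<$idx>,
        }

        impl $name {
            pub fn with_capacity(universe: usize) -> Self {
                $name { sparse: vec![0; universe], dense: Vec::new() }
            }

            pub fn insert(&mut self, key: $idx) {
                if !self.contains(key) {
                    self.sparse[key as usize] = self.dense.len() as $idx;
                    self.dense.push(key);
                }
            }

            pub fn contains(&self, key: $idx) -> bool {
                let i = self.sparse[key as usize] as usize;
                i < self.dense.len() && self.dense[i] == key
            }
        }
    };
}

define_sparse_set!(SparseSet, u32);    // capacity bounded by u32::MAX
define_sparse_set!(BigSparseSet, u64); // capacity bounded by u64::MAX

fn main() {
    let mut set = SparseSet::with_capacity(1 << 16);
    set.insert(12345u32);
    assert!(set.contains(12345));
}
```

The map types could be generated the same way, with the macro taking an extra value-type parameter.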