Open droundy opened 5 years ago
There is RoaringTreemap
provided to support u64
data by building a BTreeMap
of RoaringBitmap
s (this is one of the 3 64-bit extensions detailed in Nouveaux modèles d’index bitmap compressés à 64 bits).
It could be possible to have an implementation that switches between a basic RoaringBitmap
or RoaringTreemap
based on current architecture to support usize
, but I haven't thought through on what sort of issues that would have.
That sounds rather heavy. I gather this is not optimized in any way for small sets? My problem is that I anticipate that for every large N set, there will be an equally large number N of small sets, so wasting a few words on the small sets is liable to be devastating.
It's been a long time since I looked through that paper, but skimming it again they appear to claim similar density across all three variants for uniformly distributed values. There is a mention of testing the different variants on real datasets in the conclusion, but I don't know if that ever happened.
I wasn't thinking of different densities, but rather of the size of a set containing zero, one, or <10 elements. An empty BTreeSet by itself is rather large, and a BTreeSet with one element as I recall is pretty huge. I'm wondering at what number of elements your data structure could be considered compact.
Ah, you mean you’re actually going to have a large number of full sets around, I thought you were referencing the inner sets used to store each partition of the treemap.
I actually have no idea how large an almost empty BTreeMap
is, an empty one does not allocate any storage, but presumably there’s some relatively large initial allocation once it’s in use to avoid reallocation.
That’s where a Roaring2Level
implementation might win, it would be using just a couple of nested Vec
s for small numbers of elements. But I don’t think the roaring family of data structures as a whole are really optimised for small sets, they’re more targeting large but sparse sets where a bit of extra constant overhead doesn’t matter, it might make sense to have another variant where the entire set is just stored as a Vec<u64>
until it exceeds some threshold.
It's not just the allocated heap storage, but also the "stack" storage, which is likely also on the heap. Looks like an empty BTreeMap is 24 bytes. I expect adding a single element gains another order of magnitude.
This sounds great, but I'd love a container for u64 or usize data. Any chance that you'll come out with a variant to handle larger numbers?