RoaringBitmap / roaring-rs

A better compressed bitset in Rust
https://docs.rs/roaring/
Apache License 2.0
762 stars 84 forks source link

Add version for u64 data #50

Open droundy opened 5 years ago

droundy commented 5 years ago

This sounds great, but I'd love a container for u64 or usize data. Any chance that you'll come out with a variant to handle larger numbers?

Nemo157 commented 5 years ago

There is RoaringTreemap provided to support u64 data by building a BTreeMap of RoaringBitmaps (this is one of the 3 64-bit extensions detailed in Nouveaux modèles d’index bitmap compressés à 64 bits).

It could be possible to have an implementation that switches between a basic RoaringBitmap or RoaringTreemap based on current architecture to support usize, but I haven't thought through on what sort of issues that would have.

droundy commented 5 years ago

That sounds rather heavy. I gather this is not optimized in any way for small sets? My problem is that I anticipate that for every large N set, there will be an equally large number N of small sets, so wasting a few words on the small sets is liable to be devastating.

Nemo157 commented 5 years ago

It's been a long time since I looked through that paper, but skimming it again they appear to claim similar density across all three variants for uniformly distributed values. There is a mention of testing the different variants on real datasets in the conclusion, but I don't know if that ever happened.

droundy commented 5 years ago

I wasn't thinking of different densities, but rather of the size of a set containing zero, one, or <10 elements. An empty BTreeSet by itself is rather large, and a BTreeSet with one element as I recall is pretty huge. I'm wondering at what number of elements your data structure could be considered compact.

Nemo157 commented 5 years ago

Ah, you mean you’re actually going to have a large number of full sets around, I thought you were referencing the inner sets used to store each partition of the treemap.

I actually have no idea how large an almost empty BTreeMap is, an empty one does not allocate any storage, but presumably there’s some relatively large initial allocation once it’s in use to avoid reallocation.

That’s where a Roaring2Level implementation might win, it would be using just a couple of nested Vecs for small numbers of elements. But I don’t think the roaring family of data structures as a whole are really optimised for small sets, they’re more targeting large but sparse sets where a bit of extra constant overhead doesn’t matter, it might make sense to have another variant where the entire set is just stored as a Vec<u64> until it exceeds some threshold.

droundy commented 5 years ago

It's not just the allocated heap storage, but also the "stack" storage, which is likely also on the heap. Looks like an empty BTreeMap is 24 bytes. I expect adding a single element gains another order of magnitude.