Open Oppen opened 3 years ago
I guess that you want the cardinality of the union of B1, B2 and B3 together... You could do it without materialization: just iterate through the values in the containers. It is going to be in time loglinear with respect to (card(B1) + card (B2) + card(B3)). But an iterator approach is going to be slow (high constant) so, in practice, materializing intermediate results will be faster. However, you can always do it with, at most, one materialized container per key... especially in lazy mode where you tend to materialize bitmap containers. With enough care, you can reuse the same bitmap containers. It is entirely reasonable.
I wonder if we implemented something of the sort... I will check and report back if we did.
Intersections are another beast. You stand to gain much less by being clever... because the intersection between many bitmaps tends to be tiny compared to the original bitmaps (of course, it is data dependent...).
I can't find an implementation right (in Java or Go).
It is entirely reasonable.
Indeed, that was my reasoning. I didn't think through the intersection part being tinier. I guess worst case is a set of dense intersections where the resulting bitmap may still be big, but deciding whether that's likely would require domain knowledge I think. In the general case, I guess the lazy intersection should save enough memory without sacrificing performance, right?
I think that the motivation for the fast intersection function should be checked with benchmarks.
I'm planning on implementing a feature to do this but I'd like early feedback on the design.
The idea is to use a heap such as in the
roaring_bitmap_or_many_heap
implementation to get pointers to all the containers sorted by key. This way I can compute unions of all containers of a given key in a single bitset container (since the contents will be discarded once I get its cardinality), so the memory use would be about8k + sizeof(auxilliary_struct)*N
, whereN
is the number of containers in the bitmaps passed to my function, andauxilliary_struct
would be some structure with enough data to keep progress (most likely a pointer to a container, but I'll know better when I start coding). Another option is forauxiliary_struct
to keep the lowest unprocessed key for sorting and the pointer to the bitmap, so our heap doesn't need to fit all the containers at once but keeps re-adding elements to the queue if they weren't exhausted. Time complexity should be more or less the same as with an intermediate bitmap for union, but for bigger bitmaps it should save some space.The same can be done for intersection.