Heuristic for a faster contains

LucaCappelletti94 commented 3 years ago

Hello and thank you for creating this library.

I wanted to ask you if the contains operation could be made faster by pre-computing while building the bitmap the minimum and maximum value and before executing the binary search first check if the given value is within the min-max range.

If this is reasonable I will proceed to implement it and send a pull request.

Best, Luca

richardstartin commented 3 years ago

@LucaCappelletti94 computing the first and last (i.e. min and max respectively) bit in a roaring bitmap is trivial and never involves a binary search. You need to find the first (last) 16 bit prefix, (random) access the first (last) container, and then get the first (last) bit from the container. For array containers and run containers, this is constant time. For bitmap containers it is a linear scan which can be vectorised.

However, you can avoid the binary search in more cases without knowing the first or last bit in the bitmap by comparing the high 16 bits of the input with the first and last prefixes mentioned above: if it’s outside that range, you know it’s not in the bitmap. If it equals either, you already know which container to check for the low bits. If it’s in between, you need to binary search the prefixes.

LucaCappelletti94 commented 3 years ago

Hi @richardstartin, thank you for your answer.

I haven't understood if this solution is already in place (in this or other implementations of RoaringBitmap).

If it isn't, do you believe it would be an improvement adding it?

richardstartin commented 3 years ago

@LucaCappelletti94 it depends. If you frequently call contains with values smaller than the first bit or greater than the last bit, this would clearly be beneficial. Otherwise not. I certainly wouldn't want to increase the size of the bitmap by 8 bytes to support this optimisation, but doing it in the high bits as mentioned above makes sense to me.

The Java implementation doesn't do this, I haven't read the source code in this implementation, perhaps @lemire knows if any other implementation does this. The Java and C implementations implement first and last as outlined above, which allow the caller to apply the heuristic.

LucaCappelletti94 commented 3 years ago

Yes, that would definitely be our use case (currently we are doing the mentioned check before execution the contains), so we thought to integrate it within the bitmap.

I'll wait for @lemire's feedback before proceeding with the implementation.

lemire commented 3 years ago

Yes, that would definitely be our use case (currently we are doing the mentioned check before execution the contains), so we thought to integrate it within the bitmap.

Well, in the contains , when you do the binary search, you can "cheaply" check, before your recurse, whether the value is in the range of the "high bits". E.g., if you are looking for x a sorted array "A". It is entirely reasonable for an implementation to first check that "x" is larger than or equal to "A[0]" and smaller than or equal to "A[-1]". This means that you can come up with a non-invasive, cheap check that precedes the binary search. It is certainly allowable. It probably translates into 2 or 3 lines of code?

It is unlikely to be very harmful in the worst case, and may help in some cases. Whether it helps your current use case, well, did you benchmark it?

If you can't measure a sizeable gain, then I would not bother with this optimization because you don't want to be adding code with no substantiated benefit.

LucaCappelletti94 commented 3 years ago

Hi @lemire, me and @zommiommy implemented the version you suggested and in our use case we passed from 108ms ± 209 μs to 92.5 ms ± 169 µs over 70 runs.

It seems like a sizeable gain to us since this is executed in one of the methods of the core loop of our codebase.

If you concur that what I have described is a sizeable gain we can proceed with a pull request.

lemire commented 3 years ago

I would agree that such a gain makes a few extra lines of code worth it.

richardstartin commented 3 years ago

@LucaCappelletti94 that's what I suggested above. In any case, you're missing a trick - when the 16 bit prefix of the input and first or last prefix of the bitmap match, you would perform a binary search when you have already located the right container.

LucaCappelletti94 commented 3 years ago

Hi @richardstartin, we are experimenting with the tweak you have rightly suggested and so far, possibly thought to the peculiarity of our use case (the bitmaps are quite big and it is unlikely to execute a contains of the first or last value) we are getting slightly worse performance when including the check you proposed. We will continue checking our code, possibly we are doing something wrong in the edit we are doing.

lemire commented 3 years ago

we are getting slightly worse performance when including the check you proposed

Introducing more branch paths can make the results worse.

But you should be concerned with the quality of your experiments. Make sure that you are using really large volumes. Processors have crazily good branch predictors and if your workload is too small, you could be getting bogus results. See Making Your Code Faster by Taming Branches.

LucaCappelletti94 commented 3 years ago

After executing benchmarks on different computers with different CPUs, we noticed that the speed up existed only in some models. In the others, the heuristic slightly slowed down the average performance (by 2-3%).

Since it does not seem to be a general speed up but specific for CPU model (and task), how would you prefer to proceed?

lemire commented 3 years ago

@LucaCappelletti94 Details may matter. This is such simple code that it should be possible to look at the assembly without going blind. If you give a sketch of the function you are testing, we ought to be able to look at it. This is important because if you do not know what the Rust compiler does with your heuristic, you might be working from an erroneous mental model.

Furthermore, if it is faster "contains" that you are after, it might pay to have a closer look at the current code and how it gets compiled. For example, does the "contains" call get inlined in your code? If not, ensuring that it gets inlined could make a huge difference.

Or, maybe, take a step back and look at how the contains is used. What is your high-level algorithm?

LucaCappelletti94 commented 3 years ago

We are implementing a fast version of Node2Vec, so our core (among other things) at every iteration, given two sorted vectors (whose size follows a geometric distribution, as most vectors are small and some are over hundred of thousands), has to compute the mask of elements in the first vector that appear also in the second vector and use the obtained mask to multiply a vector of weights.

We are using godbolt to better understand what the compiler is doing in the most compute-heavy methods and perf to count the number of branch miss-predictions.

We are trying to identify the best possible method to do this operation quickly and we are exploring, among other methods, RoaringBitmap. In particular, we are looking at this paper on Fast Set Intersection Algorithm for Sorted Sequences.

Have you already tackled this problem?

lemire commented 3 years ago

Hopefully, you are not computing the intersection between two bitmaps using "contains" calls?

Yes, I have worked on the intersection of sorted arrays... see for example...

SIMD Compression and the Intersection of Sorted Integers, Software: Practice and Experience 46 (6), 2016

More to the point, if you care about Roaring...

Roaring Bitmaps: Implementation of an Optimized Software Library Software: Practice and Experience 48 (4), 2018

In this last paper, there is a section (4.2) on how to optimize the intersections when using Roaring Bitmaps.

RoaringBitmap / roaring-rs

Heuristic for a faster contains #71