Open LucaCappelletti94 opened 4 years ago
@LucaCappelletti94 computing the first and last (i.e. min and max respectively) bit in a roaring bitmap is trivial and never involves a binary search. You need to find the first (last) 16 bit prefix, (random) access the first (last) container, and then get the first (last) bit from the container. For array containers and run containers, this is constant time. For bitmap containers it is a linear scan which can be vectorised.
However, you can avoid the binary search in more cases without knowing the first or last bit in the bitmap by comparing the high 16 bits of the input with the first and last prefixes mentioned above: if it’s outside that range, you know it’s not in the bitmap. If it equals either, you already know which container to check for the low bits. If it’s in between, you need to binary search the prefixes.
Hi @richardstartin, thank you for your answer.
I haven't understood if this solution is already in place (in this or other implementations of RoaringBitmap).
If it isn't, do you believe it would be an improvement adding it?
@LucaCappelletti94 it depends. If you frequently call contains
with values smaller than the first bit or greater than the last bit, this would clearly be beneficial. Otherwise not. I certainly wouldn't want to increase the size of the bitmap by 8 bytes to support this optimisation, but doing it in the high bits as mentioned above makes sense to me.
The Java implementation doesn't do this, I haven't read the source code in this implementation, perhaps @lemire knows if any other implementation does this. The Java and C implementations implement first and last as outlined above, which allow the caller to apply the heuristic.
Yes, that would definitely be our use case (currently we are doing the mentioned check before execution the contains), so we thought to integrate it within the bitmap.
I'll wait for @lemire's feedback before proceeding with the implementation.
Yes, that would definitely be our use case (currently we are doing the mentioned check before execution the contains), so we thought to integrate it within the bitmap.
Well, in the contains
, when you do the binary search, you can "cheaply" check, before your recurse, whether the value is in the range of the "high bits". E.g., if you are looking for x a sorted array "A". It is entirely reasonable for an implementation to first check that "x" is larger than or equal to "A[0]" and smaller than or equal to "A[-1]". This means that you can come up with a non-invasive, cheap check that precedes the binary search. It is certainly allowable. It probably translates into 2 or 3 lines of code?
It is unlikely to be very harmful in the worst case, and may help in some cases. Whether it helps your current use case, well, did you benchmark it?
If you can't measure a sizeable gain, then I would not bother with this optimization because you don't want to be adding code with no substantiated benefit.
Hi @lemire, me and @zommiommy implemented the version you suggested and in our use case we passed from 108ms ± 209 μs
to 92.5 ms ± 169 µs
over 70 runs.
It seems like a sizeable gain to us since this is executed in one of the methods of the core loop of our codebase.
If you concur that what I have described is a sizeable gain we can proceed with a pull request.
I would agree that such a gain makes a few extra lines of code worth it.
@LucaCappelletti94 that's what I suggested above. In any case, you're missing a trick - when the 16 bit prefix of the input and first or last prefix of the bitmap match, you would perform a binary search when you have already located the right container.
Hi @richardstartin, we are experimenting with the tweak you have rightly suggested and so far, possibly thought to the peculiarity of our use case (the bitmaps are quite big and it is unlikely to execute a contains of the first or last value) we are getting slightly worse performance when including the check you proposed. We will continue checking our code, possibly we are doing something wrong in the edit we are doing.
we are getting slightly worse performance when including the check you proposed
Introducing more branch paths can make the results worse.
But you should be concerned with the quality of your experiments. Make sure that you are using really large volumes. Processors have crazily good branch predictors and if your workload is too small, you could be getting bogus results. See Making Your Code Faster by Taming Branches.
After executing benchmarks on different computers with different CPUs, we noticed that the speed up existed only in some models. In the others, the heuristic slightly slowed down the average performance (by 2-3%).
Since it does not seem to be a general speed up but specific for CPU model (and task), how would you prefer to proceed?
@LucaCappelletti94 Details may matter. This is such simple code that it should be possible to look at the assembly without going blind. If you give a sketch of the function you are testing, we ought to be able to look at it. This is important because if you do not know what the Rust compiler does with your heuristic, you might be working from an erroneous mental model.
Furthermore, if it is faster "contains" that you are after, it might pay to have a closer look at the current code and how it gets compiled. For example, does the "contains" call get inlined in your code? If not, ensuring that it gets inlined could make a huge difference.
Or, maybe, take a step back and look at how the contains is used. What is your high-level algorithm?
We are implementing a fast version of Node2Vec, so our core (among other things) at every iteration, given two sorted vectors (whose size follows a geometric distribution, as most vectors are small and some are over hundred of thousands), has to compute the mask of elements in the first vector that appear also in the second vector and use the obtained mask to multiply a vector of weights.
We are using godbolt to better understand what the compiler is doing in the most compute-heavy methods and perf to count the number of branch miss-predictions.
We are trying to identify the best possible method to do this operation quickly and we are exploring, among other methods, RoaringBitmap. In particular, we are looking at this paper on Fast Set Intersection Algorithm for Sorted Sequences.
Have you already tackled this problem?
Hopefully, you are not computing the intersection between two bitmaps using "contains" calls?
Yes, I have worked on the intersection of sorted arrays... see for example...
More to the point, if you care about Roaring...
In this last paper, there is a section (4.2) on how to optimize the intersections when using Roaring Bitmaps.
Hello and thank you for creating this library.
I wanted to ask you if the contains operation could be made faster by pre-computing while building the bitmap the minimum and maximum value and before executing the binary search first check if the given value is within the min-max range.
If this is reasonable I will proceed to implement it and send a pull request.
Best, Luca