This project is a performance analysis of implementations of Roaring Bitmaps. We will focus on one operation: the union between two roaring bitmaps. This takes place in the Scientific Methodology and Performance Evaluation (SMPE) course of the Grenoble Alpes University.
Roaring bitmaps are a hybrid data structure to represent sets of unsigned 32 bits integers. Their main purpose is to be more efficient, both in memory and in computation time, than classical set implementations like hash sets. They are also more versatile than bitsets, which should not be used when the elements are “too sparse”.
A roaring bitmap is made of a sorted array of “containers”. An element of a roaring bitmap will be represented in one of these containers, depending on its 16 higher order bits. The container will therefore hold the 16 lower order bits of the integer (thus, containers can be viewed as sets of unsigned 16 bits integers). The following figure gives a quick overview.
Three types of containers are used:
The type of the container is chosen depending on the number of elements it holds. We recall that there can be at most 2^16 such elements, since these are 16 bits integers.
The following figure gives an example of a roaring bitmap containing:
[0, 2000]
.[2^16, 2*2^16[
.[5*2^16+10, 5*2^16+27]
and all the numbers in [5*2^16+1024, 5*2^16+2048]
.For more details on roaring bitmaps, read http://arxiv.org/abs/1603.06549
The root of this work is the C file roaring_op.c. It can be compiled into an executable which takes as input several parameters (sizes, densities, optimizations), generates randomly two roaring bitmaps accordingly, computes their union and finally outputs the time spent to compute this union.
The Python scripts preliminary_runner.py and size_density_runner.py are used to automatize the experiments, with the Python files utils.py and samplers.py containing some common functions. One experiment consists in compiling both the library and the executable in the desired way, running roaring_op
with the desired parameters and saving the results in a CSV file.
Here, everything is done in the script python_size_density_runner.py, which also uses the files utils.py and samplers.py.
The idea is very similar to what we did for the C analysis, but simpler, since everything remains in Python.
Dependencies:
Two machines were used to get the results.
Broadwell:
Skylake:
Clone the repositories:
git clone --recursive https://github.com/Ezibenroc/roaring_analysis.git
To open the Jupyter notebooks:
jupyter notebook # then click on the desired notebook
Instructions to generate the result files are written in the two notebooks.
This is our first analysis of the C implementation. The aim is to find which factors have a significant impact on the performances.
A previous version of this analysis can be found here.
We have identified the different optimizations that have an impact on performances.
We will now analyze the performances of the C implementation of roaring bitmap unions for various sizes and densities.
A previous version of this analysis can be found here.
Here, we analyze the performances of the Cython and Python implementations of roaring bitmaps, as well as the built-in set.