Ezibenroc / roaring_analysis

MIT License
5 stars 1 forks source link

Roaring analysis

This project is a performance analysis of implementations of Roaring Bitmaps. We will focus on one operation: the union between two roaring bitmaps. This takes place in the Scientific Methodology and Performance Evaluation (SMPE) course of the Grenoble Alpes University.

About roaring bitmaps

Roaring bitmaps are a hybrid data structure to represent sets of unsigned 32 bits integers. Their main purpose is to be more efficient, both in memory and in computation time, than classical set implementations like hash sets. They are also more versatile than bitsets, which should not be used when the elements are “too sparse”.

A roaring bitmap is made of a sorted array of “containers”. An element of a roaring bitmap will be represented in one of these containers, depending on its 16 higher order bits. The container will therefore hold the 16 lower order bits of the integer (thus, containers can be viewed as sets of unsigned 16 bits integers). The following figure gives a quick overview.

overview of a roaring bitmap

Three types of containers are used:

The type of the container is chosen depending on the number of elements it holds. We recall that there can be at most 2^16 such elements, since these are 16 bits integers.

The following figure gives an example of a roaring bitmap containing:

example of roaring bitmap

For more details on roaring bitmaps, read http://arxiv.org/abs/1603.06549

About this work - C analysis

The root of this work is the C file roaring_op.c. It can be compiled into an executable which takes as input several parameters (sizes, densities, optimizations), generates randomly two roaring bitmaps accordingly, computes their union and finally outputs the time spent to compute this union.

The Python scripts preliminary_runner.py and size_density_runner.py are used to automatize the experiments, with the Python files utils.py and samplers.py containing some common functions. One experiment consists in compiling both the library and the executable in the desired way, running roaring_op with the desired parameters and saving the results in a CSV file.

About this work - Python analysis

Here, everything is done in the script python_size_density_runner.py, which also uses the files utils.py and samplers.py.

The idea is very similar to what we did for the C analysis, but simpler, since everything remains in Python.

Reproduce this work

Dependencies:

Two machines were used to get the results.

Clone the repositories:

git clone --recursive https://github.com/Ezibenroc/roaring_analysis.git

To open the Jupyter notebooks:

jupyter notebook # then click on the desired notebook

Instructions to generate the result files are written in the two notebooks.

Preliminary C Analysis

This is our first analysis of the C implementation. The aim is to find which factors have a significant impact on the performances.

A previous version of this analysis can be found here.

Size and density C analysis

We have identified the different optimizations that have an impact on performances.

We will now analyze the performances of the C implementation of roaring bitmap unions for various sizes and densities.

A previous version of this analysis can be found here.

Size and density Python analysis

Here, we analyze the performances of the Cython and Python implementations of roaring bitmaps, as well as the built-in set.