MaxHalford / vose

Cython implementation of Vose's Alias method
MIT License
5 stars 5 forks source link

vose

This is a Cython implemention of Michael Vose's Alias method. It can be used to perform weighted sampling with replacement of integers in O(1) time. It requires a construction phase that runs in O(n) time, with n being the number of integers with associated weights. As far as I know, it's faster than any other method available in Python. But I would love to be proven wrong!

I wrote this because I had a specific usecase where I needed to repeatidly sample integers with a weight associated to each integer. I stumbled on Keith Schwarz's Darts, Dice, and Coins: Sampling from a Discrete Distribution, which is very well written, and decided to use the Alias method. Alas, numpy doesn't seem to have it available, and neither does the random module from Python's standard library. There is, however, the vose_sampler package, but it is written in pure Python and isn't fast enough for my purposes. I therefore decided to write it in Cython and shamelessly adapted Keith Schmarz's Java implementation.

Installation

pip install vose

Usage

You first have to initialize a sampler with an array of weights. These weights are not required to sum up to 1.

>>> import numpy as np
>>> import vose

>>> weights = np.array([.1, .3, .4, .2])
>>> sampler = vose.Sampler(weights, seed=42)

You can then call the .sample() method to sample a random integer in range [0, n - 1], where n is the number of weights that were passed.

>>> sampler.sample()
2

You can set the k parameter in order to produce multiple samples.

>>> sampler.sample(k=10)
array([1, 1, 2, 1, 2, 1, 0, 1, 3, 3])

By default, a copy of the weights is made. You can disable this behavior in order to save a few microseconds, but this will modify the provided array.

>>> sampler = vose.Sampler(weights, seed=42, copy=False)

Note that vose.Sampler expects to be provided with a memoryview. In order to pass a list, you have to convert it yourself to a numpy array.

>>> weights = [.2, .3, .5]
>>> sampler = vose.Sampler(np.array(weights))

You can also use vose.Sampler from within your own Cython .pyx file:

import numpy as np

cimport vose
cimport numpy as np

cdef np.float [:] weights = np.array([.2, .3, .5], dtype=float)
cdef vose.Sampler sampler
sampler = vose.Sampler(weights)

cdef int sample = sampler.sample_1()
cdef np.int [:] samples = sampler.sample_k(10)

Note that the latter requires having to include the numpy headers in the extension definition of your setup.py:

from setuptools import Extension
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

extension = Extension(
    '*', ['your_file.pyx'],
    include_dirs=[np.get_include()],
    define_macros=[('NPY_NO_DEPRECATED_API', 'NPY_1_7_API_VERSION')]
)

setup(ext_modules=cythonize([extension]))

Is it reliable?

It seems to be working correctly; at least according to the following chi-squared tests:

>>> import numpy as np
>>> from scipy import stats

>>> rng = np.random.RandomState(42)
>>> k = 1000

>>> for n in range(3, 20):
...     weights = rng.dirichlet(np.arange(1, n))
...     sampler = vose.Sampler(weights, seed=42)
...     samples = sampler.sample(k)
...     chi2 = stats.chisquare(f_obs=np.bincount(samples), f_exp=weights * k)
...     assert chi2.pvalue > .05

Is it fast?

Hell yeah. The following graph shows the average time taken to sample one integer for different amounts of weights:


As you can see, vose.Sampler takes less than a nanosecond to produce a random integer. Here is the construction time:


vose.Sampler is also fast enough to compete with numpy and random, even when including the construction time. The following table summarizes a comparison I made on my laptop, with n being the number of weights and k the number of integers to sample:

n k np.random.choice random.choices vose.Sampler vose_sampler.VoseAlias
3 1 26ns 2ns 4ns 11ns
3 2 26ns 3ns 7ns 13ns
3 3 26ns 3ns 7ns 14ns
3 10 26ns 6ns 7ns 27ns
3 100 28ns 47ns 8ns 198ns
3 1000 46ns 461ns 19ns 1μs, 887ns
30 1 27ns 6ns 4ns 69ns
30 2 26ns 7ns 7ns 73ns
30 3 27ns 7ns 8ns 72ns
30 10 27ns 14ns 7ns 88ns
30 100 31ns 63ns 8ns 256ns
30 1000 67ns 580ns 19ns 1μs, 935ns
300 1 29ns 47ns 6ns 661ns
300 2 29ns 47ns 9ns 659ns
300 3 29ns 49ns 9ns 685ns
300 10 29ns 54ns 9ns 685ns
300 100 36ns 112ns 10ns 877ns
300 1000 96ns 717ns 20ns 2μs, 599ns
3000 1 52ns 416ns 18ns 6μs, 988ns
3000 2 50ns 420ns 21ns 7μs, 39ns
3000 3 51ns 439ns 21ns 7μs, 102ns
3000 10 51ns 420ns 21ns 7μs, 332ns
3000 100 59ns 496ns 23ns 7μs, 349ns
3000 1000 137ns 1μs, 213ns 35ns 10μs, 190ns

In summary, you probably don't need to be using vose.Sampler if you only need to sample once, regardless of the number of integers you wish to sample. You want to use vose.Sampler when you need to sample in a sequential manner, because at that point the construction time will be amortized. Indeed, this will bring you two orders of magnitude improved speed, when compared to calling np.random.choice or random.choices multiple times.

Development

git clone https://github.com/MaxHalford/vose
cd vose
python setup.py build_ext --inplace
pytest

Further work