OHDSI / bayes-bridge

Bayesian sparse regression with regularized shrinkage and conjugate gradient acceleration
https://bayes-bridge.readthedocs.io/en/latest/

Switch from python builtin random to numpy random functions #2

Closed chinandrew closed 2 years ago

chinandrew commented 2 years ago

Currently, Python's builtin random.random() is called for uniform random numbers, and normal variates are generated via a polar method implementation. Switching these to numpy's equivalent functions should improve performance and give slightly better statistical properties.

Unif: https://github.com/aki-nishimura/bayes-bridge/blob/master/bayesbridge/random/polya_gamma/polya_gamma.pyx#L24 https://github.com/aki-nishimura/bayes-bridge/blob/master/bayesbridge/random/tilted_stable/tilted_stable.pyx#L45

Polar Normal: https://github.com/aki-nishimura/bayes-bridge/blob/master/bayesbridge/random/polya_gamma/polya_gamma.pyx#L217 https://github.com/aki-nishimura/bayes-bridge/blob/master/bayesbridge/random/tilted_stable/tilted_stable.pyx#L328

Example of cython code that does this: https://github.com/zoj613/polyagamma/blob/main/polyagamma/_polyagamma.pyx
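For context, here is a minimal plain-Python sketch of the Marsaglia polar method (my paraphrase, not the repo's exact code; the actual implementations live in the .pyx files linked above):

import math
import random

def polar_standard_normal():
    # Rejection-sample a point in the unit disk, then transform it
    # into a standard normal draw (Marsaglia polar method).
    while True:
        u = 2.0 * random.random() - 1.0
        v = 2.0 * random.random() - 1.0
        s = u * u + v * v
        if 0.0 < s < 1.0:
            # v * sqrt(...) would give a second, independent draw;
            # a simple implementation can just discard it.
            return u * math.sqrt(-2.0 * math.log(s) / s)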

chinandrew commented 2 years ago

Here's a basic benchmark between the two:

generating 10000000 numbers, averaged over 20 runs (seconds)
np normal      21.66382449500088
builtin normal 27.002115658004186
np unif        20.510443096995004
builtin unif   7.54976275300578

Not sure if numpy's uniform is slower due to a different algorithm or implementation (or both). The normal is noticeably faster. The differences might be minor in the grand scheme of a big MCMC run, but if numpy has better statistical properties, that seems desirable.
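For reference, a sketch of the kind of per-call benchmark that produced the numbers above (my reconstruction, not the exact script; random.gauss stands in here for the builtin polar-method normal):

import random
import time
import numpy as np

N, N_RUNS = 10_000_000, 20
rng = np.random.default_rng()

def bench(label, draw):
    # Time N single draws, averaged over N_RUNS runs.
    total = 0.0
    for _ in range(N_RUNS):
        start = time.perf_counter()
        for _ in range(N):
            draw()
        total += time.perf_counter() - start
    print(label, total / N_RUNS)

bench("np normal      ", rng.standard_normal)
bench("builtin normal ", lambda: random.gauss(0.0, 1.0))
bench("np unif        ", rng.random)
bench("builtin unif   ", random.random)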

aki-nishimura commented 2 years ago

The Numpy functions in the above benchmark are not vectorized (nor cythonized), right? In that case, I think most of the cost comes from function call overhead. On my iMac, I see that Numpy is about 15 ~ 20 times faster when vectorized (i.e. without the function call overhead from Python-C interactions).

My idea was that numpy.random's C-API should improve the performance by getting rid of the Python-C interaction overhead: https://numpy.org/devdocs/reference/random/c-api.html

Performance gain may still be something like 10 ~ 20% though, judging from my crude eyeballing of prun with # cython: profile=True. (I'd be happy to learn if there is a more precise and systematic way of profiling and optimizing C-extensions.) So the improvement in speed and statistical properties would be nice, but perhaps not critical.
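A minimal sketch of that profiling setup (assuming the standard Cython profile directive plus Python's built-in cProfile; the gibbs call and its arguments are illustrative):

# At the top of the .pyx file, enable profiling of cythonized functions:
# cython: profile=True

# Then profile a run from Python:
import cProfile
import pstats

cProfile.run("bridge.gibbs(n_iter=10, n_burnin=0, seed=0)", "gibbs.prof")
pstats.Stats("gibbs.prof").sort_stats("cumulative").print_stats(20)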

chinandrew commented 2 years ago

These are cythonized, drawing individual numbers within a loop in C. I tried to copy the pattern used in the numpy examples and the polyagamma implementation you linked, with the bitgenerator instantiated once (there's some overhead in creating it) and then just running a loop of single samples within cython for testing.

Example for standard normal function:

from cpython.pycapsule cimport PyCapsule_GetPointer
from numpy.random import default_rng
from numpy.random cimport BitGenerator, bitgen_t
from numpy.random.c_distributions cimport random_standard_normal, random_standard_uniform

cdef const char* BITGEN_NAME = "BitGenerator"

cpdef double standard_normal(int i):
    cdef BitGenerator bitgenerator
    cdef bitgen_t* bitgen
    cdef double x = 0.0
    # Instantiate the bit generator and unwrap its capsule once, up front.
    bitgenerator = <BitGenerator>(default_rng(None).bit_generator)
    bitgen = <bitgen_t*>PyCapsule_GetPointer(bitgenerator.capsule, BITGEN_NAME)
    for _ in range(i):
        # Acquire the bit generator's lock around each single draw.
        with bitgenerator.lock:
            x = random_standard_normal(bitgen)
    return x

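(To compile a snippet like this, something along these lines should work, going by numpy's extending examples; the npyrandom library path is the part I am least sure about:)

import os
import numpy as np
from setuptools import Extension, setup
from Cython.Build import cythonize

# numpy ships a static npyrandom library that c_distributions links against.
lib_path = os.path.join(np.get_include(), '..', '..', 'random', 'lib')

ext = Extension(
    "standard_normal_test",
    sources=["standard_normal_test.pyx"],
    include_dirs=[np.get_include()],
    library_dirs=[lib_path],
    libraries=["npyrandom"],
)
setup(ext_modules=cythonize(ext))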
aki-nishimura commented 2 years ago

> running a loop of single samples

Meaning i = 1? (I guess so, because np.random.rand(10 ** 7) only takes 50 milliseconds on my iMac.) If so, I am not sure that benchmark is representative of the performance we would see when we embed random_standard_normal(bitgen) in the polyagamma sampler. annotate = True clearly shows Python-C interaction when random.random() is called, which I believe is the main reason sampling from uniforms/normals takes so much time, so getting rid of that should improve the performance?

chinandrew commented 2 years ago

Ah sorry, I wasn't clear: i was 10,000,000 in that example. The implementation will probably have the bitgenerator as a class attribute on the polyagamma object, plus a method to get a random value from it. My computer is also fast when using np.random.rand(10**7), but I benchmarked with a loop since the existing code calls the random functions within loops, e.g.

while not accepted:
    X = self.rand_unit_shape_invgauss(mean)

Agreed that this may not be too representative; I'll work on implementing it and running an actual test MCMC to see what happens.

chinandrew commented 2 years ago

Hm, initial tests don't show a huge difference when running the following basic example:

import numpy as np

from tests.helper import simulate_data

from bayesbridge.model import LinearModel, LogisticModel, CoxModel
from bayesbridge import BayesBridge, RegressionModel, RegressionCoefPrior

y, X, beta = simulate_data(model='logit', seed=0, n_obs=1000, n_pred=500)
model = RegressionModel(y, X, family='logit')
bridge_exp = .25

# Two samples should agree since the default prior is scale invariant.
prior = RegressionCoefPrior(
    bridge_exponent=bridge_exp,
    regularizing_slab_size=1.,
    _global_scale_parametrization='raw'
)
bridge = BayesBridge(model, prior)

mcmc_output = bridge.gibbs(
        n_iter=10, n_burnin=0,
        coef_sampler_type='cholesky',
        init={'apriori_coef_scale': .1},
        seed=0, n_status_update=0
)

Both implementations land between 8.5 and 10.5 seconds across a bunch of runs. Here's one example - I ran 3-4 without significant differences.

# 9.82 s ± 272 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # original implementation

# 9.57 s ± 213 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # new np implementation

Annotations show potential python interaction in the functions, but they show even more when I annotate the examples numpy provides :shrug:. I'll need to dig a bit deeper to find out what's going on.

A slightly hacky idea that came to mind, if for some reason we can't get this working, is to trade memory for performance: generate a large batch of random values once using np.random.rand(10 ** 7), hand out one value each time we need a new random number, and regenerate the batch as needed. It would cost a few MB of memory to store, but the generation only takes a few ms instead of a few seconds. Hopefully it doesn't come to that :crossed_fingers:
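A minimal sketch of that caching idea (hypothetical class, not existing bayes-bridge code):

import numpy as np

class CachedUniformSampler:
    # Generate uniforms in bulk with numpy, hand them out one at a
    # time, and refill the buffer when it runs out.
    def __init__(self, batch_size=10**7, seed=None):
        self._rng = np.random.default_rng(seed)
        self._batch_size = batch_size
        self._refill()

    def _refill(self):
        self._buffer = self._rng.random(self._batch_size)
        self._index = 0

    def next_unif(self):
        if self._index == self._batch_size:
            self._refill()
        value = self._buffer[self._index]
        self._index += 1
        return value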

aki-nishimura commented 2 years ago

> Annotations show potential python interaction in the functions, but they show even more when I annotate the examples numpy provides 🤷.

Ah, that's too bad. I thought Numpy's C interface is designed to get rid of Python-C interactions, but...??

The cost of the linear algebra for sampling the regression coefficients from their Gaussian conditional is $O(n p^2 + p^3)$, while the cost of sampling the $n$ polya-gamma auxiliary variables is $O(n)$. So the polya-gamma sampler cost becomes noticeable when $n$ is substantially larger than $p$, e.g. $n = 10^6$, $p = 10^3$. (Incidentally, we often still need regularization when the data is sparse.)

Would the new implementation make any difference if we just benchmark the polyagamma sampler? (Potentially useful piece of code for this purpose: https://github.com/aki-nishimura/bayes-bridge/blob/master/bayesbridge/random/polya_gamma/test_polyagamma.ipynb)

> A slightly hacky idea that came to mind, if for some reason we can't get this working, is to trade memory for performance: generate a large batch of random values once using np.random.rand(10 ** 7), hand out one value each time we need a new random number, and regenerate the batch as needed.

I too thought about that before. At the time I decided against it because keeping the suboptimal Unif[0, 1] random number generator seemed more maintainable and easier to optimize later. But maybe I am just being too OCD about hacky code.

Another possibility is simply to paste in the C code behind Numpy's random stream: https://www.pcg-random.org/. I don't know C/C++ super well, so I decided not to bother at the moment, but maybe it is not that difficult if you know C/C++.

chinandrew commented 2 years ago

I think I've identified why we're not seeing huge speedups with the individual numpy calls. When numpy generates an array of numbers at once, it holds the bitgenerator lock for the entire loop, e.g. something like

cdef double func(...):
    ...
    with bit_generator.lock:
        for i in range(n):
            # generate random number
        return

versus if we want individual numbers, we'd do something like this

cdef double func(...):
    ...
    with bit_generator.lock:
        return  # generate random number

# elsewhere
for i in range(n):
    func()

That hold/release seems to take up a bunch of time. Here's the difference just generating numbers, with random_test2 having the lock statement and random_test omitting it, both in the latter pattern from the above examples. The argument is how many numbers are generated:

In [3]: import random_test2, random_test

In [4]: %time random_test.random_normal_test(10**7)
CPU times: user 224 ms, sys: 261 µs, total: 224 ms
Wall time: 224 ms
Out[4]: -0.870312511920929

In [5]: %time random_test2.random_normal_test(10**7)
CPU times: user 2.25 s, sys: 0 ns, total: 2.25 s
Wall time: 2.24 s
Out[5]: -0.870312511920929
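For reference, a sketch of what random_test.pyx might contain (my reconstruction from the discussion; the fixed seed and function body are assumptions, and random_test2 would be identical but with the draw wrapped in with bitgenerator.lock:):

from cpython.pycapsule cimport PyCapsule_GetPointer
from numpy.random import default_rng
from numpy.random cimport BitGenerator, bitgen_t
from numpy.random.c_distributions cimport random_standard_normal

cdef const char* BITGEN_NAME = "BitGenerator"

cpdef double random_normal_test(int n):
    # Unwrap the bit generator's capsule once, then draw n singles.
    cdef BitGenerator bitgenerator = <BitGenerator>default_rng(0).bit_generator
    cdef bitgen_t* bitgen = <bitgen_t*>PyCapsule_GetPointer(
        bitgenerator.capsule, BITGEN_NAME)
    cdef double x = 0.0
    for _ in range(n):
        x = random_standard_normal(bitgen)  # no lock held (single thread)
    return x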

When I did a full MCMC run with y, X, beta = simulate_data(model='logit', seed=0, n_obs=10000, n_pred=50), the lock version took 6.5-7 minutes while the one without took 2.5 minutes (I did 2 runs each).

I don't entirely know what's going on at a low level (and numpy's lock documentation is excellent :facepalm:). Maybe it's ok if we're always on one thread, but to be safe we should probably stick with numpy's suggestion that "Code that generates values from a bit generator should hold the bit generator's lock." and keep it in?

There might be marginal gains from wrapping the while not accepted: loops with the lock, but that would require more involved code changes.

It might not be that messy to do the "hacky" approach as long as we keep the interfaces clean and abstract away the batch generating and caching behaviour from the other functions. I can poke around with it and see.

I also looked into using some of the C code, but I think I'm in the same boat as you and would need some time to figure out the C/C++ side. I haven't used it in ages, and even then only for the most basic of tasks.

chinandrew commented 2 years ago

A couple more benchmarks, generating 10^8 uniform numbers:

Without generator lock: 1 s ± 33.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With generator lock: 10.8 s ± 296 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With cached batches of size 10^7 (so np.random.random is called 10 times): 4.28 s ± 48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It scales pretty linearly, so 10**9 takes 10x longer across all configurations.

aki-nishimura commented 2 years ago

Really useful findings, thanks for looking into these! (I wonder why the caching approach takes so much more time than the no generator lock case, but oh well.)

Hm, the outputs with and without the lock are the same as far as you can tell, correct? I found this stackoverflow answer: https://stackoverflow.com/a/44520130/5721911, which points to this piece of code in an old version of numpy's random generator: https://github.com/numpy/numpy/blob/7cfec2403486456b52b525eccf7541e1562d9ab3/numpy/random/mtrand/mtrand.pyx#L163

Looking at the above makes me think that it is OK to skip the lock unless we start using nogil (which I currently don't do anywhere in the code). I can at least confirm that the BitGenerator.lock is just threading.Lock() for now: https://github.com/numpy/numpy/blob/main/numpy/random/bit_generator.pyx#L500

What do you think?

chinandrew commented 2 years ago

> Hm, the outputs with and without the lock are the same as far as you can tell, correct?

Yup, at least by comparing the first few thousand outputs.

> I found this stackoverflow answer: https://stackoverflow.com/a/44520130/5721911, which points to this piece of code in an old version of numpy's random generator: https://github.com/numpy/numpy/blob/7cfec2403486456b52b525eccf7541e1562d9ab3/numpy/random/mtrand/mtrand.pyx#L163
>
> Looking at the above makes me think that it is OK to skip the lock unless we start using nogil (which I currently don't do anywhere in the code). I can at least confirm that the BitGenerator.lock is just threading.Lock() for now: https://github.com/numpy/numpy/blob/main/numpy/random/bit_generator.pyx#L500

Interesting, thanks for the links. I think (though I'm no expert, so there may be unknown unknowns) that if we're always on a single thread it should be ok, as you said, but we have to be careful if that changes in the future. It depends on how likely we are to change it, the risk we forget, and how bad an issue it would be if someone hit it (e.g. being unable to reproduce runs with the same seed). The performance improvement certainly makes it appealing. I also wonder if we could somehow test this, like with some assertion that only one thread ever touches the generator. Regardless, we should document it well.

aki-nishimura commented 2 years ago

> It depends on how likely we are to change it

Given the substantial performance gain from just switching to numpy.random, we won't have to look into multi-threading any time soon, I think. (And I don't quite understand how to properly parallelize random number generation: https://bashtage.github.io/randomgen/parallel.html.) And these things will be better documented in the future.

> and how bad an issue it would be if someone hit it (e.g. being unable to reproduce runs with the same seed)

Hm, I was thinking that the output would be so wrong that we would find out pretty soon. Non-reproducibility (without an obvious catastrophe) would be more insidious.

> I also wonder if we could somehow test this, like with some assertion that only one thread ever touches the generator.

Great idea. The closest I have found is this: https://stackoverflow.com/a/25666624/5721911. We could check that the polyagamma sampler holds the GIL before it starts sampling; this won't prevent the GIL from being released within a function, but it may still be useful.
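A sketch of what such an assertion might look like (hypothetical helper, not existing code): record the first thread that touches the generator and check on every subsequent use.

import threading

class SingleThreadGuard:
    # Records the first thread that uses the generator and raises if
    # any other thread touches it later.
    def __init__(self):
        self._owner_id = None

    def check(self):
        current = threading.get_ident()
        if self._owner_id is None:
            self._owner_id = current
        elif self._owner_id != current:
            raise RuntimeError("random generator used from multiple threads")

Calling check() once per sampler invocation (rather than per draw) would keep the overhead negligible.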

> Regardless, we should document it well.

Agreed.

Overall, I'd say let's go ahead and integrate this into master. The risk seems low and manageable. Also, if we start using it now, we get to better assess any potential issues before more critical deployment of the package.

chinandrew commented 2 years ago

Sounds good, I'll work on opening a PR. Is there any reason the random functions are structured the way they currently are, versus having a separate random functions module that tilted_stable and polyagamma import from?

aki-nishimura commented 2 years ago

You mean, why not have one module providing rand_unif and rand_standard_normal? I think it's because 1) I didn't really know that I could cimport modules at compile time when I was a cython newbie (how embarrassing), 2) I thought I might make them into independent modules some day (which never came), and 3) when I was just using random.random(), the code was so simple that the duplication didn't look too bad.

In any case, your suggestion to make a separate module makes perfect sense, especially now that we are switching to the numpy-based random generator.
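For instance, a sketch of such a shared module (hypothetical file and function names):

# bayesbridge/random/univariate.pxd (declarations that tilted_stable.pyx
# and polya_gamma.pyx could cimport):
#
#     from numpy.random cimport bitgen_t
#     cdef double rand_unif(bitgen_t* bitgen)
#     cdef double rand_standard_normal(bitgen_t* bitgen)

# bayesbridge/random/univariate.pyx:
from numpy.random cimport bitgen_t
from numpy.random.c_distributions cimport (
    random_standard_normal, random_standard_uniform)

cdef double rand_unif(bitgen_t* bitgen):
    return random_standard_uniform(bitgen)

cdef double rand_standard_normal(bitgen_t* bitgen):
    return random_standard_normal(bitgen)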