Closed chinandrew closed 2 years ago
Here's a basic benchmark between the two:
generating 10000000 numbers, averaged over 20 runs
np normal 21.66382449500088
builtin normal 27.002115658004186
np unif 20.510443096995004
builtin unif 7.54976275300578
Not sure if the uniform is slower due to different algorithm or implementation (or both). Normal is noticeably faster. The differences might be minor in the grand scheme of a big MCMC run, but if there's better statistical properties with numpy, that seems desireable.
The Numpy functions in the above benchmark are not vectorized (nor cythonized), right? In that case, I think most of the costs are coming from function call overheads. On my iMac, I see that Numpy is about 15 ~ 20 times faster when vectorized (i.e. without the function call overhead from Python-C interactions).
My idea was that numpy.random's C-API should improve the performance by getting rid of the Python-C interaction overhead: https://numpy.org/devdocs/reference/random/c-api.html
Performance gain may still be something like 10 ~ 20% though, judging from my crude eyeballing of prun
with # cython: profile=True
. (I'd be happy to learn if there is a more precise and systematic way of profiling and optimizing C-extensions.) So the improvement in speed and statistical properties would be nice, but perhaps not critical.
These are cythonized, drawing individual numbers within a loop in C. I tried to copy the pattern used in the numpy examples and the polyagamma implementation you linked, with the bitgenerator called once (there's some overhead in instantiating it) and then just running a loop of single samples within cython for testing.
Example for standard normal function:
import random
from cpython.pycapsule cimport PyCapsule_GetPointer
from numpy.random import default_rng
from numpy.random cimport BitGenerator, bitgen_t
from numpy.random.c_distributions cimport random_standard_normal, random_standard_uniform
cdef const char* BITGEN_NAME = "BitGenerator"
cpdef double standard_normal(int i):
cdef BitGenerator bitgenerator
cdef bitgen_t* bitgen
bitgenerator = <BitGenerator>(default_rng(None)._bit_generator)
for _ in range(i):
bitgen = <bitgen_t*>PyCapsule_GetPointer(bitgenerator.capsule, BITGEN_NAME)
with bitgenerator.lock:
random_standard_normal(bitgen)
running a loop of single samples
Meaning i = 1
? (I guess so because np.random.rand(10 ** 7)
only takes 50 milliseconds on my iMac.) If so, I am not sure if that benchmark is representative of the performance we would see when we embed random_standard_normal(bitgen)
in the polyagamma
sampler. annotate = True
clearly shows Python-C interaction when random.random()
is called, which I believe is the main reason why sampling from uniform/normals takes so much time, and getting rid of that should improve the performance?
Ah sorry wasn't clear. i
was 10,000,000 in that example. The implementation will probably have the bitgenerator as a class attribute for the polyagamma object + a method to get a random value from it. My computer also is fast when using np.random.rand(10**7)
, but I did the loop since it seemed like the code is calling the random functions within loops, e.g.
while not accepted:
X = self.rand_unit_shape_invgauss(mean)
Agree that this may not be too representative, I'll work on implementing it and running an actual test MCMC to see what happens.
Hm initial tests don't show a huge difference when running the following basic example:
import numpy as np
from tests.helper import simulate_data
from bayesbridge.model import LinearModel, LogisticModel, CoxModel
from bayesbridge import BayesBridge, RegressionModel, RegressionCoefPrior
y, X, beta = simulate_data(model='logit', seed=0, n_obs=1000, n_pred=500)
model = RegressionModel(y, X, family='logit')
bridge_exp = .25
# Two samples should agree since the default prior is scale invariant.
prior = RegressionCoefPrior(
bridge_exponent=bridge_exp,
regularizing_slab_size=1.,
_global_scale_parametrization='raw'
)
bridge = BayesBridge(model, prior)
mcmc_output = bridge.gibbs(
n_iter=10, n_burnin=0,
coef_sampler_type='cholesky',
init={'apriori_coef_scale': .1},
seed=0, n_status_update=0
)
Both implementations land between 8.5 and 10.5 seconds across a bunch of runs. Here's one example - I ran 3-4 without significant differences.
# 9.82 s ± 272 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) # original implementation
# 9.57 s ± 213 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) # new np implementation
Annotations show potential python interaction in the functions, but it shows even more when I have it annotate the examples numpy provides :shrug:. I'll need to dig a bit deeper to find out what's going on.
A slightly hacky idea that came to mind if for some reason we can't get this working is to trade memory for performance by generating a large list of random values once using np.random.rand(10 ** 7)
and then just accessing each value when we need a new random number, regenerating the batch as needed. Would cost a few mb of memory to store, but the generation only takes a few ms instead of a few seconds. Hopefully it doesn't come to that :crossed_fingers:
Annotations show potential python interaction in the functions, but it shows even more when I have it annotate the examples numpy provides 🤷.
Ah, that's too bad. I thought Numpy's C interface is designed to get rid of Python-C interactions, but...??
The cost of linear algebra is $O(n p^2 + p^3)$ for sampling the regression coefficients from Gaussian while the cost of sampling $n$ polya-gamma auxiliary variables is $O(n)$. So, the polya-gamma sampler cost becomes noticeable when $n$ is substantially larger than $p$, e.g. n = 10^6, p = 10^3. (Incidentally, often we still need regularization when the data is sparse.)
Would the new implementation make any difference if we just benchmark the polyagamma sampler? (Potentially useful piece of code for this purpose: https://github.com/aki-nishimura/bayes-bridge/blob/master/bayesbridge/random/polya_gamma/test_polyagamma.ipynb)
A slightly hacky idea that came to mind if for some reason we can't get this working is to trade memory for performance by generating a large list of random values once using np.random.rand(10 ** 7) and then just accessing each value when we need a new random number, regenerating the batch as needed.
I too thought about that before. At the time I decided against it because using a suboptimal Unif[0, 1] random number generator seems more maintainable and easier to optimize later. But maybe I am just being too OCD about hacky code.
Another possibility is simply to paste in the C code behind Numpy's random stream: https://www.pcg-random.org/. I don't know C/C++ super well, so I decided not to bother at the moment, but maybe it is not that difficult if you know C/C++.
I think I've identified why we're not seeing huge speedups with the individual numpy calls. When it's generating an array of numbers at once, it holds the bitgenerator lock for the entire loop, e.g. something like
cdef double func(...)
...
with bit_generator.lock:
for i in range(n):
# generate random number
return
versus if we want individual numbers, we'd do something like this
cdef double func(...)
....
with bit_generator.lock:
return # generate random number
# elsewhere
for i in range(n):
func()
That hold/release seems to take up a bunch of time. Here's the difference just generating numbers, with random_test2
having the lock statement and random_test
omitting it, both in the latter pattern from the above examples. The argument is how many numbers are generated:
In [3]: import random_test2, random_test
In [4]: %time random_test.random_normal_test(10**7)
CPU times: user 224 ms, sys: 261 µs, total: 224 ms
Wall time: 224 ms
Out[4]: -0.870312511920929
In [5]: %time random_test2.random_normal_test(10**7)
CPU times: user 2.25 s, sys: 0 ns, total: 2.25 s
Wall time: 2.24 s
Out[5]: -0.870312511920929
When I did a full MCMC run with y, X, beta = simulate_data(model='logit', seed=0, n_obs=10000, n_pred=50)
, the lock version took 6.5-7 minutes while the one without took 2.5 min (I did 2 runs each)
I don't entirely know what's going on at a low level (and numpy's lock documentation is excellent :facepalm: . Maybe it's ok if we're always on one thread, but to be safe we probably should keep with numpy's suggestion that "Code that generates values from a bit generator should hold the bit generator’s lock." and keep it in?
There might be marginal gains that could be made by wrapping the while not accepted:
statements with the lock, but it would require more involved code changes.
It might not be that messy to do the "hacky" approach as long as we keep the interfaces clean and abstract away the batch generating and caching behaviour from the other functions. I can poke around with it and see.
I also looked into using some of the C code, but I think I'm in the same boat as you and would need some time to figure out the C/C++ side. I haven't used it in ages, and even then only for the most basic of tasks.
Couple more benchmarks, generating 10^8 uniform numbers:
Without generator lock:
1 s ± 33.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With generator lock:
10.8 s ± 296 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With cached batches of size 10^7 (so np.random.random
is called 10 times).
4.28 s ± 48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It scales pretty linearly, so 10**9 takes 10x longer across all configurations.
Really useful findings, thanks for looking into these! (I wonder why the caching approach takes so much more time than the no generator lock case, but oh well.)
Hm, the outputs from with and without lock are the same as far as you can tell, correct? I found this stackoverflow answer: https://stackoverflow.com/a/44520130/5721911 which points to this piece of code in an old version of numpy's random generator: https://github.com/numpy/numpy/blob/7cfec2403486456b52b525eccf7541e1562d9ab3/numpy/random/mtrand/mtrand.pyx#L163
Looking at the above makes me think that it is OK to skip the lock unless we start using nogil
(which I currently don't do anywhere in the code). I can at least confirm that the BitGenerator.lock is just threading.Lock() for now:
https://github.com/numpy/numpy/blob/main/numpy/random/bit_generator.pyx#L500
What do you think?
Hm, the outputs from with and without lock are the same as far as you can tell, correct? Yup, at least by comparing the first few thousand outputs.
I found this stackoverflow answer:
https://stackoverflow.com/a/44520130/5721911 which points to this piece of code in an old version of numpy's random generator: https://github.com/numpy/numpy/blob/7cfec2403486456b52b525eccf7541e1562d9ab3/numpy/random/mtrand/mtrand.pyx#L163
Looking at the above makes me think that it is OK to skip the lock unless we start using
nogil
(which I currently don't do anywhere in the code). I can at least confirm that the BitGenerator.lock is just threading.Lock() for now: https://github.com/numpy/numpy/blob/main/numpy/random/bit_generator.pyx#L500
Interesting, thanks for the links. I think (but am no expert, so there may be unknown unknowns) if we're always on a single thread it should be ok as you said, but we have to be careful if that changes in the future. Depends on how likely we are to change it + the risk we forget, as well as how bad of an issue it'd be if someone used it (e.g. can't reproduce runs with the same seed). The performance improvement certainly makes it appealing. I also wonder if we could somehow test this, like with some assertion that only one thread ever touches the generator. Regardless we should document it well.
Depends on how likely we are to change it
Given the substantial performance gain from just switching to numpy.random, we won't have to look into multi-threading any time soon, I think. (And I don't quite understand how to parallelize random number generations properly: https://bashtage.github.io/randomgen/parallel.html#.) And these things will be better documented in the future.
as well as how bad of an issue it'd be if someone used it (e.g. can't reproduce runs with the same seed)
Hm, I was thinking that the output will be so wrong that we will find it out pretty soon. Non reproducibility (without obvious catastrophe) would be more insidious.
I also wonder if we could somehow test this, like with some assertion that only one thread ever touches the generator.
Great idea. The closest I have found is this: https://stackoverflow.com/a/25666624/5721911. We could check that the polyagamma sampler has GIL before it starts sampling; this won't prevent release of GIL within a function, but maybe still useful.
Regardless we should document it well.
Agreed.
Overall, I'd say let's go ahead and integrate this into master. The risk seems low and manageable. Also, if we start using it now, we get to better assess any potential issues before more critical deployment of the package.
Sounds good, I'll work on opening a PR. Is there any reason you currently have the random functions structured the way they currently are vs having a separate random functions module that tilted_stable and polyagamma import from?
You mean, why not have one module providing rand_unif
and rand_standard_normal
? I think it's because 1) I didn't really know that I can cimport
modules at compile time when I was a cython newbie (how embarrassing), 2) I thought I might made them into independent modules some day (which never came), and 3) when I was just using random.random()
, the code was so simple that duplication didn't look too bad.
In any case, your suggestion to make a separate module makes perfect sense, especially now that we are switching to the numpy-based random generator
Currently, random.random() is called for uniform random numbers, and normal variables are generated via a polar method implementation. Switching these to numpy's equivalent functions should improve performance and have slightly better statistical properties.
Unif: https://github.com/aki-nishimura/bayes-bridge/blob/master/bayesbridge/random/polya_gamma/polya_gamma.pyx#L24 https://github.com/aki-nishimura/bayes-bridge/blob/master/bayesbridge/random/tilted_stable/tilted_stable.pyx#L45
Polar Normal: https://github.com/aki-nishimura/bayes-bridge/blob/master/bayesbridge/random/polya_gamma/polya_gamma.pyx#L217 https://github.com/aki-nishimura/bayes-bridge/blob/master/bayesbridge/random/tilted_stable/tilted_stable.pyx#L328
Example of cython code that does this: https://github.com/zoj613/polyagamma/blob/main/polyagamma/_polyagamma.pyx