bayesmix-dev / bayesmix

Flexible Bayesian nonparametric mixture models in C++
https://bayesmix.rtfd.io
BSD 3-Clause "New" or "Revised" License

Performance improvements #68

Closed mberaha closed 3 years ago

mberaha commented 3 years ago

I've been profiling our code with callgrind; here are the main takeaways:

  1. We are slow at evaluating the density given the MCMC chain
  2. We are slow at evaluating the multivariate student t

In particular, here are some timings:

Apart from the 'free' one, the others are easy to understand: marg_lpdf calls stan::math::multi_student_t_lpdf, which requires evaluating (x - mu)^T \Sigma^{-1} (x - mu) and det(\Sigma), much as in the multivariate normal case. To this end, we should work with Cholesky factors as we do for the multivariate normal; a prototype of this function is already implemented. When using NNW hierarchies, we can also work with Cholesky factors at the prior level (instead of storing scale, we should store scale_chol; see the test in test/distributions.cc). This should greatly reduce the computational cost.

NNWHierarchy::set_state_from_proto also requires a Cholesky factorization, which could be avoided if, instead of storing the precision matrix, we store its Cholesky factor (or both). I would just store the Cholesky factor; users would then need to perform a matrix multiplication to retrieve the full precision matrix.

We should make use of vectorization when evaluating the _lpdf on a grid (for instance by using Eigen's .unaryExpr, or Stan's map_rect, which I don't fully understand but should also work in parallel) and parallelize the density evaluations as much as possible.

Finally, I was also thinking about moving burnin and maxiter outside of the algorithm and having a run(int burnin, int maxiter, int thin), which also introduces thinning, so that we can evaluate fewer densities than the number of iterations.

brunoguindani commented 3 years ago
> • 20% on performing 'free' (this is a strange one)

Yeah, I remember this was also the case in the days of old, when we did profiling for the PACS report. It was mainly called by stan::return_type< Eigen::MatrixXd ... >, as I recall. I thought it was type conversions due to the templatized return types in Stan or something.

> Apart from the 'free' one, the others are easy to understand: marg_lpdf calls stan::math::multi_student_t_lpdf, which requires evaluating (x - mu)^T \Sigma^{-1} (x - mu) and det(\Sigma), much as in the multivariate normal case. To this end, we should work with Cholesky factors as we do for the multivariate normal; a prototype of this function is already implemented. When using NNW hierarchies, we can also work with Cholesky factors at the prior level (instead of storing scale, we should store scale_chol; see the test in test/distributions.cc). This should greatly reduce the computational cost.
>
> NNWHierarchy::set_state_from_proto also requires a Cholesky factorization, which could be avoided if, instead of storing the precision matrix, we store its Cholesky factor (or both). I would just store the Cholesky factor; users would then need to perform a matrix multiplication to retrieve the full precision matrix.


Sure, it's an excellent idea! We already use lots of structs that hold optimization-related factors anyway.

> We should make use of vectorization when evaluating the _lpdf on a grid (for instance by using Eigen's .unaryExpr, or Stan's map_rect, which I don't fully understand but should also work in parallel) and parallelize the density evaluations as much as possible.
>
> Finally, I was also thinking about moving burnin and maxiter outside of the algorithm and having a run(int burnin, int maxiter, int thin), which also introduces thinning, so that we can evaluate fewer densities than the number of iterations.


These are also good ideas, especially the thinning option, which I understand to be standard for MCMC software. Maybe with sensible default values -- 100 / 1000 / 2? Is that too small a sample size? But we should definitely take advantage of the fact that using thousands of iterations hardly brings any improvement over even a few hundred.

mberaha commented 3 years ago

So things are really starting to move pretty quickly. With respect to the master branch, we have the following benchmark differences:

BM_Neal2/1 -0.0562
BM_Neal2/2 -0.1705
BM_Neal2/4 -0.0288
BM_Neal2/8 -0.1088
BM_Neal3/1 -0.0574
BM_Neal3/2 -0.1239
BM_Neal3/4 -0.1821
BM_Neal3/8 -0.1946
BM_eval_lpdf_memory_read/1 -0.3055
BM_eval_lpdf_memory_read/2 -0.4856
BM_eval_lpdf_memory_noread/1 -0.3247
BM_eval_lpdf_memory_noread/2 -0.4891

where the differences are (old time - new time) / old time. They look pretty big to me. The differences on the runs are due to the use of Cholesky factors, while the ones on eval_lpdf are due to changing the loop order from iteration -> grid points -> clusters to iteration -> clusters -> grid points. In the latter we also use the eval_lpdf_grid methods, which can be further optimized IMO.

brunoguindani commented 3 years ago

These are some pretty neat improvements indeed!

> In the latter we also use the eval_lpdf_grid methods, which can be further optimized IMO

Do you mean parallelization, or do you have something else in mind?

mberaha commented 3 years ago

> In the latter we also use the eval_lpdf_grid methods, which can be further optimized IMO

> Do you mean parallelization, or do you have something else in mind?

I would try as hard as possible to parallelize over the MCMC iterations, because that will provide (I believe) linear speedup in the number of cores.

For the lpdf_grids, we could do something smarter, for instance by using Eigen's optimized methods (e.g. array.unaryExpr()) instead of manually looping through the matrix, and/or save some computations. For instance, when we call like_lpdf_grid from NNWHierarchy, the only thing that changes across evaluations is the term (datum - mean), while we also recompute prec_logdet + NEG_LOG_SQRT_TWO_PI * datum.size() every time. This might look like small stuff, but for a grid of 100k elements (not so surprising in 2D) it might make a substantial difference, all else being equal.

brunoguindani commented 3 years ago

Right right. Sounds feasible.

mberaha commented 3 years ago

Ok, so here it goes: we gain a stable 20% over Neal2 and Neal3. Moreover, the new optimized multivariate normal and multivariate t over grids, plus switching the for loops, give us a 15x speedup in eval_dens.

I guess this is enough for a PR.

About the parallel density evaluation, I'm still not sure what the best way to proceed is, especially in order to also deal with the FileCollector. I would probably read a batch of states and then process them in parallel. However, this is kind of a lot of code and maybe deserves a separate class (DensityEvaluator?) or utility function.

brunoguindani commented 3 years ago

Let's discuss the evaluator class. In principle I would agree, because an Algorithm should only focus on the MCMC chain. However, this entails creating a structure of evaluators which kind of mirrors the Algorithm inheritance tree, since marginal and conditional estimates are computed differently. Some algorithms may even have their own quirks, e.g. Neal8's marginal component. This would be both pointless and difficult to maintain. What do you think?

By the way, if we do go with an evaluator class, we should definitely also move clustering estimation there, or at least implement both of them as derived classes of a base estimator. Most of the parallelizing functions would be shared between the two anyway.

About parallelism, I really don't know much about OpenMP. But what I would do is transform curr_state into a vector which holds m states at a time, with m being the number of cores. We loop next_state() m times to fill the vector, then we pass each state to a different core. Then rinse and repeat. I guess this is not full parallelization, since we still need to wait for the slowest core before reading the next batch, but there's not much we can do with an inherently sequential structure like the collector.

mberaha commented 3 years ago

One possibility would be for the algorithms to implement an Eigen::VectorXd lpdf_from_state(Eigen::MatrixXd grid, bayesmix::AlgorithmState state); then the DensityEvaluator could store a pointer to the algorithm. In its simplest form, the density evaluator loops over the MCMC chain and calls algo->lpdf_from_state(...), possibly in parallel.

How to achieve the best parallelism, I'm still not sure. Ideally we want something asynchronous (i.e. the first thread that is free reads the next chunk; when the other threads finish what they are doing, they take up some workload), but I barely know the basics of async programming in Python, and C++ seems much more complex.

brunoguindani commented 3 years ago

> One possibility would be for the algorithms to implement an Eigen::VectorXd lpdf_from_state(Eigen::MatrixXd grid, bayesmix::AlgorithmState state); then the DensityEvaluator could store a pointer to the algorithm. In its simplest form, the density evaluator loops over the MCMC chain and calls algo->lpdf_from_state(...), possibly in parallel.

I like this idea, especially since it should work with a shallow, uninitialized copy of an Algorithm. It's a bit convoluted, but it simplifies the library structure (as in, it does not introduce additional complexity), so I'm all for it.

> How to achieve the best parallelism, I'm still not sure. Ideally we want something asynchronous (i.e. the first thread that is free reads the next chunk; when the other threads finish what they are doing, they take up some workload), but I barely know the basics of async programming in Python, and C++ seems much more complex.

We sure should work asynchronously. We should ask someone who knows.

As it stands, I think this PR may be merged whenever you wish.

mberaha commented 3 years ago

Feel free to bring onboard any asynchronous friend of yours...