Stateline is a framework for distributed Markov Chain Monte Carlo (MCMC) sampling written in C++. It implements random walk Metropolis-Hastings with parallel tempering to improve chain mixing, provides an adaptive proposal distribution to speed up convergence, and allows the user to factorise their likelihoods (e.g. over sensors or data). For a brief introduction to these concepts, see the MCMC Sampling primer below.
Stateline then provides the framework for deploying the distributed MCMC computation on a cluster, while allowing the users to focus on computing the posterior density with their own models, languages and environments. If you are already familiar with parallel tempering and adaptive proposals, you may wish to skip straight to Stateline's features.
MCMC is a widely used algorithm for sampling from a probability distribution given only its unnormalised posterior density. Consequently, MCMC provides a general solution to a wide class of Bayesian inference problems prevalent in scientific research.
An effective strategy to sample in high dimensions is a random-walk Markov chain, which uses the current state of a chain of samples to propose a new state. The proposal is usually a simple distribution g, such as a Gaussian centered on the current state. This allows the algorithm to exploit the structure of the posterior - if we propose a state similar to the last draw, then it is likely to be accepted. With a well designed proposal scheme, an efficient acceptance rate can be achieved even in high dimensions. This comes at the cost of sample correlation, because the draws are no longer independent. To actually achieve independent draws from the posterior requires taking multiple Markov steps. The simplest form of random-walk MCMC is the Metropolis-Hastings algorithm, which accepts a step from x to x’ with probability A:

A(x → x’) = min{ 1, [P(x’) g(x | x’)] / [P(x) g(x’ | x)] }
Here, P is the (unnormalised) target density and g is the density of the proposal function for a transition from x to x’. Because this acceptance rule satisfies detailed balance, the resulting Markov chain has the target distribution as its equilibrium distribution.
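As a concrete illustration, a random-walk Metropolis-Hastings sampler with a symmetric Gaussian proposal can be written in a few lines. This is a generic sketch, not Stateline's implementation; the target `log_p` and the step size are arbitrary choices. With a symmetric proposal the g terms cancel, so the acceptance probability reduces to min(1, P(x’)/P(x)):

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    for i in range(n_samples):
        proposal = x + step * rng.standard_normal(x.size)
        # Accept with probability min(1, P(x')/P(x)), computed in log space
        # for numerical stability.
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal
        samples[i] = x
    return samples

# Sample a standard 2D Gaussian via its unnormalised log-density.
samples = metropolis_hastings(lambda x: -0.5 * x @ x, np.zeros(2), 5000)
```

Note the chain records the current state even when a proposal is rejected; dropping rejected steps would bias the samples.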
The design objective of a proposal distribution is to minimise the number of proposals required per independent sample from the posterior. A conservative proposal that takes small steps may have a high acceptance rate, but will produce a highly autocorrelated chain. Conversely, a broad proposal that attempts large steps will suffer from a low acceptance rate. Studies in the literature have shown that a convergence 'sweet spot' exists; for example, an accept rate of 0.234 is optimal if the likelihood is a separable function of each component of a high-dimensional parameter space θ [Roberts97](#references).
Because the optimal proposal distribution depends strongly on the target posterior, it is important for an MCMC tool to provide an automatic adaptation mechanism. Care must be taken, because naive adaptation of the proposal during sampling may invalidate the ergodicity of the MCMC chain, causing the resulting samples to be drawn from a different distribution. The simplest sufficient condition guaranteeing correctness is the diminishing adaptation principle [Roberts06](#references). Stateline learns a proposal scale factor that, together with the empirical covariance of the samples, forms a diminishing-adaptation proposal function [Andrieu08](#references).
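A minimal sketch of the diminishing adaptation idea follows, assuming a simple stochastic update of the log proposal scale with a step size that decays as t^-0.6, so the adaptation vanishes asymptotically. Stateline's actual scheme uses a regression model and the empirical sample covariance, so this is only illustrative:

```python
import numpy as np

def adaptive_mh(log_p, x0, n_samples, target_accept=0.234, rng=None):
    """Random-walk MH with a diminishing-adaptation proposal scale.

    The log proposal scale is nudged toward the target accept rate with a
    decaying step size, satisfying diminishing adaptation. Illustrative
    only: not Stateline's regression-based scheme.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    log_scale = 0.0
    samples = np.empty((n_samples, x.size))
    for t in range(1, n_samples + 1):
        proposal = x + np.exp(log_scale) * rng.standard_normal(x.size)
        # min(1, e^diff) computed as e^min(0, diff) to avoid overflow.
        accept_prob = float(np.exp(min(0.0, log_p(proposal) - log_p(x))))
        if rng.uniform() < accept_prob:
            x = proposal
        # Diminishing adaptation: the update size decays as t^-0.6.
        log_scale += t ** -0.6 * (accept_prob - target_accept)
        samples[t - 1] = x
    return samples, np.exp(log_scale)

samples, scale = adaptive_mh(lambda x: -0.5 * x @ x, np.zeros(2), 5000)
```

If the accept rate runs above target, the scale grows to take bolder steps, and vice versa; because the updates shrink over time, the chain's equilibrium distribution is preserved.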
Even with a well tuned proposal distribution, MCMC chains tend to become trapped in local modes of the target distribution. A distribution that highlights this problem is the double-well potential below. A proposal that can jump between the modes in one step is too coarse to effectively explore the individual modes, while the probability of accepting an intermediate state between them is almost zero. Thus a standard MCMC approach would explore a single mode and have no knowledge of the other:
Figure: (Left) A double-potential well target distribution is a challenging problem for MCMC sampling. (Right) The same distribution raised to the ⅛ power is much easier to sample because the modes are bridged.
On the other hand, the modes of the right-hand distribution are bridged, so a Markov chain is able to mix between them effectively. Parallel tempering (Metropolis-coupled Markov chain Monte Carlo) exploits this concept by constructing multiple chains: one explores the base distribution, while the others run at increasingly higher ‘temperatures’. A chain at temperature T sees a distribution equivalent to the original density raised to the fractional power 1/T, bringing it closer to uniform (the distribution on the right has a temperature of 8 times that of the one on the left). The within-chain acceptance criterion is therefore modified to take the temperature T into account:

A(x → x’) = min{ 1, [P(x’)/P(x)]^(1/T) · g(x | x’)/g(x’ | x) }

The chains then exchange information by proposing swaps of state (the exchanges themselves form a Markov chain), accepted with probability:

A_swap(θi, θj) = min{ 1, [𝛷(θj)/𝛷(θi)]^(1/Ti) · [𝛷(θi)/𝛷(θj)]^(1/Tj) }
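The two acceptance rules above can be combined in a minimal, sequential parallel-tempering loop. This is illustrative only - Stateline runs its chains and swaps asynchronously across workers, with adaptive proposals and temperatures - and the temperature ladder, step size and double-well target here are arbitrary choices:

```python
import numpy as np

def parallel_tempering(log_p, x0, n_samples,
                       temps=(1.0, 2.0, 4.0, 8.0), step=0.5, rng=None):
    """One MH chain per temperature plus pairwise swaps between neighbours."""
    rng = rng or np.random.default_rng(0)
    betas = np.array([1.0 / t for t in temps])   # inverse temperatures
    states = [np.array(x0, dtype=float) for _ in temps]
    logps = [log_p(s) for s in states]
    cold = np.empty((n_samples, len(x0)))
    for n in range(n_samples):
        # Within-chain step: accept with min(1, [P(x')/P(x)]^(1/T)).
        for i, beta in enumerate(betas):
            prop = states[i] + step / np.sqrt(beta) * rng.standard_normal(len(x0))
            lp = log_p(prop)
            if np.log(rng.uniform()) < beta * (lp - logps[i]):
                states[i], logps[i] = prop, lp
        # Swap a random neighbouring pair, accepted with
        # min(1, [P(θj)/P(θi)]^(1/Ti) · [P(θi)/P(θj)]^(1/Tj)).
        i = rng.integers(len(temps) - 1)
        if np.log(rng.uniform()) < (betas[i] - betas[i + 1]) * (logps[i + 1] - logps[i]):
            states[i], states[i + 1] = states[i + 1], states[i]
            logps[i], logps[i + 1] = logps[i + 1], logps[i]
        cold[n] = states[0]
    return cold

# A 1D double-well target with modes near x = -2 and x = +2.
log_dwell = lambda x: -(x[0] ** 2 - 4.0) ** 2 / 2.0
samples = parallel_tempering(log_dwell, [2.0], 20000)
```

The hot chains cross the barrier easily and the swaps propagate those crossings down to the cold chain, which visits both wells even though a single chain at T = 1 would typically stay trapped in one.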
Here Ti is the temperature of the i’th temperature chain, 𝛷 is the target density and θi is the state of chain i.

##### Convergence Heuristics

Chain convergence can be inferred by running multiple independent MCMC chains (stacks) and comparing their statistical measures. If the chains are exploring different sets of modes, this can be detected. Otherwise we must assume they are mixing adequately, although it remains possible that all the chains have failed to discover a mode (parallel tempering reduces the probability of this happening). Stateline employs the approach of [Brooks98](#references).

### Why Stateline

Stateline is designed specifically for difficult inference problems in computational science. We assume that the target distribution may be highly non-Gaussian, that the data we are conditioning on may be related to the model parameters in a highly non-linear way, and that the observation models might be expensive ‘black box’ functions such as the solutions to numerical simulations. Numerous technical capabilities have been incorporated into the Stateline codebase specifically to improve usability and functionality in scientific applications:

##### Cluster Computing Architecture

Stateline provides an architecture for deploying the MCMC model evaluations onto a cluster environment. The key architecture is shown below:
Figure: The server/worker architecture used by Stateline. The server drives multiple MCMC chains using the parallel tempering algorithm. The components of the likelihood evaluations are submitted into a job queue, where a scheduler farms them out to worker nodes on a cluster for computation. The server asynchronously collates the likelihoods and advances each chain when ready.
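To make the collation idea concrete, here is a minimal sketch of farming out factorised likelihood components and summing the results as they complete. It uses a local thread pool in place of Stateline's ZeroMQ job queue and scheduler, and `component_log_likelihood` is a hypothetical stand-in for a user's forward model:

```python
from concurrent.futures import ThreadPoolExecutor

def component_log_likelihood(job_type, theta):
    # Hypothetical per-component likelihood; a real Stateline worker would
    # run the user's forward model (e.g. a numerical simulation) here.
    return -0.5 * (theta - job_type) ** 2

theta = 1.5
with ThreadPoolExecutor(max_workers=3) as pool:
    # One job per job type; the jobs may complete in any order.
    futures = [pool.submit(component_log_likelihood, j, theta) for j in range(3)]
    # Because the components are independent given theta, the total
    # log-likelihood is just the sum of the parts.
    total_log_likelihood = sum(f.result() for f in futures)
```

Stateline's server follows the same pattern at cluster scale: it only needs the collated total for each proposed state, so each component can be evaluated wherever a worker is free.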
Distributing computation onto a cluster is critical for problems that have expensive forward models, or that require many samples for convergence. Support is provided for cluster deployments, particularly on Amazon Web Services and Docker. See [Cluster Deployment](#cluster-deployment) for further details.

##### Heterogeneous Likelihood Computations

The Stateline user is responsible for writing code to evaluate the posterior density of the target distribution at a given input. We refer to this code as the likelihood function. We take steps to make this process as flexible as possible:

* Likelihoods are black-box function calls - users can plug in arbitrarily complex scientific simulations implemented in their chosen languages. They can even call libraries or command-line tools. The workers simply communicate the results to Stateline using [ZeroMQ](http://www.zeromq.org), a widely available, light-weight communication library (see [Other Languages](#other-languages)).
* Likelihoods can be factorised into components using the [job-type specification](#tips-and-tricks). For example, if two sensors independently measure a process, then each sensor's forward model can be computed in parallel on a different worker process.

##### Feature-rich Sampling Engine

Stateline employs a number of techniques to improve mixing and allow chains to explore multi-modal posteriors:

* Multiple independent stacks are computed in parallel for convergence monitoring (the number is selected by the user).
* Parallel tempering is implemented with an asynchronous algorithm that propagates swaps between pairs of chains without stalling the others.
* The chains are grouped into tiers with shared parameters, so the nth-hottest chain in each stack shares a common temperature and proposal length.
* Adaptive proposals are implemented (on a per-tier basis) based on Algorithm 4 from [Andrieu08](#references).
  We use a Gaussian proposal with covariance equal to a scale factor times the empirical covariance of the samples.
* The proposal scale factor is adapted using a novel regression model that learns the relationship between the accept rate and the proposal width. This model is incrementally updated with each proposal, and exhibits diminishing adaptation as its training dataset grows.
* The chain temperatures of each tier are also adapted, using an on-line regression approach to target a desired swap rate.

##### Ease of Use

The Stateline server is relatively simple to operate:

* Users only need to understand the high-level concepts of how parallel tempering works. A lightweight [Stateline Configuration File](configuration) in JSON format is used to configure the algorithm parameters.
* Stateline provides a natural way to spin up a server and an arbitrary number of workers on the Amazon Web Services Elastic Compute Cloud (EC2). An example using the Clusterous tool is provided (see [Cluster Deployment](#cluster-deployment)).

Detailed logging is available (even when the system is deployed on a cluster):

* console output displays a table showing the chain energies, stacks, accept and swap rates
* a multiple-sequence diagnostic is used to detect convergence - the independent stacks are analysed with a necessary (but not sufficient) condition for convergence [Brooks98](#references)

Finally, Stateline's [output](#mcmc-output) is provided in CSV format, so it is simple to load and analyse. The output is written incrementally in case of early termination.

## System Requirements

Stateline has been successfully compiled on Linux and OSX machines. We don't currently support Windows. For large-scale deployments, we recommend using Docker (and the Dockerfile included in this repo).
To build Stateline, you will need the following:

* GCC 4.8.2+ / Clang 3.6.0+
* CMake 3.0+

You will need to install the following libraries through your operating system's package manager:

* Boost 1.58+
* Eigen 3.2.0+
* google-test 1.7.0+
* zeromq 4.0+

To run the Python demos, you will also need:

* Python 2.7 / 3.4+
* pyzmq
* numpy
* corner-plot (Python library)

## Installation

First clone the repository and create a directory in which to build it:

```bash
$ git clone https://github.com/NICTA/stateline.git
$ mkdir stateline-build
$ cd stateline-build
```

Note that it is perfectly okay to make the build directory inside the `stateline` repository, like so:

```bash
$ mkdir stateline/build
$ cd stateline/build
```

To build Stateline from your build directory, call `cmake` and then `make`:

```bash
$ cmake ../stateline
$ make
```

If your build directory was `stateline/build`, then the first line above is simply `cmake ..`. You can set variables here to control the build configuration. If CMake has trouble finding a dependency, you can help by specifying its location. For example, you can specify the location of the Google Test sources by changing the first line above to `cmake ../stateline -DGTEST_ROOT=/opt/actual/location/of/gtest-1.7.0`. You can specify the build type by giving the `-DCMAKE_BUILD_TYPE=` flag.
Figure: Visualisation of the histograms of samples from the Stateline demo, showing the joint distribution and marginals of the first two dimensions.
Viewing the raw histograms of the parameters is informative for a low-dimensional problem like this demo.

## Cluster Deployment

Stateline is designed to take advantage of many computers performing likelihood evaluations in parallel. The idea is to run the server on a single machine, with many workers communicating with it over TCP. Workers can be ephemeral -- if a worker disappears mid-job, that job will be reassigned to another worker by the server (after a few seconds). At the moment the server does not support recovering from early termination, so place it on a reliable machine if possible. The server also needs at least 2 cores to work effectively, so provision it with decent hardware. The default port Stateline uses is 5555, but this can be changed with the `-p` argument to the Stateline server.

There is a Dockerfile ready to go which has both the server and the worker built. Feel free to use this as a base image when deploying your code.

## Tips and Tricks

This section addresses some common questions about configuring and using Stateline for a scientific problem:

##### How do I burn-in?

Stateline does not manage sample burn-in and begins recording samples from the initial state onwards. Burn-in is essentially a method of bringing the MCMC chains to a plausible initial state. With infinite samples, it wouldn't matter what state the MCMC chains are initialised in. However, with a finite number of samples, we can improve our chances of achieving convergence by starting in a likely state. We suggest two strategies for burn-in:

* A draw from the posterior is likely to be a good initial state, so one option is to run MCMC on the distribution for a sufficiently long time and discard some initial samples. Stateline records all the samples, so the number to discard can be determined afterwards by looking at the time evolution of the marginals, for example.
* Stateline can optionally have the initial state specified in the config file.
  This initial state can be found efficiently by applying a numerical optimisation library to the worker likelihood code, repurposing it as an optimisation criterion. This will start the chains in a mode of the distribution, and has the added benefit that the initial state can be inspected prior to any MCMC sampling.

##### Job-Types

Stateline can be configured to use job types. For each likelihood evaluation, Stateline farms out one evaluation per job type, providing the worker with the parameters and the job-type index. The actual meaning of a job type is left up to the user who writes the worker code. We recommend two kinds of job-type factorisation:

* Factorising over data - in many cases the likelihoods of subsets of the data are independent given the model parameters. Here the job-type integer can be used as a partition index into the full dataset, so workers can compute the parts in parallel.
* Factorising over sensors - often different data modalities are used, with different forward models relating the observations to the latent parameters. If these sensors are independent given the latent parameters, then the job type can specify which modality's likelihood to compute.

##### How do I achieve uncorrelated samples?

In order to achieve an efficient accept rate, an MCMC chain is necessarily auto-correlated. The best way to obtain uncorrelated samples is to compute the chain's auto-correlation post-sampling, then thin the samples, keeping only one per auto-correlation length.

##### What swap interval should I use?

Any analysis of chain auto-correlation should also be used to tune the swap interval of the parallel tempering. Ideally the swap interval should be equal to the auto-correlation length of the chain. This number is typically 10-50, but may vary strongly depending on the target distribution.

##### What accept rates and swap rates should I target?
Studies in the literature have shown that 0.234 is a safe target accept rate - this rate is optimal if the likelihood is a separable function of each component of a high-dimensional parameter space θ [Roberts97](#references). The optimum is understood to be higher for low-dimensional problems, and may be distorted by non-separable posterior structure. In practice, the effectiveness of MCMC is insensitive to the exact accept rate provided it is kept away from 0 and 1, the non-informative conditions where the chain stalls through no accepts and no proposal perturbations respectively. The optimal swap rate between chains is less well understood, but the same general rules apply.

##### Why aren't my chains achieving the desired accept rate?

When the adaptation fails to achieve the desired accept rate after a moderate number of samples (say 10,000), it is important to look at the accept rate in conjunction with sigma and the current energy to understand why. There are two typical failure modes. Firstly, if sigma is small and the accept rate is still very low, this suggests there is a problem with the likelihood function or the scale of the inputs that needs to be addressed and debugged by the user. This can sometimes occur if inappropriately wide bounds are provided, because the shape of the initial proposal covariance is based on the range of the bounds. It can also occur if the posterior is extremely peaky and initialised on the peak; in these cases, it may be necessary to apply a perturbation to the initial condition. On the other hand, if sigma is very large and the accept rate is still high, as seen in chains 4 and 9 of the example, this suggests that the high-temperature distribution is becoming uniform; this can form a criterion for selecting the number of temperature tiers (see below).

##### How many temperature tiers should I use?
If a high-temperature chain has a large sigma and a higher-than-targeted accept rate, as seen in chains 4 and 9 of the example logging, this suggests that the high-temperature distribution is becoming uniform. The proposal is using the 'bouncy bounds' to essentially draw independent random samples from the input space, and they are still getting accepted. This is not a problem, but it does suggest there will be little further benefit in adding additional temperature tiers. After the betas have adapted, you want the tiers to span all the way from the true distribution (Beta = 1) to a uniform distribution (Beta -> 0). Thus, we want the hottest tier to exhibit the high-sigma, high-accept-rate condition, while the others form a progressive ladder with active swapping.

##### How many stacks should I use?

At least two stacks are required to run the convergence heuristics, and the heuristics become more reliable with more independent stacks. Adding stacks also employs more workers at a time, trivially increasing the samples per second regardless of the minimum time needed to evaluate a likelihood function. However, the stacks need to burn in and converge independently, so the minimum number of samples needed increases proportionately, and the total run-time of the MCMC won't decrease as workers and stacks are added. So again, the correct choice depends on the user's needs. In general, use many stacks if you want the most samples per second, and use 2 stacks if you want the best value, as that uses the minimum number of samples to converge.

##### How many workers?

Stateline can use at most (number of stacks) × (number of temperatures) × (number of job types) workers. In practice this gives the fastest output, but won't fully utilise all the workers. It can be more energy/money/compute efficient to run more than one worker per core, at the user's discretion.

##### How do I analyse the results?
We have provided example code for plotting the density of pairs of dimensions and marginals. This is appropriate for simpler low-dimensional distributions where the parameters are interpretable. However, it will often be the case that the parameters are high-dimensional and correspond to inputs of a complex model. We recommend re-using the same worker code to run models on the sampled parameters. This enables marginalisation of derived properties of the model outputs with respect to the parameters.

## Workers in Other Languages

Creating a worker in a language other than C++ should be fairly simple, as long as that language has ZeroMQ bindings. For the impatient, the approach is the same as in the Python example given above. The way other language bindings work is to run a copy of `stateline-client` for every worker; each worker then communicates with its stateline-client via a local unix socket using ZeroMQ. This means all the complex logic for handling job requests, server heartbeating and asynchronous messaging is invisible, leaving only a very simple loop. In pseudocode:

```
start a stateline-client
send 'hello' message to stateline-client
while working:
    receive a job from stateline-client
    calculate a likelihood
    send the likelihood to stateline-client
send 'goodbye' message to stateline-client
```

### stateline-client

The `stateline-client` binds (in the ZeroMQ sense) to the socket given in its argument. This socket cannot already exist. For example:

```bash
$ ./stateline-client -w ipc:///tmp/my_socket.sock
```

binds the stateline-client to `/tmp/my_socket.sock`. The general form is `ipc://