bob-carpenter / adaptive-hmc

Self-tuning HMC algorithms and evaluations
MIT License

Progressive Turnaround Sampler #5

Open roualdes opened 7 months ago

roualdes commented 7 months ago

The file python/progressive_turnaround.py contains an implementation of progressively sampling a proposal point during trajectory exploration, thus reducing the number of leapfrog steps needed to ensure detailed balance. The idea is similar to, but distinct from, what Stan's HMC does. The details of Stan's HMC, and hence the name of this sampling scheme, come from Michael Betancourt's A Conceptual Introduction to Hamiltonian Monte Carlo[1], specifically Appendix A.3.1. See also Radford Neal's An Improved Acceptance Procedure for the Hybrid Monte Carlo Algorithm.

Let $N$ define the number of steps beyond which further leapfrog steps in the forward direction will bring us closer to $\theta_0$. The progressive sampling proposal distribution selects a point along the trajectory of leapfrog steps $1, \ldots, N$ with probabilities

$$\frac{p(\theta_i, \rho_i)}{\sum_{n \leq N} p(\theta_n, \rho_n)}$$

for $i \in \{1, \ldots, N\}$. Define $M$ similarly, but in the backwards direction and starting from the proposal point $\theta^*$.
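A streaming version of this multinomial selection can be sketched in Python (this is an illustrative sketch, not code from python/progressive_turnaround.py; the function name and interface are hypothetical). At each new leapfrog step, the current proposal is replaced with probability equal to the new point's density over the running density sum, which yields exactly the multinomial probabilities above without storing the whole trajectory:

```python
import numpy as np

rng = np.random.default_rng(0)

def progressive_proposal(log_densities):
    """Select index i with probability p_i / sum_n p_n, progressively.

    log_densities[i] is log p(theta_i, rho_i) at leapfrog step i + 1.
    Replacing the current proposal with probability
    p_new / (running sum) gives the multinomial distribution over
    all steps seen so far (weighted reservoir sampling).
    """
    proposal = 0
    log_sum = log_densities[0]  # running log-sum-exp of densities
    for i in range(1, len(log_densities)):
        log_sum = np.logaddexp(log_sum, log_densities[i])
        # accept the new point as proposal w.p. p_i / sum_{n <= i} p_n
        if np.log(rng.uniform()) < log_densities[i] - log_sum:
            proposal = i
    return proposal, log_sum
```

One can check the marginal probability of selecting step $i$: it is $p_i / S_i$ times the product of the survival probabilities $S_{j-1}/S_j$ for $j > i$, which telescopes to $p_i / S_N$ as required.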

The acceptance criterion of AHMC, as defined in our shared Overleaf file, is in terms of the tuning parameters. I find it more natural to phrase the acceptance criterion for progressive sampling in terms of the state $(\theta, \rho)$. So

$$\frac{p(\theta^*, \rho^*) \cdot p(\epsilon^*, L^* | \theta^*, \rho^*)}{p(\theta_0, \rho_0) \cdot p(\epsilon^*, L^* | \theta_0, \rho_0)}$$

becomes

$$\frac{p(\theta^*, \rho^*) \cdot \frac{p(\theta_0, \rho_0)}{\sum_{m \leq M} p(\theta_m, \rho_m)}}{p(\theta_0, \rho_0) \cdot \frac{p(\theta^*, \rho^*)}{\sum_{n \leq N} p(\theta_n, \rho_n)}}$$

The numerator accounts for $L^*$ steps backwards from $(\theta^*, \rho^*)$, hence $(\theta_0, \rho_0)$. The denominator accounts for $L^*$ steps forwards from $(\theta_0, \rho_0)$.

Simplifying the acceptance criterion gives the result

$$\frac{\sum_{n \leq N} p(\theta_n, \rho_n)}{\sum_{m \leq M} p(\theta_m, \rho_m)}$$

which is the ratio of the sum of the densities along the forwards and backwards trajectories.
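The simplified criterion can be computed stably on the log scale with a log-sum-exp over each trajectory. A minimal sketch (the function names here are hypothetical, not taken from the repo's code):

```python
import numpy as np

def log_accept_ratio(fwd_log_densities, bwd_log_densities):
    """Log acceptance ratio for the progressive proposal:
    log-sum-exp of forward-trajectory log densities minus
    log-sum-exp of backward-trajectory log densities."""
    return (np.logaddexp.reduce(fwd_log_densities)
            - np.logaddexp.reduce(bwd_log_densities))

def accept(rng, fwd_log_densities, bwd_log_densities):
    # Metropolis step: accept with probability min(1, exp(log ratio)).
    return (np.log(rng.uniform())
            < log_accept_ratio(fwd_log_densities, bwd_log_densities))
```

Working on the log scale avoids underflow when the joint densities $p(\theta, \rho)$ are tiny, which is the usual case in high dimension.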

bob-carpenter commented 7 months ago

Cool! Thanks for adding. I've been meaning to try this after @MiloMarsden suggested this approach. I don't think he'd worked it through to the simplified acceptance criterion, which is really nice as we can do that using a log-sum-exp accumulator if necessary.

I was really hoping to do something more like NUTS, which makes a multinomial selection proportional to the weights. This is still making a single proposal and then accepting or rejecting it. My guess is that the acceptance criterion is going to be similar to what we are currently doing, in that it will be approximately $N/M$ when the Hamiltonian is conserved, which I think is the same as for the uniform sampler under perfect Hamiltonian simulation.

roualdes commented 7 months ago

> we can do that using a log-sum-exp accumulator

Correct, and this is what happens in the code.

> like NUTS

It's my understanding, though I would be delighted to be shown otherwise, that my code is as close to what Stan's HMC does as we can get without adding something analogous to Stan's sub-trees. The source of the difference is that Stan uses sub-trees, and we don't currently have an obvious sub-tree.

In the end, the code here and the ideas aren't terribly hard to grasp. What took me the longest in putting this together was understanding Betancourt's Appendix A.3.2 of A Conceptual Introduction to Hamiltonian Monte Carlo.

As I understand it, Stan makes proposals from within sub-trees using multinomial probabilities (defined as I've done here), but samples sub-trees themselves using something Betancourt calls biased probabilities. If a sub-tree is accepted, the point from within the sub-tree is the proposal.

> My guess is that the acceptance criterion is going to be similar to what we are currently doing in that it's going to be approximately N/M assuming the Hamiltonian is maintained, which I think is the same thing as for the uniform sampler with perfect Hamiltonian simulation.

Right, if the Hamiltonian simulation were perfect, then the multinomial probabilities would be uniform. Since they're not perfectly uniform in practice, we use the log-sum-exp of the Hamiltonians along each trajectory to form the probabilities, instead of $N/M$.
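A quick numerical check of this observation (a sketch, not repo code): when the Hamiltonian is exactly conserved, every point on both trajectories has the same density, so the density-sum ratio collapses to the ratio of trajectory lengths.

```python
import numpy as np

N, M = 8, 5
h = -3.7  # arbitrary constant log density along both trajectories
fwd = np.full(N, h)  # forward trajectory, N leapfrog points
bwd = np.full(M, h)  # backward trajectory, M leapfrog points

# log of (sum fwd densities / sum bwd densities) = log(N/M)
log_ratio = np.logaddexp.reduce(fwd) - np.logaddexp.reduce(bwd)
assert np.isclose(log_ratio, np.log(N / M))
```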

bob-carpenter commented 7 months ago

> Stan makes proposals from within sub-trees using multinomial probabilities (defined as I've done here), but samples sub-trees themselves using something Betancourt calls biased probabilities. If a sub-tree is accepted, the point from within the sub-tree is the proposal.

That's right. It's the same in the original algorithm. It shows up in the detail where you select the subtree proportional to its size, which biases toward selecting from later subtrees.

> unless we add in something analogous to Stan's sub-trees

The subtree construction is there to ensure reversibility of the NUTS proposal. I hadn't realized until the recent project that you can get a U-turn going forward from A at B, while the U-turn going backward from B can occur before reaching A.

P.S. I was blown away to see that the same derivation you show above, which was new to me, shows up in a couple of steps starting at equation (6) of https://www.tandfonline.com/doi/pdf/10.1080/10618600.2023.2190784