bob-carpenter / adaptive-hmc

Self-tuning HMC algorithms and evaluations
MIT License
18 stars 0 forks source link

wrong acceptance probability in u_turn_sampler.py? #2

Closed roualdes closed 7 months ago

roualdes commented 7 months ago

While struggling to develop further on this algorithm, it strikes me that most of the second order estimates for the standard normal model from python/experiments.py are fairly regularly above their known value of 1.

I think the issue is in the acceptance probability, specifically the variable named logp_tune_prop.

Let's call the number of steps in the forward direction before a u-turn $N$. From $N$, we then sample some number of leapfrog steps to define our proposal point, $L \sim U(1, N)$ (inclusive). Starting the uniform distribution at 1 ensures the initial point is excluded from the potential proposal points. The term logp_tune is then $p(L | N) = -\log N$.

From the proposed point we then determine $B$, the number of steps in the backward direction before a u-turn. The acceptance probability should include the probability of $L$ under the density function that depends on $B$.

Right now we have $p(L|B) = -\log B$, but I think this should instead be $p(L | B) = -\log{(B + 1)}$ so as to include the proposal point. Otherwise, aren't the probabilities not balancing correctly?

Here's an attempt at a plot, where $N = 6, L = 4$, and $B = 7$. Effectively, I'm arguing that we should (do the equivalent of) put an orange dot under $x_4$.

acceptance_prob

bob-carpenter commented 7 months ago

Working through your diagram.

Forward: N = 6 is the maximum number of steps forward from x[0] that can be taken without making a U-turn. So we select L ~ uniform(1, 6) steps and get L = 4. The probability of selecting L is 1 / 6.

That's the blue dots x[1] to x[6] with x[4] circled.

Reverse: B = 7 is the max number of steps backward from x[4] that can be taken without makng a U-turn. The probability of selecting the initial point is thus 1 / B = 1 / 7.

That's the orange dots from x[3] backward to x[-3].

I think it's right the way you have it plotted. Specifically, I don't think the point x[4] can be included as a possible proposal point going backward unless x[0] is included going forward. I do think it's be OK to include them both.

roualdes commented 7 months ago

The symmetry argument makes sense, agreed. Something still seems off. Here's what I'm seeing and don't like.

If, on the latest main branch, I run python/experiments.py's test_model(...) function with the std_normal model, with nearly any seed[1], for one million iterations, and print out variances with ddof = 1[2], I get all 20 variance estimates above 1, and specifically near 1.02. I expect roughly half, ~10, to be above 1 and half to be below 1.

Would you mind trying to replicate this? And/or helping me see why my thinking is wrong?

bob-carpenter commented 7 months ago

At various points and configurations of things like including end point or starting point or not and adding 1 or not to the probs, I've either found bias either over- or under-estimating these things. But it seems like it's reversible under this argument. This is why I never trust my own theoretical arguments.

roualdes commented 7 months ago

I'm not denying your experiences, but I personally haven't yet seen underestimates of any moments that didn't end being caused by an eventually obvious error in my code.

This is why I never trust my own theoretical arguments.

Right. And empirically, with my proposed change, second order estimates are much more consistent with the known value of 1 than are second order estimates without my proposed change. As far as I can tell, my proposed change even improves the second order estimates from Nawaf's original code.

Theoretically, despite some issues I have with my plot, I'm stuck on one point. Shouldn't the support of the distribution defined in the reverse direction contain the proposal point? Otherwise, detailed balance couldn't possibly hold, right?

Will you try a few experiments with the standard normal model, if I push a branch containing my proposed changes?

bob-carpenter commented 7 months ago

Sorry---didn't mean to imply that I thought what was there now was working or that I ever had anything that was working. It's just at various points I wound up getting all my estimates bouncing around the true values and then I'd change something for some reason or another and see the bias again. I very likely was just getting lucky when I saw something that balanced.

Shouldn't the support of the distribution defined in the reverse direction contain the proposal point?

If you start at x0 and propose x*, then I think you only need to evaluate q(x* | x0) and balance with q(x0 | x*) for the Hastings condition going the other way, where q is the proposal distribution.

And yes, of course I'm happy to try things. I really want to understand what's going wrong here because the theoretical argument looks right to me.

roualdes commented 7 months ago

Branch alt-accept-prob has an added file python/alt_u_turn_sampler.py and a slightly changed python/experiments.py. Running

$ python experiments.py

will now print results for Stan, UTurnSampler, and AltUTurnSampler for a randomly selected seed and one million iterations.

Let me know what you find/think.

roualdes commented 7 months ago

If you start at x0 and propose x, then I think you only need to evaluate q(x | x0) and balance with q(x0 | x*) for the Hastings condition going the other way, where q is the proposal distribution.

I don't think we can force an exact Hastings condition on this algorithm. We don't have such terms as q(x | x0) and q(x0 | x). There is something analogous to a Hastings condition here, but it's not quite the same. For one thing, the distribution going forward is different than the distribution going backward.

Above, we have written the distribution going forward as a density function $p(L | N)$, representing the Uniform distribution on the integers from 1 to N. Going backward we have $p(L | B)$, representing a different Uniform distribution. Not only is $B$ not necessarily equal to $N$, but the support for $p(L | B)$ is different from the support for $p(L | N)$.

I believe the real question at hand here is, what is the support of $p(L | B)$? I'm proposing $p(L | B)$ be defined on $\{L, L - 1, \ldots, L - B\}$ for a total of $B + 1$ elements. The plot above has the support defined on $\{L - 1, L - 2, \ldots, L - B\}$ with $B$ elements.

Our code does not consider the support of $p(L | B)$.. So $p(L | B)$ can be evaluated on the support consisting of $B$ elements $\{L, \ldots, L - B - 1\}$ without fuss. Mathematically though, we fail to target the posterior distribution of interest.

Imagine writing the code as if we were very careful about the support. Then $p(L | B)$ defined on $\{L - 1, \ldots, L - B\}$ with $B$ elements, as in the plot above, would always return negative infinity (on the log-scale) and we'd never accept any proposals.

bob-carpenter commented 7 months ago

I was thinking of q(x | x) being the uniform distribution defined by the U-turn condition. What is the problem with q(x | x) having different support? Can't we just reject if the starting point x is not in support of p( . | x*)? In the simple case where you don't include the U-turn point, you'll always be able to get back to the starting point with non-zero probability because of how the U-turns are defined.

In HMC, the forward and reverse proposals have support only at a single point, which is different in both directions. It's just exactly reversible, so it doesn't need the correction.

Can't you make a discrete sampler on the integers targeting, for example, a Poisson, by taking the proposal distribution to be the previous proposal, the previous proposal + 1, and the previous proposal - 1 if in support. That's not the same support each way unless you choose +0. Doesn't that define a proper sampler for the Poisson? And can't you do the same thing in a continuous space by taking q(x | x) = uniform(x | x - epsilon, x + epsilon)?

roualdes commented 7 months ago

I was thinking of q(x* | x) being the uniform distribution defined by the U-turn condition.

So there is a distribution associated with the move from x0 to x, but this doesn't factor into the acceptance probability for this algorithm like the terms q(x | x0) and q(x0 | x) do for MH algorithms. Instead we have p(L | N) and p(L | B), which are defined over integer values, but explicitly not defined over the points themselves $x_0, x_L = x^, x_N, x_B$.

I believe the acceptance probability for this algorithm is $\pi(x_L) p(L | B + 1) / \pi(x_0) p(L | N)$.

All of the examples you ask about are valid MH algorithms, as far as I understand them. The point there is that both q(x | x0) > 0 and q(x0 | x) > 0. But there's two issues. 1. We don't really have such terms in this algorithm. 2. If we were to make analogies, within this algorithm if we define $p(L | B)$ on the support $\{L - 1, \ldots, L - B\}$, then $p(L | B) = 0$ (or negative infinity on the log scale). Which is the source of what I believe is making our implementation target the wrong posterior distribution. Effectively, our code should never be accepting any proposal. We just happen to be accepting a bunch of proposals because our code isn't insisting on proper support for our uniform distributions.

bob-carpenter commented 7 months ago

We don't really have such terms in this algorithm. 2. If we were to make analogies, within this algorithm if we define on the support , then (or negative infinity on the log scale).

We can recast the choice of number of steps as a discrete distribution over states. It would be like if we sampled the number of steps L from a uniform distribution in HMC each time, which is valid. It's not a distribution over states directly, but it implies a distribution over states.

I don't see how the support matters. Doesn't the usual case of HMC not have matching support?

bob-carpenter commented 7 months ago

I found the general framework I had set up too complicated to debug, so I just went and implemented from scratch in a single file and it appears to be getting the right answer.

Experiment code: python/new-experiments.py Sampler code: python/turnaround.py

I haven't done formal tests, but here are three runs for a 10-dimensional standard normal.

~/github/bob-carpenter/adaptive-hmc/python (main)$ python3 new-experiments.py
MEAN(param): [ 0.02465748  0.02900559  0.03131233 -0.01678254 -0.01121116 -0.00055195
  0.00816054 -0.06461972 -0.0228056   0.01599141]
MEAN(param^2): [0.84559339 1.04583494 0.92729801 0.93122338 0.90756452 1.04996907
 1.10863685 1.18990176 0.97270069 1.04521289]

~/github/bob-carpenter/adaptive-hmc/python (main)$ python3 new-experiments.py
MEAN(param): [ 0.00458139  0.02088672  0.00860555 -0.01833015  0.02628173  0.01735447
  0.01239474 -0.04963769 -0.0229304   0.01798576]
MEAN(param^2): [0.93566667 1.00658177 1.00119412 0.95120016 0.99224442 0.98281985
 1.08489558 1.06878505 1.00313312 1.03101504]

~/github/bob-carpenter/adaptive-hmc/python (main)$ python3 new-experiments.py
MEAN(param): [-0.00608473  0.01769764  0.02390478  0.01323845  0.00959932  0.00523708
 -0.00475059  0.00253894 -0.02707755 -0.00359373]
MEAN(param^2): [1.02689107 1.01103225 1.01075078 0.98642826 1.00073789 1.01507365
 1.0148896  1.02017467 0.98384247 0.99617342]
roualdes commented 7 months ago

Hey Bob, I do appreciate your patience and willingness to look into this with me. This is an annoying issue, especially since I can't seem to make my math satisfactorily match what I feel runs right in code.

Right off the bat I ran python new-experiments.py with num_draws = 1_000_000 and got ValueError: low >= high. I think this error is happening because we are trying to draw $L$ uniformly from integers in [1, N-1], but numpy's integers function excludes the right boundary so integers(1, N - 1) samples integers from [1, N - 2].

With just the changes above, integers(1, N) and num_draws = 1_000_000, I get mean(param^2) values near 1.09 across all dimensions. I think this is because logp_out and logp_back are backwards. After changing the acceptance probability to have logp_back - logp_out, I'm back to getting mean(param^2) value near 1.02, which is effectively where our old code (and Nawaf's original code) was, before this issue.

I continue to be much happier with the mean(param^2) values if I make the change the value of logp_back to -np.log(M) instead of -np.log(M - 1).

It's on my short todo list to run this algorithm with other models.

bob-carpenter commented 7 months ago

I do appreciate your patience and willingness to look into this with me.

I look at this as you doing me a favor. Always say 'yes' and listen to people who volunteer to do code review!

You're right that I wanted uniform(1, N) rather than uniform(1, N - 1). I keep forgetting that Python excludes the endpoint. I blame too much R! Then, if we are uniformly sampling among N - 1 objects, the transition probabilities should be np.log(N - 1). You were also right that I had flipped the Hastings part of the Metropolis-Hastings condition, which I don't think I can blame on Python.

You are also right that fixing all of these things isn't enough.

I was making the assumption that if you didn't U-turn along a path in one direction you wouldn't U-turn in the other direction. But that's not correct given how I was doing U-turns. Here's an example in 2D:

a. (0, 0)
b. (2, 0)
c. (2, 2)
e. (3, 0)
f. (3, 2)

Outward from a -> b -> c -> e -> f, there's no u-turn: d(a, b) < d(a, c) < d(a, e) < d(a, f)

Backward there's a u-turn at f -> e -> c, because d(f, e) > d(f, c).

So I added a statement to just reject when this happens.

With all of those changes, I get this with step size 0.5:

MEAN(param^2): [0.97994488 0.99946926 0.99930483 0.97484839 0.96989247 1.00271519
 0.97138895 0.99306131 0.98738386 0.99211304]

and this with step size 0.9:

MEAN(param^2): [1.05427756 1.01626507 1.03097905 1.0507222  1.03230128 1.0507011
 1.06055243 1.06892167 1.03898858 1.06376995]

So clearly still not right. I was just getting lucky before with a step size that balanced things out. I pushed the latest version I tried in case you can see anything wrong with it.

The problem now is that the proof of detailed balance looks good to me, but it's still not working.

bob-carpenter commented 7 months ago

Nawaf Bou-Rabee and Milo Marsden gave me some serious tutoring and code review and now everything's working as far as I can tell. I wasn't treating this as a proper conditional and I wasn't considering the momentum (d'oh!). It's still a little shy of NUTS performance in terms of expected square jump distance, partly due to straight-out rejections.

roualdes commented 7 months ago

Nawaf Bou-Rabee and Miles Marsden gave me some serious tutoring and code review

Nice. Wish I could have been there for that.

everything's working as far as I can tell

Agreed.

I wasn't treating this as a proper conditional

This part is clear from the updated code and makes sense.

I wasn't considering the momentum (d'oh!)

Would you help me see where/how momentum considerations are different from before?

bob-carpenter commented 7 months ago

Would you help me see where/how momentum considerations are different from before?

There was a bug in my code where I was only using the log density of the parameters and was forgetting the momentum part of the Hamiltonian. Hence the "D'oh!"