JakobRobnik / MicroCanonicalHMC

MCHMC: sampler from an arbitrary differentiable distribution
GNU General Public License v3.0

poor convergence without tempering #42

Open samueldmcdermott opened 10 months ago

samueldmcdermott commented 10 months ago

The final section of this notebook shows a test example of poor convergence for mclmc without tempering. It plots posterior samples from 5 different chains:

  1. results from dynesty nested sampling (no initialization needed)
  2. results from NUTS/HMC as implemented in numpyro, initialized from the final state of the dynesty samples
  3. results from mclmc initialized from the final state of the dynesty samples, denoted mchmcd
  4. results from mclmc initialized from the final state of the numpyro samples, denoted mchmcn
  5. results from mclmc initialized with all parameters equal to 0 (essentially a random point), denoted mchmc0

There are a few things to note:

  1. I'm working in a constrained parameter space and I'd like to have uniform priors on some range. I found that NUTS/HMC was better behaved with "hard" priors, where the log-probability goes to -inf outside of a given range, but mclmc gave NaNs with this setup, so I gave it "smooth" priors instead. This leads to some slight disagreements for parameters that are prior dominated, as seen in the output of cell 35 in the notebook. (If I have time, I'd like to make sure that there are no prior-dominated parameters, but I haven't gotten to this yet. In the meantime, the hard vs. soft implementation of the priors is why there are discrepancies in a few posteriors.)
  2. The main takeaway from that plot is that the mchmcd and mchmcn posteriors are very similar, whereas the mchmc0 posteriors (which are initialized from a bad point) are quite different and spend a lot of time far away from the other posteriors, even though I throw out half of the samples as burn-in.
  3. In the following cell you can see that the mchmcd and mchmcn chains explore rather similar log-probability values, differing by DLnP = 15, but the mchmc0 maximum log probability is lower by about 250 (or Dchi^2 = +500 if 2*DLnP is chi^2 distributed). The numpyro and dynesty results are even lower than that in LnP, but because of the different priors I don't think that's a fair comparison.
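
For anyone who wants to reproduce the hard vs. smooth prior distinction from item 1 without opening the notebook, here is a minimal sketch. It is not the code from the notebook: the function names, the sigmoid-edged form of the smooth prior, and the `width` parameter are all assumptions made for illustration. It only shows why a hard box returns -inf outside the range while a soft-edged box stays finite (and differentiable) everywhere.

```python
import math


def hard_uniform_logprior(x, lo, hi):
    """Log-density of a uniform prior on [lo, hi]; -inf outside the range."""
    if lo <= x <= hi:
        return -math.log(hi - lo)
    return -math.inf


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def smooth_uniform_logprior(x, lo, hi, width=0.01):
    """Unnormalized soft-edged box (illustrative assumption, not the notebook's prior).

    Close to 0 well inside [lo, hi]; outside, it decreases roughly linearly
    with distance from the nearest edge, so it is finite everywhere.
    `width` is a hypothetical knob that sets how sharp the edges are.
    """
    return log_sigmoid((x - lo) / width) + log_sigmoid((hi - x) / width)


if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0):
        print(x, hard_uniform_logprior(x, 0.0, 1.0), smooth_uniform_logprior(x, 0.0, 1.0))
```

Because the soft edges let a small amount of prior mass sit outside the nominal range, a prior-dominated parameter can end up with a slightly different posterior than it would under the hard box, which is consistent with the small discrepancies described in item 1.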

This doesn't seem to be a bug to me, but @reubenharry suggested I submit an issue since this demonstrates a difference in performance between tempered and non-tempered results. I'm happy to run the mchmc0 chains with different specifications and different approaches to annealing/tempering if it's useful; just let me know.