JuliaStats / Klara.jl

MCMC inference in Julia

how to deal with transformed variables (off-scale failure during burn-in) #150

Closed: tpapp closed this issue 7 years ago

tpapp commented 7 years ago

I have a vector of parameters, some of which are constrained to be ≥0, some ∈[0,1]. I represent the vector in ℝ^n, and transform elements with exp and logistic.

The problem is that when I use the NUTS() sampler, it tries values around, say, 55, which are then transformed to exp(55)≈1e24. I am testing estimation using simulated data, and the true value should be around exp(-2.3)≈0.1 (transformed). Since my code is not prepared to deal with such extreme magnitudes, it fails. The same happens with the logistic-transformed values: my model is not really defined at logistic(Inf), but for practical purposes logistic(40) already gets you there.
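
To illustrate the scales involved, here is a toy version of the transform (a minimal sketch, not my actual code):

logistic(x) = 1 / (1 + exp(-x))

# θ[1] is mapped to [0, ∞) via exp, θ[2] to [0, 1] via logistic
transform(θ) = (exp(θ[1]), logistic(θ[2]))

transform([-2.3, 0.0])   # ≈ (0.1, 0.5), the neighbourhood of the true values
transform([55.0, 40.0])  # ≈ (7.7e23, 1.0), magnitudes my code cannot handle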

I am wondering what the standard technique is for telling NUTS to behave during burn-in and not take such extreme steps. I added priors (encoded in the same function as the likelihood; should I do it differently?), but that did not change anything.
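
Schematically, the log target I pass to the sampler looks something like this (heavily simplified, with a toy Gaussian likelihood and made-up data standing in for my actual model):

using Distributions

y = 0.1 .* randn(100)                       # simulated data, true σ ≈ 0.1

function logtarget(θ)
    σ = exp(θ[1])                           # back-transform to the constrained scale
    logprior = logpdf(Normal(0, 1), θ[1])   # prior folded into the same function
    loglik = sum(logpdf.(Normal(0, σ), y))
    return logprior + loglik
end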

If relevant, I can try to whip up a self-contained example, but it would be 200+ LOC.

papamarkou commented 7 years ago

Two versions of NUTS are implemented in the package, one without adaptation and one with dual averaging (corresponding to algorithms 3 and 6 of the NUTS paper, if I remember correctly). This is one example without adaptation and this is one with dual averaging. The main difference in the interface is that in the latter case you use the DualAveragingMCTuner() tuner, whose main constructor can be found here; its main input argument represents the "target rate". The wealth of other input arguments of the constructor comes from the paper, and I have tried to use notation that associates the tuner fields with the concepts in the paper.
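
A rough sketch of the interface difference (assuming model, inits and outopts are already set up, and that the default tuner is used when none is passed; see the two linked examples for the exact code):

# Without adaptation: construct the job with NUTS() and no explicit tuner.
job = BasicMCJob(model, NUTS(), BasicMCRange(nsteps=2000, burnin=1000), inits,
                 outopts = outopts)

# With dual averaging: pass DualAveragingMCTuner; its first argument is the
# target rate (0.651 here), and the other arguments follow the NUTS paper.
job = BasicMCJob(model, NUTS(), BasicMCRange(nsteps=2000, burnin=1000), inits,
                 tuner = DualAveragingMCTuner(0.651, 1000),
                 outopts = outopts)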

There are various options for tweaking NUTS in the implementation, which can be found here. The two main fields you can adjust to see if they improve your situation are maxδ and maxndoublings (corresponding to the doubling threshold and the upper bound on the number of doublings; see the paper for more background on these concepts).
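
For instance, if those fields are exposed as keyword arguments of the NUTS constructor, something along these lines (this is only a guess at the call form; check the linked implementation for the exact signature and defaults):

sampler = NUTS(maxδ = 1000., maxndoublings = 5)   # hypothetical keyword call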

At some point I need to add more examples or documentation on NUTS. This will have to wait a bit longer, as I want to add the DRAM sampler, some novel samplers that are being developed by a collaborator, and some population MCMC. Unfortunately, and despite my willingness to help with specific examples or problems, I don't have the time to look through your specific example, at least not soon. You may post it here if you wish, but I can't make any realistic promises about whether I will be able to look at it in the near future, as this sits on the edge between the statistical side of your problem and a programming issue.

tpapp commented 7 years ago

Thanks for the hints. Coming from Stan, I expected that the default would be dual averaging, and I did not know it had to be enabled explicitly. I don't need to tweak anything in NUTS(); in fact, the following works fine:

using Klara

# model and inits are defined elsewhere (the full setup is 200+ LOC)
outopts = Dict{Symbol, Any}(:monitor => [:value, :logtarget, :gradlogtarget],
                            :diagnostics => [:accept, :ndoublings, :a, :na])
job = BasicMCJob(model,
                 NUTS(),
                 BasicMCRange(nsteps=2000, burnin=1000),  # 2000 steps, first 1000 discarded as burn-in
                 inits,
                 tuner = DualAveragingMCTuner(0.651, 1000),  # dual averaging with target rate 0.651
                 outopts = outopts)

and produces nice convergence.

papamarkou commented 7 years ago

Happy to hear this, @tpapp.