JuliaStats / Klara.jl

MCMC inference in Julia

restricting the support of parameters #148

Closed tpapp closed 7 years ago

tpapp commented 7 years ago

(Since the documentation does not mention a mailing list, I hope it is OK to ask questions here, if not, please close and redirect me to the appropriate forum).

I am trying to understand how to use Klara from the examples. They seem pretty clear, except for the following: suppose some parameter has restricted support, e.g. a standard deviation σ ≥ 0, or a structural parameter known to satisfy β ∈ [0,1]. BasicContUnvParameter has continuous support on the real line. So should I use a transformed parameter, e.g. β_trans ∈ ℝ, when setting up the model, and then transform it into β, e.g. using the logistic function, when calculating the likelihood? Or are there facilities in Klara to do this automatically?

papamarkou commented 7 years ago

Hi @tpapp, thanks for reaching out. You are right that I haven't added an automatic way of restricting the parameter range. There are several ways to go about it.

One way would be to define the proposal distribution of Metropolis-Hastings to be a truncated normal; see for instance the following example of sampling from a Gamma distribution:

https://github.com/scidom/StatsLearningByExample.jl/blob/master/06-MCMC/06-02-Example-Gamma.ipynb
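
For illustration, here is a minimal sketch of such a proposal built with the Distributions package (the names current_σ, proposal, candidate and the scale 0.5 are placeholders, not Klara API):

using Distributions

# Random-walk proposal for a parameter with support σ ≥ 0: a normal centred
# at the current state, truncated to the positive half-line.
current_σ = 1.5
proposal = Truncated(Normal(current_σ, 0.5), 0.0, Inf)
candidate = rand(proposal)   # proposed value, guaranteed to be nonnegative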

A second way would be to assume that your data follow a log-normal distribution. You would then take the logarithm of the data and use the log-transformed version in the MCMC simulation (thus assuming that the log-transformed data follow a normal distribution), and at the end you would exponentiate the resulting chain.
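
A minimal sketch of this second approach (illustrative only, not Klara-specific code):

y = rand(100) .+ 1.0     # some positive-valued observations (placeholder data)
log_y = log.(y)          # use the log-transformed data in the MCMC simulation
# ... run the sampler on log_y, assuming normality ...
# exp.(chain)            # exponentiate the resulting chain at the end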

Both of the above suggestions entail restrictions: the former is implemented for Metropolis-Hastings, while the latter assumes a normal prior. The third and more generic suggestion is the one you made. If your parameter β ∈ [0,1], then define the logit-transformed variable

c=logit(β)=log(β/(1-β)),

run your MCMC simulations using c, and at the end take the inverse logit transform (logistic function) of the simulated chain, that is compute

β = logit⁻¹(c) = 1/(1+exp(-c)) = exp(c)/(exp(c)+1).
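
In code, the reparameterisation and back-transformation look like this (a minimal sketch with illustrative names; c_chain and β_chain are not Klara objects):

logit(β) = log(β/(1 - β))         # maps (0,1) to the real line
invlogit(c) = 1/(1 + exp(-c))     # the logistic function, mapping back to (0,1)

c_init = logit(0.4)               # initial value on the unconstrained scale
# ... run the MCMC simulation for c ...
# β_chain = invlogit.(c_chain)    # transform the simulated chain back to (0,1)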

tpapp commented 7 years ago

@scidom, thanks for the quick reply. I got my example working with MH, but I am having trouble with NUTS, specifically with autodiff. Here is an MWE:

using Klara # latest master
# fitting mean of a normal using sufficient statistics
y = randn(100)*2                # μ=0 (known), σ=2 (will be estimated)

function σ_log_logtarget(σ_log::Real, v::Vector)
    (ss,N) = v
    σ = exp(σ_log)
    -ss/(2*σ^2)-N*σ_log
end
σ = BasicContUnvParameter(:σ_log, logtarget = σ_log_logtarget, nkeys=2,
                          autodiff = :forward)
model = likelihood_model([Constant(:ss), Constant(:N), σ]; isindexed = false)
inits = Dict(:σ_log => 1.0, :ss => sum(y^2 for y in y), :N => length(y))
job = BasicMCJob(model, NUTS(), BasicMCRange(nsteps=2000, burnin=1000), inits)

which fails with

ERROR: MethodError: no method matching derivative(::DiffBase.DiffResult{1,Float64,Tuple{Float64}}, ::Klara.###internal_forward_autodiff_closure#279, ::Float64)      

I don't understand the following:

  1. In a logtarget, what is the role of the second parameter? Do I use it to pass data, or is it assembled from the other parameters somehow?
  2. What is nkeys, and how should I use it?

Finally, an unrelated question: how to extract actual samples from the posterior?

papamarkou commented 7 years ago

Hi @tpapp, I will check this out when I am back at my computer, if not tonight then tomorrow.

papamarkou commented 7 years ago

I see what happens, @tpapp. The error is not specific to your example or to NUTS; it arises from a breaking interface change in the ForwardDiff package. I will try to upgrade Klara to catch up with the ForwardDiff changes before the end of the weekend.

papamarkou commented 7 years ago

I pushed the changes to Klara's master, @tpapp. If you Pkg.checkout("Klara"), things should work now. Moreover, use the stable Julia release (v0.5.0), not the bleeding-edge nightly version, because the latter has some pending type inference regressions that can affect Klara.

Here is the running version of what you are trying to do:

using Klara

y = randn(100)*2 # μ=0 (known), σ=2 (will be estimated)

σ_log_logtarget(σ_log::Real, v::Vector) = -v[1]/(2*exp(σ_log))-v[2]*σ_log

σ = BasicContUnvParameter(:σ_log, logtarget=σ_log_logtarget, nkeys=3, autodiff=:forward)

model = likelihood_model([Constant(:ss), Constant(:N), σ]; isindexed=false)

inits = Dict(:σ_log => 1.0, :ss => sum(y^2 for y in y), :N => length(y))

job = BasicMCJob(model, NUTS(), BasicMCRange(nsteps=2000, burnin=1000), inits)

run(job)

chain = output(job)

mean(chain)

chain.value

You are right, the second argument is used to pass extra arguments, which can include data, other parameters, or hyperparameters. For coherence, this second vector contains all the variables, including the one from the first argument. As background, one might wonder why there isn't just a single vector input argument (i.e. why not keep only the second vector argument to the target); the reason relates to compatibility with the closure that ForwardDiff requires as input. Without going into low-level details, all you need to do is ensure that this second input argument to your logtarget is a vector whose elements appear in the same order as the components in the likelihood definition, i.e. likelihood_model([Constant(:ss), Constant(:N), σ]; isindexed = false).

nkeys is just the total number of variables involved, including data, parameters, constants (in your case this is 3).
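
To spell out the correspondence for the example above (the comments are just an annotated restatement of the definition given earlier):

# With likelihood_model([Constant(:ss), Constant(:N), σ]; isindexed=false) and nkeys=3,
# the second argument v follows the same order:
#   v[1] holds the state of Constant(:ss)
#   v[2] holds the state of Constant(:N)
#   v[3] holds the state of :σ_log itself (the same value as the first argument)
σ_log_logtarget(σ_log::Real, v::Vector) = -v[1]/(2*exp(σ_log)) - v[2]*σ_log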

papamarkou commented 7 years ago

P.S. Once you tell me that you are happy with the above and it runs, I will close this issue.

tpapp commented 7 years ago

@scidom: Thanks for the fix and the explanation, I really appreciate your help. The code above is missing the square in the denominator and the Jacobian correction, so for future readers, I post the MWE below. I am closing the issue and will ask other questions in new issues.

using Klara

y = randn(100)*2 # μ=0 (known), σ=2 (will be estimated)

function σ_log_logtarget(σ_log::Real, v::Vector)
    (ss, N) = v
    -ss/(2*exp(2*σ_log))-N*σ_log+σ_log
end
σ = BasicContUnvParameter(:σ_log, logtarget=σ_log_logtarget, nkeys=3,
                          autodiff=:forward)
model = likelihood_model([Constant(:ss), Constant(:N), σ]; isindexed=false)
inits = Dict(:σ_log => 1.0, :ss => sum(y^2 for y in y), :N => length(y))
job = BasicMCJob(model, NUTS(), BasicMCRange(nsteps=2000, burnin=1000), inits)

run(job)
chain = output(job)
σ_posterior = exp.(chain.value)
mean(σ_posterior)
std(σ_posterior)
ess(chain)

papamarkou commented 7 years ago

You're welcome @tpapp, and thank you for providing the fully functional example. Your code works perfectly fine; I will make only one conceptual remark. Since v is a vector holding 3 elements, namely the states for :ss, :N and :σ_log (in that order, which is the same order in which they appear in the input argument [Constant(:ss), Constant(:N), σ] of likelihood_model()), it would make sense to define (ss, N) = v[1:2]. This is only a clarification; your statement (ss, N) = v is legitimate Julia syntax and achieves the exact same result.

tpapp commented 7 years ago

@scidom: Thanks for the correction. I have to admit that I am confused by the design decisions for v, and why it holds both the parameter and the data (possibly because they are all nodes?). In any case, I found that I can use closures to pass in data, which lets me use constructs that don't map well to a Vector, such as composite types and Dicts. Here is an MWE using a Dict:

using Klara

srand(42)
true_θ = 0.3
y = rand(Float64, 100) .≤ true_θ
counts = Dict(z => count(x -> x == z, y) for z in [0,1])

"Log posterior in terms of θ_raw ∈ ℝ, which is mapped to θ ∈ (0,1)."
function θ_raw_logtarget(θ_raw, counts)
    z = exp(-θ_raw)
    θ = 1/(1+z)
    jac = log(θ*z/(1+z))   # log-Jacobian of the logistic map, dθ/dθ_raw = θ(1-θ)
    counts[0]*log(1-θ) + counts[1]*log(θ) + jac
end

θ = BasicContUnvParameter(:θ_raw,
                          logtarget=θ_raw->θ_raw_logtarget(θ_raw, counts),
                          autodiff=:forward)
model = likelihood_model([θ]; isindexed=false)
inits = Dict(:θ_raw => 0.0)
job = BasicMCJob(model, NUTS(), BasicMCRange(nsteps=2000, burnin=1000), inits)

run(job)
chain = output(job)
ess(chain)
θ_posterior = logistic.(chain.value)
mean(θ_posterior)

Of course I could do this using a Vector, but in my actual problem I am counting paths for a hidden Markov chain, so a Dict is more natural. I find mapping things to and from v a bit error-prone.

papamarkou commented 7 years ago

@tpapp, exactly, the motivation is that all variable types, including parameters, constants and data, are regarded and treated as nodes of a graphical model. The basic rule for defining your second input argument v is that it always has the same order of elements as the one used when defining likelihood_model(); this is the only rule that governs the definition of v, regardless of how many parameters your model has.

If you would like to know a bit about the motivation: ForwardDiff does not differentiate with respect to an element of a vector, which is why I can't just pass v. Instead, I always pass the argument to be (possibly) differentiated first, which is σ_log in your example, and then use a second input argument v that follows the order of components in the likelihood_model() definition.

It is true that you can simply define a closure that reads some of its arguments from the outer scope. This works fine in an example like the one you provided, and you won't necessarily need to define a second argument to your log-target. As the graphical model scales, however, relying on closures becomes less feasible.

I initially thought of using a dictionary as the second argument v instead of a vector. However, for moderate to large models (i.e. as the dictionary grows), the performance penalty becomes an issue. I chose to favor performance at the design stage, considering that matching the order of elements in v to the order of components passed to likelihood_model() is not a complicated rule; I agree that it requires some attention to ensure the order is preserved, but it provides a runtime benefit.
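
For a rough sense of the trade-off, here is an illustrative micro-benchmark (assuming the BenchmarkTools package; it is not taken from Klara's internals):

using BenchmarkTools

v = collect(1.0:100.0)                                # states stored in a vector
d = Dict(Symbol("x$i") => Float64(i) for i in 1:100)  # the same states keyed by symbol

@btime $v[50]     # positional lookup: a single array access
@btime $d[:x50]   # keyed lookup: hashing plus a table probe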

I am not entirely sure what your setting is; it might be possible to define a vector of chains to hold your multiple chains. I agree with you that R-hat is an important and well-established diagnostic and I am planning to add it in the very near future; thanks for bringing it up.

As for integration between Klara and Mamba, there hasn't been any work on it. It is a good idea, but it might only happen at a later stage, after higher-priority items on the action list have been completed (by either of the involved parties).