TuringLang / Turing.jl

Bayesian inference with probabilistic programming.
https://turinglang.org
MIT License

Apply AD gradient if `optimizer` is a first-order one #1365

Closed wupeifan closed 4 years ago

wupeifan commented 4 years ago

When estimating the MLE and MAP, Turing calls the routines from Optim.jl. However, even for gradient-based methods, Optim.jl only supports ForwardDiff as an AD backend and falls back to finite differences otherwise. Therefore, it makes sense to use the AD backend selected in Turing (Turing.setadbackend) to define the gradient function and feed it to the optimizer.
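For context, selecting Turing's AD backend looks roughly like this (a minimal sketch; the exact set of available backends depends on the Turing version in use):

using Turing

# Tell Turing which AD backend to use for gradients of the log joint.
# Other options at the time included :tracker, :zygote, and :reversediff.
Turing.setadbackend(:forwarddiff)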

Basically, we replace https://github.com/TuringLang/Turing.jl/blob/master/src/modes/ModeEstimation.jl#L383 with a structure like

if optimizer isa Optim.FirstOrderOptimizer
    # pass an explicit, AD-based gradient to Optim
    ...
else
    # keep the current value-only objective
    ...
end

In the first branch, you will need to pass the original function and another that computes the function value and gradient based on Turing.gradient_logp. An example is https://github.com/JuliaNLSolvers/Optim.jl/blob/7b660484724755ee2b306dd3ceed13d6633067ae/test/multivariate/optimize/interface.jl#L54
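For illustration, here is a minimal, self-contained sketch of Optim's fg! interface with a toy objective; the analytic gradient stands in for whatever the AD backend (e.g. Turing.gradient_logp) would compute:

using Optim

# Toy objective f(x) = sum(x.^2); the gradient here is analytic,
# standing in for an AD-computed gradient of the log density.
function fg!(F, G, x)
    if G !== nothing
        G .= 2 .* x          # write the gradient into G in place
    end
    if F !== nothing
        return sum(abs2, x)  # return the objective value
    end
    return nothing
end

x0 = randn(3)
result = Optim.optimize(Optim.only_fg!(fg!), x0, Optim.LBFGS())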

pkofod commented 4 years ago

Let me know if you need help here. Is it possible to calculate the gradient and get the value simultaneously with your setup? That would help performance.

wupeifan commented 4 years ago

I think in this setting the gradient comes from Turing.gradient_logp, so you can wrap the value and the gradient together. @mohamed82008 also pointed out https://github.com/JuliaNLSolvers/NLSolversBase.jl/blob/9e89526817d5932489d8fdfead5f6e537c1291a1/src/objective_types/oncedifferentiable.jl#L172 in the Slack discussion thread.

devmotion commented 4 years ago

I guess the branch is not needed? It seems one could just define

function (f::OptimLogDensity)(F, G, H, x)
    if G !== nothing
        ...  # write the gradient at x into G
    end
    if H !== nothing
        ...  # write the Hessian at x into H
    end
    if F !== nothing
        return ...  # return the objective value at x
    end
    nothing
end

and then call Optim.optimize(Optim.only_fgh!(f), ...), similar to https://github.com/JuliaNLSolvers/Optim.jl/blob/7b660484724755ee2b306dd3ceed13d6633067ae/test/multivariate/optimize/interface.jl#L73-L78.
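As a self-contained sketch of that (F, G, H, x) pattern (a toy quadratic here, not OptimLogDensity, with analytic rather than AD-based derivatives):

using Optim
using LinearAlgebra

# Toy quadratic f(x) = 0.5 * x'x, so the gradient is x and the Hessian is I.
function fgh!(F, G, H, x)
    if G !== nothing
        G .= x                     # gradient in place
    end
    if H !== nothing
        H .= Matrix{eltype(x)}(I, length(x), length(x))  # Hessian in place
    end
    if F !== nothing
        return 0.5 * sum(abs2, x)  # objective value
    end
    return nothing
end

x0 = randn(3)
result = Optim.optimize(Optim.only_fgh!(fgh!), x0, Optim.Newton())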

wupeifan commented 4 years ago

Yeah, probably. Thanks! I'm swamped with more urgent stuff so I might not be able to make a PR myself for this one... Maybe wait for Cameron after he finishes his qualifier, or maybe I'll revisit this in a couple of days...

cpfiffer commented 4 years ago

@mohamed82008 what would be a quick way to get the Hessian as well? I'm switching things up a little to use gradient_logp to get the gradient too, but I'm wondering if I should add a method like hessian_logp. Anyone have thoughts on that?

wupeifan commented 4 years ago

Hessians are usually computationally expensive, which is why plain Newton's method is often not preferred. I don't think Hessians are strictly necessary for optimization per se, but maybe other people have different ideas.

That said, I think it would be great if the information matrix could be provided...

cpfiffer commented 4 years ago

Well, if we're going to use the Optim.only_fgh! method (or the non-Hessian variant Optim.only_fg!) we should probably consider extending support for Hessians in an efficient way in case someone wants to use Newton or the other Hessian-based methods.

We've got support already for the information matrix, but I'm not sure if it's finite-difference based or not (I think it is) -- you can do it with

using StatsBase

m = optimize(model, MLE())
StatsBase.informationmatrix(m)

wupeifan commented 4 years ago

> in case someone wants to use Newton or the other Hessian-based methods.

For small scale problems yeah. I don't know, maybe other people have better ideas.

mohamed82008 commented 4 years ago

> I'm wondering if I should add a method like hessian_logp.

I think that's reasonable, but perhaps as a separate PR. For now, we can say that second-order methods are not supported if H !== nothing.
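A minimal way to express that in the (F, G, H, x) callable (a sketch; the exact message and behavior are up to whoever writes the PR):

# Inside the (F, G, H, x) callable, before the gradient/value branches:
if H !== nothing
    error("Second-order optimizers are not supported yet; please use a first-order method.")
end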

wupeifan commented 4 years ago

> We've got support already for the information matrix, but I'm not sure if it's finite-difference based or not (I think it is) -- you can do it with
>
> using StatsBase
>
> m = optimize(model, MLE())
> StatsBase.informationmatrix(m)

The information matrix actually throws an error. I think it calls ForwardDiff automatically. @cpfiffer

cpfiffer commented 4 years ago

Welp, all the more reason to get the actual Hessian stuff built in too.

ChrisRackauckas commented 4 years ago

> @mohamed82008 what would be a quick way to get the Hessian as well? I'm switching things up a little to use gradient_logp to get the gradient too, but I'm wondering if I should add a method like hessian_logp. Anyone have thoughts on that?

Just do forward mode over whatever reverse mode is chosen.
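For illustration, forward-over-reverse can look like this (a sketch; the backend pairing of ForwardDiff over Zygote is an assumption, not necessarily what Turing uses):

using ForwardDiff, Zygote

# The Hessian is the forward-mode Jacobian of a reverse-mode gradient.
f(x) = sum(abs2, x) + prod(x)
reverse_grad(x) = Zygote.gradient(f, x)[1]
hess(x) = ForwardDiff.jacobian(reverse_grad, x)

x0 = randn(3)
H = hess(x0)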

wupeifan commented 4 years ago

> Just do forward mode over whatever reverse mode is chosen.

With custom adjoints it's possible that users can't provide second-order information for some of the intermediate steps, so the forward pass over the reverse mode might not go through. I don't think it's trivial when custom adjoints are involved; however, what you suggest should be feasible if everything is handled by an AD backend...

wupeifan commented 4 years ago

Thanks for implementing this feature request! I just experimented with the new code and it works perfectly.

cpfiffer commented 4 years ago

Did you get a noticeable speed up of any kind? I'd expect the AD gradient to be a lot faster and more precise in general.

wupeifan commented 4 years ago

@cpfiffer It's more of a feasibility issue for me, since previously I couldn't run gradient-based methods at all. It will certainly be faster than the simplex method used before.