TuringLang / Turing.jl

Bayesian inference with probabilistic programming.
https://turinglang.org
MIT License
2.05k stars 219 forks source link

Automatic pre-allocation #1241

Closed mohamed82008 closed 2 years ago

mohamed82008 commented 4 years ago

This package https://github.com/oxinabox/AutoPreallocation.jl provides a way to pre-allocate all the memory needed in a function automatically. Might be worth supporting this as an optimization in the function which computes the value and gradient of logp before passing it to AHMC or other inference algos.

yebai commented 4 years ago

As a rule of thumb, I don't think we should support this type of optimisation from Turing. Instead, we can make sure Turing provides enough flexibility for the user to do these optimisations if required.

mohamed82008 commented 4 years ago

Yes we can optionally "support" it with Requires and a kwarg or something.

yebai commented 4 years ago

Yes we can optionally "support" it with Requires and a kwarg or something.

I am not sure. We want to keep Turing focused on "inference" as much as possible moving forward. Everything that can be done on the Julia side should be left outside Turing unless there is an exceptionally strong case.

mohamed82008 commented 4 years ago

Hmm, maybe we can provide a function to return the functions to be passed to AHMC. The user can then modify these functions before passing them to AHMC or AMH manually.

mohamed82008 commented 4 years ago

Currently, there aren't entry points where the user can modify things between the model definition and sampling. This makes sense in general because most users won't need to modify the function after it has been defined in Julia. But in the world of Cassette, the user can modify the model function after it has been defined enabling optimizations like AutoPreallocation.jl. Memoization is another optimization. So I think providing an entry point to modify the model function and/or its gradient function before running sampling might be along the lines of your comment.

devmotion commented 4 years ago

I think these optimizations and customizations should be possible if we would use callable structs in Turing instead of anonymous functions, as outlined in https://github.com/TuringLang/Turing.jl/pull/1230#discussion_r413531662. Then a user should be able to define its own implementation of (::LogGradientFunction{<:MyCustomModel,<:Sampler})(theta), and e.g. also memoize or preallocate it.

mohamed82008 commented 4 years ago

I think these are 2 orthogonal concerns. Whether we return the logp function as a closure or a callable struct, both would enable user optimizations. I think the main challenge here is to figure out a nice API that "feels right". By that I mean, it should work with most if not all the inference algorithms (although we can start with HMC and MH) but also provide the user with enough flexibility.

Perhaps along the lines of your comment, we can use the name of the model as part of its type. That way the user can overload internal Turing functions for Model{:gdemo} (MyCustomModel in your comment) for example modifying the behavior of any Turing function (not that this is a good idea in general!). But I don't see how that can help us in this case. The user still needs the original function to optimize it and then the new function needs to be "injected" into the Turing pipeline to proceed to the sampling. This injection can happen by overloading (::LogGradientFunction)(theta) and calling the new function (but that will result in a closure with a global variable if in REPL) or we can just provide a way to swap out spl.state.h.ℓπ and spl.state.h.∂ℓπ∂θ perhaps via a callback function or a swap_xyz function. Doing this generically is an interesting API design problem. I like the callback idea.

devmotion commented 4 years ago

Perhaps along the lines of your comment, we can use the name of the model as part of its type. That way the user can overload internal Turing functions for Model{:gdemo} (MyCustomModel in your comment) for example modifying the behavior of any Turing function (not that this is a good idea in general!).

Why would this be needed? You could always just run

m = model(x, y)
typeof(m)

to get the type of the model with observations x and y?

The user still needs the original function to optimize it and then the new function needs to be "injected" into the Turing pipeline to proceed to the sampling. This injection can happen by overloading (::LogGradientFunction)(theta) and calling the new function (but that will result in a closure with a global variable if in REPL) or we can just provide a way to swap out spl.state.h.ℓπ and spl.state.h.∂ℓπ∂θ perhaps via a callback function or a swap_xyz function. Doing this generically is an interesting API design problem. I like the callback idea.

I'm sorry, I don't quite follow. My point was that if you have something like

struct LogGradientFunction{M<:Model,S<:Sampler,V<:VarInfo}
    model::M
    sampler::S
    varinfo::V
end

function (f::LogGradientFunction)(theta)
    varinfo = f.varinfo
    sampler = f.sampler

    thetaold = varinfo[sampler]
    logpold = getlogp(varinfo)

    varinfo[sampler] = x
    f.model(varinfo, sampler)
    logp = getlogp(varinfo)

    varinfo[sampler] = thetaold
    setlogp!(varinfo, logpold)

    return logp
end

in Turing and just construct a LogGradientFunction in the step of whatever sampler, users can just implement

m = model(x, y)

function (f::Turing.LogGradientFunction{typeof(m)})(theta)
    # .... whatever optimized implementation
end

without hooking into any other Turing internals. But maybe there's some problem with this approach that I don't see right now? I mean it's pure speculation and I haven't tried to actually perform any such optimizations :smile:

mohamed82008 commented 4 years ago

The problem is that there is no optimized implementation :) The optimization requires the original function and then we need to replace it with the output of AutoPreallocator.preallocate for example. This needs to be done once only not every time ::LogGradientFunction is called. So the API I am thinking is:

m = model(x, y)
f = getlogpf(m)
newf = preallocate(f, VarInfo(m)[SampleFromPrior()])
spl = Sampler(HMC(...), m, logp = newf)
sample(spl, nsamples)

Or

m = model(x, y)
spl = Sampler(HMC(..), m)
f = getlogpf(spl)
newf = preallocate(f, spl.state.vi[spl]) # spl.state.vi[spl] should probably get a function in Turing
new_spl = Sampler(spl, logp = newf)
sample(new_spl, nsamples)
mohamed82008 commented 4 years ago

I think Sampler is a good intermediate entry point for these kinds of optimizations.

devmotion commented 4 years ago

Again pure speculation, but couldn't you write

m = model(x, y)
spl = Sampler(HMC(...), m)

f(model, varinfo, sampler) = model(varinfo, sampler)
newf = preallocate(f, m, spl, VarInfo(m))

function (f::LogGradientFunction{typeof(m)})(theta)
    varinfo = f.varinfo
    sampler = f.sampler

    thetaold = varinfo[sampler]
    logpold = getlogp(varinfo)

    varinfo[sampler] = x
    newf(f.model, varinfo, sampler)
    logp = getlogp(varinfo)

    varinfo[sampler] = thetaold
    setlogp!(varinfo, logpold)

    return logp
end
mohamed82008 commented 4 years ago

Hmm I guess that could work but it is too verbose and exposes a lot of the internals of Turing which I would rather keep hidden to have the freedom of changing in the future. Also it doesn't capture all the allocations, e.g. thetaold = varinfo[sampler] will still allocate. And if the user is not careful, newf may be a global variable which will kill perf. I prefer a simpler API with less chances of foot-shooting.

mohamed82008 commented 4 years ago

Whether we provide a flag or kwarg for doing this kind of optimization automatically or not is a separate issue. I think for most of our users having a single kwarg is more convenient than having to tinker with Sampler and external packages. But providing a tinkering mechanism is more important to enable any other kind of optimization and not to commit to a specific package even for a single optimization.

devmotion commented 4 years ago

Also it doesn't capture all the allocations, e.g. thetaold = varinfo[sampler] will still allocate.

Just change it to

m = model(x, y)
spl = Sampler(HMC(...), m)

function f(model, sampler, theta)
    thetaold = varinfo[sampler]
    logpold = getlogp(varinfo)

    varinfo[sampler] = theta
    model(varinfo, sampler)
    logp = getlogp(varinfo)

    varinfo[sampler] = thetaold
    setlogp!(varinfo, logpold)

    return logp
end
newf = preallocate(f, m, spl, VarInfo(m)[spl])

(f::LogGradientFunction{typeof(m)})(theta) = newf(f.model, f.sampler, theta)

I prefer a simpler API with less chances of foot-shooting.

One could just make this snippet a macro in an external package such as AdvancedTuringStuff if it works.

I think for most of our users having a single kwarg is more convenient than having to tinker with Sampler and external packages.

I just don't think it's reasonable to try to support all kind of possible use cases and external packages with keyword arguments. It seems it might be possible to achieve the functionality that you had in mind by overloading ::LogGradientFunction, so IMO that is better than adding keyword arguments just to support this (IMO quite advanced) use case. IMO Turing should allow such flexibility but such advanced features should not be part of the basic API - as @yebai said, IMO

Everything that can be done on the Julia side should be left outside Turing unless there is an exceptionally strong case.

mohamed82008 commented 4 years ago

So thinking about this, this will run into world age problems if the method is defined inside a function.

mohamed82008 commented 4 years ago

And if in REPL, newf will be a global variable that is closed over unless made a constant.

devmotion commented 4 years ago

Yep, but you can surely make it a constant? And I guess another alternative would be to use Cassette - I mean, it's not a huge step from AutoPreallocation (which uses Cassette internally) to then also using Cassette to overdub calls of (::LogGradientFunction{<:MyModel})(x). And actually that's a lot easier if you don't deal with closures of arbitrary types for the log gradient function.

mohamed82008 commented 4 years ago

Yep, but you can surely make it a constant?

Sure, but it is less convenient. With the Sampler approach, the function barrier avoids the need for this.

I mean, it's not a huge step from AutoPreallocation (which uses Cassette internally) to then also using Cassette to overdub calls of (::LogGradientFunction{<:MyModel})(x).

I don't think asking users to use Cassette to have access to AutoPreallocation is a good API :) Unless you mean for us to use Cassette to mimic the effect of method overloading from within a function. I am not sure if that's possible but if it is, it would be a cool idea for a separate package that we can then use. It's definitely way more hackish than a simple swapping of the function in Sampler though. Personally, I don't think we should use Cassette unless there is no other alternative.

devmotion commented 4 years ago

Unless you mean for us to use Cassette to mimic the effect of method overloading from within a function.

Yep, I was thinking about providing such a functionality in an external package (if it seems useful).

It's definitely way more hackish than a simple swapping of the function in Sampler though.

Sure but IMO it's not a use case that I would expect Turing to support natively in the first place. I just think of Turing as a tool that simplifies the implementation and inference of a very general class of probabilistic models in a reasonably fast way, and hence a main point is that I don't have to care about implementing the log density function explicitly. Additionally, IMO from a developer perspective it is nice if Turing's basic implementation is as simple as possible, since that will surely simplify maintainability. Ideally you can extend Turing if you want to speed up inference (and it seems that might be possible, e.g., with Cassette for your use case here). If you want to really optimize every tiny bits of your HMC inference, an alternative would be to use AdvancedHMC directly.

mohamed82008 commented 4 years ago

Sure but IMO it's not a use case that I would expect Turing to support natively in the first place.

As a user, if I can write 4 extra lines of code and get a significant speedup, I would love for Turing to provide that.

Additionally, IMO from a developer perspective it is nice if Turing's basic implementation is as simple as possible, since that will surely simplify maintainability.

I agree so let's find a simple implementation :) A Cassette-based approach is far from simple imo.

If you want to really optimize every tiny bits of your HMC inference, an alternative would be to use AdvancedHMC directly.

This optimization is not HMC-specific, it can apply to MH and Gibbs sampling as well. Letting the user hack their way through Turing to pre-allocate for Gibbs sampling is a near impossible task unless that user is a Turing developer as well ;)