willtebbutt opened this issue 4 years ago
> it's not at all clear to me what the appropriate interface is for approximate inference with MCMC, given that we're working outside of a PPL.
Maybe one could make use of the general interface in AbstractMCMC for MCMC inference instead of targeting a specific backend. The implementation in EllipticalSliceSampling is already quite general and uses `rand`, `rand!`, and `loglikelihood` for specifying the prior and likelihood of the model (currently it doesn't allow one to specify a model that already implements both `loglikelihood` and `rand`/`rand!`, but that could be changed easily), see https://github.com/TuringLang/EllipticalSliceSampling.jl/blob/91097e97fe8864bb43141201345025637a197248/src/model.jl#L42-L55. So it should be sufficient to specify `loglikelihood` and `rand`/`rand!` for a latent GP model to hook into EllipticalSliceSampling.
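To make that concrete, here is a hedged sketch of how the pieces might fit together, assuming the `ESSModel(prior, loglikelihood)` constructor from the file linked above; the Bernoulli/logistic likelihood and all names below are illustrative only, and exact imports may differ between versions:

```julia
# Hedged sketch: hook a latent GP into EllipticalSliceSampling via ESSModel.
using AbstractGPs, Distributions, EllipticalSliceSampling, AbstractMCMC, Random

rng = MersenneTwister(1)
f = GP(Matern52Kernel())
x = range(-5.0, 5.0; length=50)
fx = f(x, 1e-6)                       # Gaussian prior over the latent values f(x)

y = rand(rng, fx) .> 0                # made-up binary observations
function loglik(v)                    # log p(y | v) under a logistic-Bernoulli likelihood
    p = 1 ./ (1 .+ exp.(-v))
    return sum(logpdf.(Bernoulli.(p), y))
end

# ESS only needs samples from the Gaussian prior and a log-likelihood function.
samples = sample(rng, ESSModel(fx, loglik), ESS(), 1_000)
```

The point is just that a Gaussian prior with `rand` plus a log-likelihood function is all ESS needs, which is exactly what a latent GP model can provide.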
I did some (preliminary) work with log Gaussian Cox processes a while ago and used EllipticalSliceSampling for some toy inference problems, so I would be very interested if the API could support such models! It would definitely be good to design the API in a way that makes it easy to use different inference algorithms (e.g., I used ESS with circulant matrices, which was a major annoyance, but it would be very nice if one could easily switch to VI or INLA instead).

I'll try to add some more thorough comments in the next few days.
One possible issue with your API is that it will be incompatible with likelihoods requiring more than one GP, e.g. softmax, heteroscedastic, etc. From my experience it's better to generalize from the beginning to accept `K` GPs.
[edit:] Likewise it would be nice to consider multi-output GPs from the start (in the coregionalization kernel sense).
For the VI part, if you would like to incorporate the augmentation methods, the best interface for that would be to get a `grad_log_likelihood` function that can easily be overloaded for a certain type of likelihood and inference (at least that's how I do it for AGP.jl). I would be happy to participate in / supervise the VI side.
> One possible issue with your API is that it will be incompatible with likelihoods requiring more than one GP, e.g. softmax, heteroscedastic, etc. From my experience it's better to generalize from the beginning to accept `K` GPs.
Hmm yeah, that's a good point. I suppose I had imagined that this would be best handled by viewing multi-output GPs as a single GP with a particular kernel structure, and a particular input vector type. That might not be the case though. I'll open an issue on AbstractGPs about this, as whatever choice we make there will propagate through to this package.
> For the VI part, if you would like to incorporate the augmentation methods, the best interface for that would be to get a `grad_log_likelihood` function that can easily be overloaded for a certain type of likelihood and inference (at least that's how I do it for AGP.jl).
Is it not sufficient to just implement a custom `rrule` / `frule` for `ϕ` if it's got a particular structure that we want to exploit?
edit: additionally, does AugmentedGPs have support for allowing dependencies between the approximate posteriors over likelihood functions that accept multiple GPs?
> Is it not sufficient to just implement a custom `rrule` / `frule` for `ϕ` if it's got a particular structure that we want to exploit?
I am not sure. What I get via the augmentation is the analytical (stochastic if needed) gradient given the variational parameters, and in the non-stochastic case this translates into block coordinate ascent updates.
> additionally, does AugmentedGPs have support for allowing dependencies between the approximate posteriors over likelihood functions that accept multiple GPs?
No it does not. It assumes mean-field between all GPs.
> I am not sure. What I get via the augmentation is the analytical (stochastic if needed) gradient given the variational parameters, and in the non-stochastic case this translates into block coordinate ascent updates.
Okay. Well it sounds like there's quite a bit more going on in your `grad_log_likelihood` function than the name suggests. Could you open a separate issue to discuss this and variational approximations more generally?
> No it does not. It assumes mean-field between all GPs.
Ah okay. I would quite like to avoid making this assumption by default if possible. So the advantage of the everything-is-a-single-output-GP approach is that you get dense-by-default, with mean field as a special case.
> Hmm yeah, that's a good point. I suppose I had imagined that this would be best handled by viewing multi-output GPs as a single GP with a particular kernel structure, and a particular input vector type. That might not be the case though. I'll open an issue on AbstractGPs about this, as whatever choice we make there will propagate through to this package.
That was not my main point though. It was more about the other way around. For example, for heteroscedastic regression you will want to have 2 GPs (correlated or not, with different means or not) but only one output. This goes to the notion of Chained Gaussian Processes.
Do you think it's a good idea to start off by making this compatible with elliptical slice sampling as @devmotion suggested? We can probably fine-tune the different aspects of the API while we do this.
I think having a common interface for as many inference schemes as possible would help make this much more structured.
Regarding `ϕ`, is there any set of conditions this function needs to follow if we are to allow the user to define custom likelihoods? Can we add checks?
> That was not my main point though. It was more about the other way around. For example, for heteroscedastic regression you will want to have 2 GPs (correlated or not, with different means or not) but only one output. This goes to the notion of Chained Gaussian Processes.
Oh, I see. Well you can express those kinds of models in this framework by making the likelihood depend on more than one location in input-space.
> Do you think it's a good idea to start off by making this compatible with elliptical slice sampling as @devmotion suggested?
I'm totally on board with this, and it's straightforward to do if you don't want to tune hyperparameters. I was definitely not thinking we would include stuff for doing inference in the kernel parameters in this package though -- that's something I had envisaged stitching together in GPML.jl, where we would also bring in the notion of priors over kernel parameters etc. My aim for the separation of concerns between this package and GPML.jl is for this package to know nothing about kernel parameters: you just give it a kernel and a likelihood function, and it defines all of the functionality.
> We can probably fine-tune the different aspects of the API while we do this.
This seems sensible, as long as we're clear about the scope of what we're trying to do. I'm pretty sure that all we really need to be able to do is sample from the prior over `f`, which we can do by virtue of the `AbstractGPs` interface, and compute `log p(y, f)` (where `y` is implicit in `ϕ`) / its gradients, which we can again do because the `AbstractGPs` interface supports computing `log p(f)`. To be honest, I'm not sure what else we could need other than some abstractions to make common cases convenient to express.
edit: we obviously also need to be able to know the structure of `ϕ` for some particular approximate inference schemes (e.g. the AugmentedGP approach), but that's a given, because we have `ϕ`.
> I was definitely not thinking we would include stuff for doing inference in the kernel parameters in this package though -- that's something I had envisaged stitching together in GPML.jl, where we would also bring in the notion of priors over kernel parameters etc.
If it doesn't seem right to do inference over kernel parameters in this package, we could just define the abstractions `LatentGP`, `logpdf`, and a few common `ϕ`, and move to `GPML.jl` to implement the different inference schemes?

Another alternative would be to design simple inference in a notebook for now and later make it a part of `GPML.jl`. I am personally not a fan of this idea, as it would be easier to discuss through PRs than using notebooks.
@sharanry's initial attempt at the above in #3 is great, but it and a comment from him on Slack have got me wondering about what I proposed above, in particular whether just specifying the likelihood really makes sense. It's not really clear what the generative interpretation is, and it requires us to move away from the `Distributions.jl`-like interface, as we wouldn't have `logpdf` and `rand` functions. This would be quite sad, as having those things makes it really clear what your `Distribution` is. So I'm going to backtrack on the above proposal in favour of the following.
A `LatentGP` is a distribution over a pair of real-valued vectors (that need not be the same length), call them `v` and `y`. A `LatentGP` has two parameters: a `FiniteGP` `fx` and a function `ϕ` that returns the conditional distribution over `y` given `v`. Sampling from this model is implemented as follows:
```julia
function rand(rng::AbstractRNG, d::LatentGP)
    v = rand(rng, d.fx)
    y = rand(rng, d.ϕ(v))
    return (v=v, y=y)
end
```
Note that we make no assumptions about the particular distribution that `d.ϕ(v)` is, so there is a lot of freedom here to do interesting things. For example, this structure supports the Chained GPs framework mentioned by @theogf: if one sets things up such that `length(v) == 2 * length(y)`, the prior GP `f` represents a very simple multi-output GP comprising two independent GPs, and it's indexed in the correct manner (I'm assuming here that we've figured out how to do this in AbstractGPs), then the types of structure that you see in the Chained GPs framework follow straightforwardly.
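As a purely illustrative sketch of that case (the name and link functions here are assumptions, not a proposed API), a heteroscedastic-regression `ϕ` consuming `2 * length(y)` latent values might look like:

```julia
using Distributions

# Hypothetical Chained-GPs-style ϕ: the first half of v models the mean of y,
# the second half the (log) noise scale, so length(v) == 2 * length(y).
function heteroscedastic_ϕ(v::AbstractVector{<:Real})
    N = length(v) ÷ 2
    m = v[1:N]                       # latent mean
    s = exp.(v[(N + 1):end])         # latent noise scale, constrained positive
    return Product(Normal.(m, s))    # conditional distribution over y given v
end
```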
In the special case that `d.ϕ(v)` is a Gaussian whose mean is some affine transformation of `v`, everything is jointly Gaussian and we get something that's similar to a `FiniteGP`, in that the marginal distribution over `y` is a multivariate Gaussian. However, it differs from a `FiniteGP` in that a `FiniteGP` represents only the marginal distribution over `y`, and doesn't explicitly deal with the noise-free latent process when sampling / computing marginal probabilities etc.
More generally, we can introduce particular types for `ϕ` to make it possible to specialise approximate inference based on those types, i.e. implement the augmentation tricks that @theogf works on / implement the more traditional deterministic approximate inference that you can find in GPML.
It's also really clear what the `logpdf` function should be:
```julia
function logpdf(d::LatentGP, y::NamedTuple{(:v, :y)})
    return logpdf(d.fx, y.v) + logpdf(d.ϕ(y.v), y.y)
end
```
Most approximate inference schemes will be in the business of finding an approximation to the posterior over `v` given `y`.
I think this solution is much cleaner anyway, as it doesn't introduce any new API components over what we've already got from `Distributions`.
@sharanry @devmotion @theogf what do you make of this?
edit: For example, a manual way to implement a diagonal Gaussian likelihood (I'm not sure why you would ever need to do this in practice, but it's a good example) would be:
```julia
using Distributions, AbstractGPs

noise_var = 0.1  # observation noise; needs to be defined, any positive value works here

f = GP(Matern52Kernel())
x = range(-5.0, 5.0; length=100)
latent_gp = LatentGP(f(x), v -> Product(Normal.(v, noise_var)))

y = rand(latent_gp)
y.y                   # length-100 vector
y.v                   # length-100 vector
logpdf(latent_gp, y)  # this works and does what was described above
```
We could also implement a `GaussianConditional` or `GaussianLikelihood` type (I'm not sure what the correct naming convention is here, definitely a discussion point) that would simplify the above implementation:
```julia
latent_gp = LatentGP(f(x), GaussianConditional(noise_var))
```
Then you could e.g. dispatch on the `GaussianConditional` type to implement custom approximate / exact inference for this particular case. By way of another example, we might consider having a `BernoulliConditional` for binary classification with GPs.
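For concreteness, a hedged sketch of what such conditional types might look like (the fields, the logistic link, and the exact names below are assumptions for illustration, not a settled API):

```julia
using Distributions

# Diagonal Gaussian conditional: y | v is Normal with mean v and variance noise_var
struct GaussianConditional{T<:Real}
    noise_var::T
end
(ϕ::GaussianConditional)(v::AbstractVector{<:Real}) = Product(Normal.(v, sqrt(ϕ.noise_var)))

# Bernoulli conditional for binary classification, using a logistic link
struct BernoulliConditional end
function (::BernoulliConditional)(v::AbstractVector{<:Real})
    p = 1 ./ (1 .+ exp.(-v))
    return Product(Bernoulli.(p))
end
```

Approximate inference code could then dispatch on these types to exploit e.g. conjugacy or augmentation tricks.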
It became apparent when discussing approximate inference with pseudo-points that the above design can be a little annoying. See here for details. @sharanry what are your thoughts on this? I think I've become even more convinced over time that it's a good idea. It would only be a small refactor, but it would make the user experience way better I think.
This is intended as a discussion issue where we can hash out an initial design for the package. The goal is to
None of this is set in stone, so please feel free to chime in with any thoughts you might have on the matter. In particular if you think that I've missed something obvious from the design that could restrict us down the line, now would be a good time to bring it up.
Background
In an ideal world, the API for GPs with non-Gaussian likelihoods would be "Turing" or "Soss", in the sense that we would just put a GP into a probabilistic programme, and figure out everything from there. This package, however, is not aiming for that level of generality. Rather it is aiming for the tried-and-tested GP + likelihood function API, and providing a robust and well-defined API + collection of approximate inference algorithms to deal with this.
API
Taking a bottom-up approach to design, my thinking is that the following basic structure should be sufficient for our needs:
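Roughly, something along the lines of the following (the field names and types here are placeholders for illustration; the intended semantics are spelled out in the list below):

```julia
struct LatentGP{Tf, Tx, Tϕ}
    f::Tf   # some GP
    x::Tx   # an AbstractVector of inputs
    ϕ::Tϕ   # computes the log likelihood of a sample from f at x
end
```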
where

- `f` is some GP whose inputs are of type `Tx`,
- `x` is some subtype of `AbstractVector{Tx}`,
- `ϕ` is a function from `AbstractVector{<:Real}` to `Real` that computes the log likelihood of a particular sample from `f` at `x`, and
- `log_density(fx, f) := logpdf(fx, f) + ϕ(f)` (it's not clear to me whether this function is ever non-trivial).

This structure encompasses all of the standard things that you'll see in ML, but is a little more general, as the likelihood function isn't restricted to be independent over outputs. To make things convenient for users, we can set up a couple of common cases of `ϕ`, such as factorised likelihoods: a type that specifies that `ϕ(f) = sum(n -> ϕ[n](f[n]), eachindex(x))` (sketched at the end of this section), and special cases of likelihoods for classification etc. (the various things implemented in GPML). I've not figured out exactly what special cases we want here, so we need to put some thought into that.

This interface obviously precludes expressing that the likelihood is a function of entire sample paths from `f` -- see e.g. [1] for an instance of this kind of thing. I can't imagine this being too much of an issue, as all of the techniques for actually working with such likelihoods necessarily involve discretising the function, which we can handle. This means that they can still be implemented in an only slightly more ugly manner. If this does turn out to be an actual issue for a number of users, we can always generalise the likelihood a bit.

Note that this approach feels quite Stan-like, in that it just requires the user to specify a likelihood function.
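As a hedged sketch of the factorised-likelihood case mentioned above (the type name and layout are illustrative only):

```julia
using Distributions

struct FactorisedLogLik{Tϕs}
    ϕs::Tϕs   # one log-likelihood function per observation
end

# ϕ(f) = sum(n -> ϕ[n](f[n]), eachindex(x)), for f sampled from the GP at x
(ϕ::FactorisedLogLik)(f::AbstractVector{<:Real}) = sum(n -> ϕ.ϕs[n](f[n]), eachindex(f))

# e.g. probit binary classification against observations y_n in {true, false}:
y = rand(Bool, 10)
ϕ = FactorisedLogLik([f_n -> logpdf(Bernoulli(cdf(Normal(), f_n)), y_n) for y_n in y])
ϕ(randn(10))   # returns a Real, as required
```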
Approximate Inference + Approximate Inference Interface
This is the bit of the design that I'm least comfortable with. I think that we should focus on getting NUTS / ESS working in the first instance, but it's not at all clear to me what the appropriate interface is for approximate inference with MCMC, given that we're working outside of a PPL. In the first instance I would propose to simply provide well-documented examples that show how to leverage the above structure in conjunction with e.g. AdvancedHMC to perform approximate inference. It's possible that we really only want to provide this functionality at the GPML.jl level, since you really need to include all of the parameters of the model, both the function `f` and any kernel parameters, to do anything meaningful.

The variational inference setting is probably a bit clearer, because you can meaningfully talk about ELBOs etc. without talking too much about any kernel parameters. E.g. we might implement a function along the lines of `elbo(fx, q)`, where `q` is some approximate posterior over `f(x)`. It's going to be a little bit down the line before we start looking at this though, and possibly we won't get to it at all over the summer, although it would definitely be good to look at how to get some of the stuff from AugmentedGaussianProcesses into this package. @theogf do you have any thoughts on the kinds of things that would be necessary from an interface perspective to make this feasible?
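By way of a rough Monte-Carlo illustration of what `elbo(fx, q)` could compute (nothing like this exists yet; `elbo_sketch`, the sample count, and the argument names are all placeholders):

```julia
using Distributions, Random, Statistics

# ELBO ≈ E_q[ϕ(v)] - KL(q || p), with both expectations estimated from samples of q.
# fx is the prior over f(x), q an approximate posterior over the same values,
# and ϕ the log likelihood function from the proposal above.
function elbo_sketch(rng::AbstractRNG, fx, q, ϕ; n_samples::Int=10)
    vs = [rand(rng, q) for _ in 1:n_samples]
    return mean(ϕ.(vs)) - mean(logpdf.(Ref(q), vs) .- logpdf.(Ref(fx), vs))
end
```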
Summary

In short, this package is likely to be quite small for a while -- more or less just a single new type and some corresponding documentation while we consider MCMC. I would envisage that this package will come into its own when we really start going for variational inference a little bit further down the line.
@yebai @sharanry @devmotion @theogf -- I would appreciate your input.
[1] - Cotter, Simon L., et al. "MCMC methods for functions: modifying old algorithms to make them faster." Statistical Science (2013): 424-446.