JuliaGaussianProcesses / GPLikelihoods.jl

Provides likelihood functions for Gaussian Processes.
https://juliagaussianprocesses.github.io/GPLikelihoods.jl/
MIT License

Package Design #1

Open willtebbutt opened 4 years ago

willtebbutt commented 4 years ago

This is intended as a discussion issue where we can hash out an initial design for the package. The goal is to

  1. tie down the structure of the package, including how the internals should be designed to make implementing a pleasant user-facing API straightforward
  2. determine what flavours of approximate inference we want to target first / what is feasible for @sharanry to target over the summer.

None of this is set in stone, so please feel free to chime in with any thoughts you might have on the matter. In particular, if you think that I've missed something obvious from the design that could restrict us down the line, now would be a good time to bring it up.

Background

In an ideal world, the API for GPs with non-Gaussian likelihoods would be "Turing" or "Soss", in the sense that we would just put a GP into a probabilistic programme, and figure out everything from there. This package, however, is not aiming for that level of generality. Rather it is aiming for the tried-and-tested GP + likelihood function API, and providing a robust and well-defined API + collection of approximate inference algorithms to deal with this.

API

Taking a bottom-up approach to design, my thinking is that the following basic structure should be sufficient for our needs:

f = GP(m, k)
fx = LatentGP(f, x, ϕ)
log_density(fx, f)

where

  - m and k are the prior mean function and kernel used to construct the GP f,
  - x is the vector of inputs at which the latent function is evaluated,
  - ϕ is the log-likelihood function of the latent function values at x (the observations y are implicit in ϕ), and
  - log_density(fx, f) evaluates the joint log density log p(y, f) at a vector of latent function values (here f denotes the sampled latent values rather than the GP).

This structure encompasses all of the standard things that you'll see in ML, but is a little more general, as the likelihood function isn't restricted to be independent over outputs. To make things convenient for users, we can set up a couple of common cases of ϕ, such as factorised likelihoods (a type that specifies ϕ(f) = sum(n -> ϕ[n](f[n]), eachindex(x))) and special-case likelihoods for classification etc. (the various things implemented in GPML). I've not figured out exactly what special cases we want here, so we need to put some thought into that.
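As a rough sketch, a factorised-likelihood helper along these lines could look as follows; the name FactorisedLikelihood and the callable-struct design are hypothetical, not an agreed API.

# Hypothetical helper: wraps per-observation log-likelihood terms ϕs[n] and
# sums them, i.e. ϕ(f) = sum(n -> ϕs[n](f[n]), eachindex(f)).
struct FactorisedLikelihood{Tϕs}
    ϕs::Tϕs
end

(ϕ::FactorisedLikelihood)(f::AbstractVector) = sum(n -> ϕ.ϕs[n](f[n]), eachindex(f))

A user would then supply one per-observation term per data point, e.g. one log-density per binary label for classification.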

This interface obviously precludes expressing that the likelihood is a function of entire sample paths from f -- see e.g. [1] for an instance of this kind of thing. I can't imagine this being too much of an issue, as all of the techniques for actually working with such likelihoods necessarily involve discretising the function, which we can handle. This means that they can still be implemented in an only slightly uglier manner. If this does turn out to be an actual issue for a number of users, we can always generalise the likelihood a bit.

Note that this approach feels quite Stan-like, in that it just requires the user to specify a likelihood function.

Approximate Inference + Approximate Inference Interface

This is the bit of the design that I'm least comfortable with. I think that we should focus on getting NUTS / ESS working in the first instance, but it's not at all clear to me what the appropriate interface is for approximate inference with MCMC, given that we're working outside of a PPL. In the first instance I would propose to simply provide well-documented examples that show how to leverage the above structure in conjunction with e.g. AdvancedHMC to perform approximate inference. It's possible that we really only want to provide this functionality at the GPML.jl level, since you really need to include all of the parameters of the model, both the function f and any kernel parameters, to do anything meaningful.

The variational inference setting is probably clearer, because you can meaningfully talk about ELBOs etc. without saying much about any kernel parameters; e.g. we might implement a function along the lines of elbo(fx, q), where q is some approximate posterior over f(x). It's going to be a little way down the line before we start looking at this, though, and possibly we won't get to it at all over the summer, although it would definitely be good to look at how to get some of the stuff from AugmentedGaussianProcesses into this package. @theogf do you have any thoughts on the kinds of things that would be necessary from an interface perspective to make this feasible?
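To make the shape of this concrete, here is a minimal Monte Carlo ELBO sketch under the interface above; the name elbo_estimate, the explicit ϕ argument, and the plain sample-average estimator are illustrative rather than a settled design.

using AbstractGPs, Distributions, Random

# Naive Monte Carlo estimate of E_q[log p(y | f) + log p(f) - log q(f)], where
# fx is the finite-dimensional GP prior, ϕ is the log-likelihood of the latent
# values (with y baked in), and q is an approximate posterior supporting
# rand / logpdf (e.g. an MvNormal).
function elbo_estimate(rng::AbstractRNG, fx, ϕ, q; num_samples::Int=10)
    total = 0.0
    for _ in 1:num_samples
        f = rand(rng, q)
        total += ϕ(f) + logpdf(fx, f) - logpdf(q, f)
    end
    return total / num_samples
end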

Summary

In short, this package is likely to be quite small for a while -- more or less just a single new type and some corresponding documentation while we consider MCMC. I would envisage that this package will come into its own when we really start going for variational inference a little bit further down the line.

@yebai @sharanry @devmotion @theogf -- I would appreciate your input.

[1] - Cotter, Simon L., et al. "MCMC methods for functions: modifying old algorithms to make them faster." Statistical Science (2013): 424-446.

devmotion commented 4 years ago

it's not at all clear to me what the appropriate interface is for approximate inference with MCMC, given that we're working outside of a PPL.

Maybe one could make use of the general interface in AbstractMCMC for MCMC inference instead of targeting a specific backend. The implementation in EllipticalSliceSampling is already quite general and uses rand, rand!, and loglikelihood for specifying the prior and likelihood of the model (currently it doesn't allow specifying a model that already implements both loglikelihood and rand/rand!, but that could be changed easily), see https://github.com/TuringLang/EllipticalSliceSampling.jl/blob/91097e97fe8864bb43141201345025637a197248/src/model.jl#L42-L55. So it should be sufficient to specify loglikelihood and rand/rand! for a latent GP model to hook into EllipticalSliceSampling.
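For concreteness, here is a rough sketch of hooking a latent-GP-style model into EllipticalSliceSampling along these lines. The ESSModel / ESS / sample entry points follow the package's README and may have differed at the time of writing, and the kernel, logistic link, Bernoulli likelihood, and data are purely illustrative placeholders.

using AbstractGPs, Distributions, LinearAlgebra
using EllipticalSliceSampling
import AbstractMCMC

# Toy binary-classification setup.
x = range(-3.0, 3.0; length=50)
y = rand(Bool, 50)                               # placeholder observations

f = GP(SqExponentialKernel())
fx = f(x, 1e-6)                                  # finite-dimensional prior with jitter

# ESS needs a Gaussian prior over the latent values plus a log-likelihood.
prior = MvNormal(mean(fx), Symmetric(cov(fx)))
logistic(z) = 1 / (1 + exp(-z))
loglik(v) = sum(logpdf.(Bernoulli.(logistic.(v)), y))

samples = AbstractMCMC.sample(ESSModel(prior, loglik), ESS(), 1_000)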

I did some (preliminary) work with log Gaussian Cox processes a while ago and used EllipticalSliceSampling for some toy inference problems, so I would be very interested if the API could support such models! It would definitely be good to design the API in a way that makes it easy to use different inference algorithms (e.g., I used ESS with circulant matrices (which was a major annoyance...), but it would be very nice if one could easily switch to VI or INLA instead).

I'll try to add some more thorough comments in the next few days.

theogf commented 4 years ago

One possible issue with your API is that it will be incompatible with likelihoods requiring more than one GP, e.g. softmax, heteroscedastic, etc. From my experience it's better to generalize from the beginning to accept K GPs. [edit:] Likewise, it would be nice to consider multi-output GPs from the start (in the coregionalization kernel sense).

For the VI part, if you would like to incorporate the augmentation methods, the best interface for that would be to get a grad_log_likelihood function that can easily be overloaded for a certain type of likelihood and inference (at least that's how I do it for AGP.jl). I would be happy to participate/supervise for the VI side.

willtebbutt commented 4 years ago

One possible issue with your API is that it will be incompatible with likelihoods requiring more than one GP, e.g. softmax, heteroscedastic, etc. From my experience it's better to generalize from the beginning to accept K GPs.

Hmm yeah, that's a good point. I suppose I had imagined that this would be best handled by viewing multi-output GPs as a single GP with a particular kernel structure and a particular input vector type. That might not be the case though. I'll open an issue on AbstractGPs about this, as whatever choice we make there will propagate through to this package.

For the VI part, if you would like to incorporate the augmentation methods, the best interface for that would be to get a grad_log_likelihood function that can easily be overloaded for a certain type of likelihood and inference (at least that's how I do it for AGP.jl).

Is it not sufficient to just implement a custom rrule / frule for ϕ if it's got a particular structure that we want to exploit?

edit: additionally, does AugmentedGPs have support for allowing dependencies between the approximate posteriors over likelihood functions that accept multiple GPs?

theogf commented 4 years ago

Is it not sufficient to just implement a custom rrule / frule for ϕ if it's got a particular structure that we want to exploit?

I am not sure. What I get via the augmentation is the analytical (stochastic if needed) gradient given the variational parameters, and in the non-stochastic case this translates into block coordinate ascent updates.

edit: additionally, does AugmentedGPs have support for allowing dependencies between the approximate posteriors over likelihood functions that accept multiple GPs?

No it does not. It assumes mean-field between all GPs.

willtebbutt commented 4 years ago

I am not sure. What I get via the augmentation is the analytical (stochastic if needed) gradient given the variational parameters, and in the non-stochastic case this translates into block coordinate ascent updates.

Okay. Well it sounds like there's quite a bit more going on in your grad_log_likelihood function than the name suggests. Could you open a separate issue to discuss this and variational approximations more generally?

No it does not. It assumes mean-field between all GPs.

Ah okay. I would quite like to avoid making this assumption by default if possible. So the advantage of the everything-is-a-single-output-GP approach is that you get dense-by-default, with mean field as a special case.

theogf commented 4 years ago

Hmm yeah, that's a good point. I suppose I had imagined that this would be best handled by viewing multi-output GPs as a single GP with a particular kernel structure and a particular input vector type. That might not be the case though. I'll open an issue on AbstractGPs about this, as whatever choice we make there will propagate through to this package.

That was not my main point though; it was more about the other way around. For example, for heteroscedastic regression you will want to have 2 GPs (correlated or not, with different means or not) but only one output. It relates to the notion of Chained Gaussian Processes.

sharanry commented 4 years ago

Do you think it's a good idea to start off by making this compatible with elliptical slice sampling as @devmotion suggested? We can probably fine tune the different aspects of the API while we do this.

I think having a common interface for as many inference schemes as possible would help make this much more structured.

Regarding ϕ, is there any set of conditions this function needs to satisfy if we are to allow the user to define custom likelihoods? Can we add checks?

willtebbutt commented 4 years ago

That was not my main point though; it was more about the other way around. For example, for heteroscedastic regression you will want to have 2 GPs (correlated or not, with different means or not) but only one output. It relates to the notion of Chained Gaussian Processes.

Oh, I see. Well you can express those kinds of models in this framework by making the likelihood depend on more than one location in input-space.

willtebbutt commented 4 years ago

Do you think it's a good idea to start off by making this compatible with elliptical slice sampling as @devmotion suggested?

I'm totally on board with this, and it's straightforward to do if you don't want to tune hyperparameters. I was definitely not thinking we would include stuff for doing inference in the kernel parameters in this package though -- that's something I had envisaged stitching together in GPML.jl, when we would also bring in the notion of priors over kernel parameters etc. My aim for the separation of concerns between this package and GPML.jl is for this package to know nothing about kernel parameters: you just give it a kernel and a likelihood function, and it defines all of the functionality.

We can probably fine tune the different aspects of the API while we do this.

This seems sensible, as long as we're clear about the scope of what we're trying to do. I'm pretty sure that all we really need to be able to do is sample from the prior over f, which we can do by virtue of the AbstractGPs interface, and compute log p(y, f) (where y is implicit in ϕ) / its gradients, which we can again do because the AbstractGPs interface supports computing log p(f). To be honest, I'm not sure what else we could need other than some abstractions to make common cases convenient to express.

edit: we obviously also need to be able to know the structure of ϕ for some particular approximate inference schemes (e.g. the AugmentedGP approach), but that's a given, because we have ϕ.

sharanry commented 4 years ago

I was definitely not thinking we would include stuff for doing inference in the kernel parameters in this package though -- that's something I had envisaged stitching together in GPML.jl, when we would also bring in the notion of priors over kernel parameters etc.

If it doesn't seem right to do inference of kernel parameters in this package, we could just define the abstractions LatentGP, logpdf and a few common ϕ and move to GPML.jl to implement the different inference schemes?

The other alternative would be to design simple inference in a notebook for now and later make it a part of GPML.jl. I am personally not a fan of this idea, as it would be easier to discuss through PRs than through notebooks.

willtebbutt commented 4 years ago

@sharanry 's initial attempt at the above in #3 is great, but it and a comment from him on Slack have got me wondering about what I proposed above, in particular whether just specifying the likelihood really makes sense. It's not really clear what the generative interpretation is, and it requires us to move away from the Distributions.jl-like interface, as we wouldn't have logpdf and rand functions. This would be quite sad, as having those things makes it really clear what your Distribution is. So I'm going to backtrack on the above proposal in favour of the following.

Proposal v2.0

A LatentGP is a distribution over a pair of real-valued vectors (that need not be the same length), call them v and y. A LatentGP has two parameters: a FiniteGP fx and a function ϕ that returns the conditional distribution over y given v. Sampling from this model is implemented as follows:

function rand(rng::AbstractRNG, d::LatentGP)
    v = rand(rng, d.fx)      # sample latent function values from the finite-dimensional GP prior
    y = rand(rng, d.ϕ(v))    # sample observations from the conditional distribution given v
    return (v=v, y=y)
end

Note that we make no assumptions about the particular distribution that d.ϕ(v) is, so there is a lot of freedom here to do interesting things. For example, this structure supports the Chained GPs framework mentioned by @theogf: if one sets things up such that length(v) == 2 * length(y), the prior GP f represents a very simple multi-output GP comprising two independent GPs, and it's indexed in the correct manner (I'm assuming here that we've figured out how to do this in AbstractGPs), then the types of structure that you see in the Chained GPs framework follow straightforwardly.
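To illustrate, a heteroscedastic conditional under this proposal could be sketched as follows; the function name and the exp link for the noise scale are illustrative, and it assumes the multi-output indexing question above has been resolved so that the prior delivers 2 * length(y) latent values.

using Distributions

# v stacks two latent processes: the first half models the mean, the second
# half the log observation-noise scale.
function heteroscedastic_conditional(v::AbstractVector)
    n = length(v) ÷ 2                 # length(v) == 2 * length(y)
    μ = v[1:n]                        # latent mean process
    σ = exp.(v[(n + 1):end])          # latent noise process, made positive
    return Product(Normal.(μ, σ))     # independent Gaussians given v
end

# latent_gp = LatentGP(f2(x), heteroscedastic_conditional), given a two-output prior f2.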

In the special case that d.ϕ(v) is a Gaussian whose mean is some affine transformation of v, everything is jointly Gaussian and we get something that's similar to a FiniteGP, in that the marginal distribution over y is a multivariate Gaussian. However, it differs from a FiniteGP in that a FiniteGP represents only the marginal distribution over y, and doesn't explicitly deal with the noise-free latent process when sampling / computing marginal probabilities etc.

More generally we can introduce particular types for ϕ to make it possible to specialise approximate inference based on those types i.e. implement the augmentation tricks that @theogf works on / implement the more traditional deterministic approximate inference that you can find in GPML.

It's also really clear what the logpdf function should be:

function logpdf(d::LatentGP, y::NamedTuple{(:v, :y)})
    # log p(y, v) = log p(v) + log p(y | v)
    return logpdf(d.fx, y.v) + logpdf(d.ϕ(y.v), y.y)
end

Most approximate inference schemes will be in the business of finding an approximation to the posterior over v given y.

I think this solution is much cleaner anyway as it doesn't introduce any new API components over what we've already got from Distributions.

@sharanry @devmotion @theogf what do you make of this?

edit:

For example, a manual way to implement a diagonal Gaussian likelihood (I'm not sure why you would ever need to do this in practice, but it's a good example) would be

using Distributions, AbstractGPs

noise_var = 0.1  # illustrative value; the original snippet left noise_var undefined
f = GP(Matern52Kernel())
x = range(-5.0, 5.0; length=100)
latent_gp = LatentGP(f(x), v -> Product(Normal.(v, sqrt(noise_var))))  # Normal takes a standard deviation
y = rand(latent_gp)
y.y  # length-100 vector of observations
y.v  # length-100 vector of latent function values
logpdf(latent_gp, y)  # this works and does what was described above

We could also implement a GaussianConditional or GaussianLikelihood type (I'm not sure what the correct naming convention is here, definitely a discussion point) that would simplify the above implementation:

latent_gp = LatentGP(f(x), GaussianConditional(noise_var))

Then you could e.g. dispatch on the GaussianConditional type to implement custom approximate / exact inference for this particular case. By way of another example, we might consider having a BernoulliConditional for binary classification with GPs.
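As a rough illustration, a BernoulliConditional might look something like the following; the logistic link and the callable-struct design are illustrative choices rather than a settled API.

using Distributions

struct BernoulliConditional end

logistic(z) = 1 / (1 + exp(-z))

# Map latent values v to an independent Bernoulli distribution per point.
(::BernoulliConditional)(v::AbstractVector) = Product(Bernoulli.(logistic.(v)))

# Usage, mirroring the Gaussian example above:
# latent_gp = LatentGP(f(x), BernoulliConditional())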

willtebbutt commented 4 years ago

It became apparent when discussing approximate inference with pseudo-points that the above design can be a little annoying. See here for details. @sharanry what are your thoughts on this? I think I've become even more convinced over time that the change proposed there is a good idea. It would only be a small refactor, but I think it would make the user experience way better.