GPflow / GPflow

Gaussian processes in TensorFlow
Apache License 2.0

Model Composition #297

Open Bonnevie opened 7 years ago

Bonnevie commented 7 years ago

Are there any design principles in place for model composition? A "basic" example would be defining two sparse GPs and then adding a joint likelihood of some kind. I'm guessing it would be perfectly doable by just building it from the ground up in a new Model subclass, but that seems like poor code design when the code for the components, e.g. the GPs, already exists, as it does here.

Could we just define a top-level model and build its likelihood as the sum of the likelihoods of the submodels? From my meager experiments there seems to be an issue with tf_mode: the submodel likelihoods need to be added to the "master" model's graph.

Ideally, we could borrow from the compositionality ideas and syntax in Edward (http://edwardlib.org/api/model-compositionality), but that might require a pretty heavy-handed restructuring.

jameshensman commented 7 years ago

Hi @Bonnevie

Lots of bits of the sparse GP are composable. The basic recipe for a variational sparse GP is to combine the KL divergence between the variational distribution over inducing values and the prior (see kullback_leiblers) with the expected log-likelihood of the data, computed via conditional and the likelihood's variational expectations.

You can adapt that easily to have two GPs.
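
Written out informally (a sketch in my own notation, not code), the single-GP bound is

    log p(Y) >= sum_n E_{q(f_n)}[ log p(y_n | f_n) ] - KL[ q(u) || p(u) ]

and with two latent GPs sharing a joint likelihood it just picks up a second KL term and a joint expectation:

    log p(Y) >= sum_n E_{q(f_n1) q(f_n2)}[ log p(y_n | f_n1, f_n2) ] - KL[ q(u1) || p(u1) ] - KL[ q(u2) || p(u2) ]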

Ideally...

I disagree that this is ideal! I much prefer to define the parameters, then write down the likelihood. And a core Stan developer once hinted that he might have preferred Stan to work that way too, but I digress.

As for composing models, I think it should be possible to add a model as a sub-structure in another model, but the 'top level' build_likelihood function would have to contain calls to the build_likelihood functions beneath.

dustinvtran commented 7 years ago

in edward you can define the parameters, then write down the likelihood. we don't represent a model per se as anything beyond a collection of random variables. this means you can infer a random variable that is used in multiple models, or plug an inferred posterior into another model, as components of model -> inference -> criticism -> ... are chained together.

i think the key advantage of compositionality is in inference, where you can perform conditional inference over a substructure of a model (e.g., variational inference for a sparse GP component) while doing something else (e.g., black box methods) for the joint likelihood piece.

it would be nice to take advantage of gpflow's algorithms as gps are scaffolded within larger models.

Bonnevie commented 7 years ago

@jameshensman - don't get me wrong, I'm a huge fan of the GPflow code base (it's basically a master-class in object-oriented machine learning code and you've managed to exploit the advantages of TensorFlow without sacrificing the interface of GPy), but from your description it sounds like every new GP-based model architecture has to be coded from the bottom up, with a certain amount of copy-pasting from source involved.

Would calling build_likelihood() of a submodel inside a higher-level model work out of the box? What does it take to have calls to tf_mode() and make_tf_array() in the top-level model propagate down? Since Model subclasses Parameterized, any new Model object added as an attribute should be appended to sorted_params automatically, which both tf_mode() and make_tf_array() loop over, right? Are there any other immediate pitfalls?

edit: or could another model be added as a prior to a parameter?

(side comment: Even if you don't agree with Edward's way of structuring probabilistic models, I think there are many of us who would love it if the two packages integrated in some way or another. GPs are certainly a special enough type of model to warrant their own self-contained package, but there are also a lot of occasions where they work very well in conjunction with classic parametric probabilistic models. It might be a bit of a pipe dream, though; I'm still not 100% clear on how similar GPflow is to Edward or other probabilistic packages under the hood.)

alexggmatthews commented 7 years ago

@Bonnevie thanks for your interest in our work.

I think Edward is good, interesting work from our colleagues at Columbia, but it isn't the only show in town. For instance, I think that Stan is great software, including the pioneering Stan Math library.

You've hit the nail on the head: GPs are special in a number of ways. This means that in GP inference some things are easier and some things are harder than in other models. I would argue they are also important models. Consequently there is a thriving community of people working on them specifically, and a number of good specialist GP software packages available, such as GPML, GPy, and GPStuff (apologies to any packages I have missed).

To me the issue here is not compatibility with any single TF package; it is the ability to get at the underlying TensorFlow workings of GPflow. For instance, it would also be nice to integrate it with neural network models, whether in native TF, Keras, or another such package. This more general compatibility is on our to-do list.

I think it is very unlikely you will see an integration with Edward or any similar project, for the reasons we've discussed above and because we're really happy with the GPflow project structure and team as it is.

jameshensman commented 7 years ago

Thanks for all the great discussion.

To address some specific questions:

certain amount of copy-pasting from source involved

Hmm, I wasn't trying to advocate for copy-paste, but for reusing the functions that we've split out, like GPflow.conditionals.conditional.

Would calling build_likelihood() of a submodel inside a higher-level model work out of the box?

I suspect so, and I think we'd be happy to accept changes to make it so if not, so long as they aren't too intrusive.

could another model be added as a prior to a parameter?

That sounds like an interesting feature, and I guess that's what @dustinvtran is getting at. I don't have plans to make it so, but I'd be really interested to hear a specific (GP-related) use case.

jameshensman commented 7 years ago

As suspected, creating a model that contains models seems to work without any modification:

import GPflow
import tensorflow as tf
from functools import reduce

X1, Y1, X2, Y2 = my_data_loading_function()
m1 = GPflow.gpr.GPR(X1, Y1, GPflow.kernels.Matern32(1))
m2 = GPflow.gpr.GPR(X2, Y2, GPflow.kernels.Matern32(1))

class ProductModel(GPflow.model.Model):
    def __init__(self, models):
        GPflow.model.Model.__init__(self)
        self.models = GPflow.param.ParamList(models)

    def build_likelihood(self):
        return reduce(tf.add, (m.build_likelihood() for m in self.models))

m = ProductModel([m1, m2])
m.optimize()  # works as expected
print(m)  # works as expected

Bonnevie commented 7 years ago

Ah, thanks for trying it out @jameshensman. I was trying a similar example but messed it up by using tf.scan in place of reduce, in a misguided attempt to be idiomatic (tf.scan apparently doesn't play well with ordinary Python lists).

jameshensman commented 7 years ago

@Bonnevie , could I have a quick look at your non-working example, to see if there's anything GPflow side that might need work in order to use scan? Thanks.

Bonnevie commented 7 years ago

@jameshensman, thanks, but it was just an issue with the standard list object: scan is designed to loop over tensors only, not ordinary Python iterables, which the error message didn't do a great job of emphasizing.

By the way, just to motivate my quest for compositionality: what I'm working on is similar to your chained Gaussian process methodology, which is a pretty general setup for combining N latent GPs under one likelihood. What would be the idiomatic way to develop that in GPflow, in your opinion? Using a variation on the above ProductModel class with a more extensive build_likelihood? Does the Likelihood class readily extend to multi-GP likelihoods?

jameshensman commented 7 years ago

Calling @alansaul

@Bonnevie , it should be easy enough to recreate the chained GP paper in GPflow. In fact, I've had this in mind for a while, and the way that kullback_leiblers and conditionals are split out should make this easy enough.

As you've recognised, the main bit of new code will be in the likelihood. I think the tidiest way to do this will be to make a new likelihood base class (MultiLikelihood, say) which echoes the Likelihood base class, but expects multiple latent functions -- perhaps concatenated, perhaps in a list.

You'll note that the chained GP requires two-dimensional quadrature. @markvdw has already implemented this in the ekernels file; I would suggest splitting it out to make it reusable.

In the likelihoods module, the base class allows us to do quadrature, and inheriting classes, which implement specific likelihoods, can override this if the expressions turn out to be tractable. This pattern also appears in ekernels; I strongly suggest using it again.
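
To make that concrete, here is a rough sketch of the shape of the idea. It is not a proposal for the final API: the names MultiLikelihood, variational_expectations and HeteroscedasticGaussian are illustrative, and it's written in NumPy for brevity rather than as real GPflow/TensorFlow code.

import numpy as np
from itertools import product

class MultiLikelihood:
    """Likelihood of an observation Y given K latent function values f_1, ..., f_K."""
    num_gauss_hermite_points = 20

    def logp(self, F, Y):
        # F: array of shape [K], one value per latent function; Y: a scalar observation
        raise NotImplementedError

    def variational_expectations(self, Fmu, Fvar, Y):
        # E_q[log p(Y | f_1, ..., f_K)] under independent Gaussians q(f_k) = N(Fmu[k], Fvar[k]).
        # Default: K-dimensional tensor-product Gauss-Hermite quadrature,
        # which is only sensible for K up to about three or four.
        K = len(Fmu)
        x, w = np.polynomial.hermite_e.hermegauss(self.num_gauss_hermite_points)
        w = w / np.sqrt(2 * np.pi)  # normalise so the weights sum to 1 (standard normal)
        total = 0.0
        for idx in product(range(len(x)), repeat=K):  # tensor-product grid over K dimensions
            F = Fmu + np.sqrt(Fvar) * x[list(idx)]
            total += np.prod(w[list(idx)]) * self.logp(F, Y)
        return total

class HeteroscedasticGaussian(MultiLikelihood):
    # Chained-GP style example: f_1 is the mean and exp(f_2) the standard deviation.
    def logp(self, F, Y):
        mu, sigma = F[0], np.exp(F[1])
        return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((Y - mu) / sigma) ** 2
    # A subclass whose expectation is analytically tractable would override
    # variational_expectations instead of falling back to quadrature.

# example: E_q[log N(Y=0.5 | f1, exp(f2)^2)] with q(f1)=N(0,1), q(f2)=N(0,0.1)
# HeteroscedasticGaussian().variational_expectations(np.array([0., 0.]), np.array([1., 0.1]), 0.5)

The real thing would of course be in TensorFlow and vectorised over data points, but the override pattern is the same one ekernels uses.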

I wonder if there's enough interest in this model to include it in GPflow... would you be interested in contributing, @Bonnevie ? @alexggmatthews , @markvdw , what do you think?

Bonnevie commented 7 years ago

@jameshensman Sure, I'd love to contribute!

If you want to integrate it into GPflow in a way that respects the code design, would something as simple as the ProductModel class be the appropriate container for several latent GPs? Or would you require additional functionality?

Actually, I see an issue with the above suggestion, since all Models seem to assume an existing likelihood; i.e. there is no object for a GP prior.

A more hypothetical scenario (but one relevant to me) is a chained GP with a mix of fully latent GPs (the standard setup) and latent GPs with an auxiliary likelihood and auxiliary observations, the consequence of which is that you have a mix of latent GPs with and without observations before even taking the chained likelihood into account.

markvdw commented 7 years ago

Re: MultiLikelihood. Yeah, I would like to have this in GPflow. I already have some code that does this, but it would be worth rethinking it to make it neat.

jameshensman commented 7 years ago

Something like the ProductModel example could work as a base class, yes. It would be nice to have MCMC and VB versions, like the other models in GPflow have.

As for your hypothetical scenario, I suggest doing this through a highly specialised likelihood. You can do something like this in GPs that do not have multiple latents (1) using the SwitchedLikelihood class, which lets you assign different likelihoods to different data. I think a similar thing could work for the chained GP case: you could indicate to the likelihood that some observations depend on only one of the latent functions (with e.g. very small Gaussian noise).

This way, all GPflow users get the benefits of a chained GP, and you only have to write a small amount of code that specialises it to your problem.

(1) More precisely, all GPflow models can currently have multiple latents, but they have to share kernels (and inducing points). This is how we deal with multiple columns of Y; it's something we inherited from as far back as Neil Lawrence's Matlab toolbox, and I think it's a good pattern.
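
To illustrate the switching idea, here is a rough sketch. It is not the actual SwitchedLikelihood implementation, and the convention of putting an integer index in the last column of Y is just for illustration.

import tensorflow as tf

class SwitchedLikelihoodSketch:
    # Routes each row of the data to one of several component likelihoods.
    def __init__(self, likelihood_list):
        self.likelihood_list = likelihood_list  # objects with a logp(F, Y) method

    def logp(self, F, Y):
        # assume the last column of Y is an integer index choosing the component
        idx = tf.cast(Y[:, -1], tf.int32)
        Ydata = Y[:, :-1]
        num = len(self.likelihood_list)
        F_parts = tf.dynamic_partition(F, idx, num)
        Y_parts = tf.dynamic_partition(Ydata, idx, num)
        # each component scores only its own rows; F can hold several latent columns
        logps = [lik.logp(f, y) for lik, f, y in zip(self.likelihood_list, F_parts, Y_parts)]
        return tf.reduce_sum(tf.concat(logps, axis=0))

In the chained GP case, the rows routed to a near-noiseless Gaussian component would tie one latent function to the auxiliary observations, while the remaining rows use the chained likelihood over all the latents.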

alexggmatthews commented 7 years ago

Hi everyone,

I think it would be good to include MultiLikelihood in GPflow, with the usual caveats that it needs to be neat and not have an adverse effect on the rest of the code. We would probably want to include the basic ability to do MultiLikelihood and a couple of the most important examples, but not a multitude of special cases. The special cases would IMHO be better left to users to implement for their own projects.

@jameshensman it is worth noting that the MultiClass classification likelihood is already a case of this, albeit with an analytic trick for reducing the quadrature down to 1D.

My understanding from the analysis in the chained GP paper is that, without analytic tricks, the dimensionality of the quadrature can be pushed to three or four dimensions but not much further. Is this about right? If so, I suggest we add warnings to this effect, just as we do in ekernels.

I would also be very interested to know what @alansaul thinks given his experience with the chained GP paper.

alansaul commented 7 years ago

Hey everyone,

I'm not super familiar with the GPflow codebase (though I intend to be in the near future; it looks like a great package), but in the GPy implementation we used a specialised likelihood that takes multiple latent functions, as James suggested.

In terms of quadrature, I have only used 2D Gauss-Hermite quadrature, which appears to work well; I would say that with more latent functions, Monte Carlo sampling may be the way to go to approximately evaluate variationalExpectations and its gradients. I assume that implementing sample-based gradients for variationalExpectations in TF would be relatively simple. Is this already implemented for any other models?
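
Something like this is what I have in mind, just as a sketch (not existing GPflow code; the function name and the current tf.random API are my own assumptions):

import tensorflow as tf

def mc_variational_expectations(logp, Fmu, Fvar, Y, num_samples=100):
    # Reparameterised Monte Carlo estimate of E_q[log p(Y | F)] where
    # q(F) = N(Fmu, diag(Fvar)); gradients w.r.t. Fmu and Fvar come from autodiff.
    # Fmu, Fvar: [N, K] means and variances of K latent functions at N inputs.
    shape = tf.concat([[num_samples], tf.shape(Fmu)], axis=0)
    eps = tf.random.normal(shape, dtype=Fmu.dtype)
    F = Fmu + tf.sqrt(Fvar) * eps              # [S, N, K] samples via the reparameterisation trick
    return tf.reduce_mean(logp(F, Y), axis=0)  # Monte Carlo average over the S samples

The only requirement is that logp broadcasts over the leading sample dimension.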

One of the problems with the chained GP is that it can be a little difficult to optimise, as a moderate number of inducing inputs requires many variational parameters to be learnt; I find that fixing kernel parameters and inducing points at the beginning of the optimisation helps here. I'd be interested to see a non-sparse version using a composition of VGP models, which I assume would be trivial if a multi-likelihood is introduced, and would require far fewer variational parameters (obviously only for smaller data, though).