cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch
MIT License

[Docs] Inter-task kernel in SVGP setting #1733

Open mochar opened 3 years ago

mochar commented 3 years ago

I want to reproduce in gpytorch what is described in the section "Use as an output cross-covariance" in this paper.

In short, the authors define a kernel on graphs (i.e. the x-values are node indices). In this section they explain how it can be used in a coregionalization model to fit a GP f: R^d -> R^{|V|}, where |V| is the number of nodes. The graph therefore acts as prior knowledge about which outputs (tasks) are likely to be more correlated.

This is actually exactly how the IndexKernel is used in the Hadamard example. However, I'd like this to work with SVGP, as I have many tasks.

There are two open issues asking for the Hadamard model in the SVGP setting, here and here, with good suggestions from @gpleiss. However I'm not interested in modelling irregular inputs but rather having a kernel that acts on the features. Concretely, I'm asking how I can use such a kernel in an SVGP model.

Thanks!

gpleiss commented 3 years ago

However I'm not interested in modelling irregular inputs but rather having a kernel that acts on the features.

I am not sure what you mean by this. Can you elaborate more?

mochar commented 3 years ago

I am not sure what you mean by this. Can you elaborate more?

@gpleiss I have a full set of inputs for every task, so irregular inputs are not a concern. I would just like to use a kernel on the tasks, as described in the section of the paper I quoted.

gpleiss commented 3 years ago

This can be accomplished using the LMCVariationalStrategy here: https://docs.gpytorch.ai/en/stable/examples/04_Variational_and_Approximate_GPs/SVGP_Multitask_GP_Regression.html. Just set num_latents to be the same value as num_tasks. The forward method of your SVGP model should define k_Rd(x, x'), and the LMCVariationalStrategy implicitly defines k_G(I, I').
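For reference, a condensed sketch of that setup, adapted from the linked tutorial, with num_latents set equal to num_tasks (the RBF data kernel, constant mean, and shapes here are just the tutorial defaults, not something prescribed in this thread):

```python
import torch
import gpytorch


class MultitaskSVGPModel(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points, num_tasks):
        # One latent GP per task, so the LMC mixing matrix is square (num_tasks x num_tasks).
        num_latents = num_tasks
        # inducing_points: (num_latents, M, d) -- a separate set of M inducing points per latent GP
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(-2), batch_shape=torch.Size([num_latents])
        )
        variational_strategy = gpytorch.variational.LMCVariationalStrategy(
            gpytorch.variational.VariationalStrategy(
                self, inducing_points, variational_distribution, learn_inducing_locations=True
            ),
            num_tasks=num_tasks,
            num_latents=num_latents,
            latent_dim=-1,
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean(batch_shape=torch.Size([num_latents]))
        # forward() defines k_Rd(x, x'); the LMC coefficients implicitly define k_G(I, I')
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(batch_shape=torch.Size([num_latents])),
            batch_shape=torch.Size([num_latents]),
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
```

This is trained with a `gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=num_tasks)` and a variational ELBO, exactly as in the tutorial.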

mochar commented 3 years ago

Thanks @gpleiss. I am indeed using LMCVariationalStrategy, so it's good to know that I'm on the right track. However, in order to use the graph (nodes = tasks), I need to somehow pass the graph kernel that uses this graph to LMCVariationalStrategy, rather than using the implicit k_G(I, I') kernel. Is this possible? Furthermore, would LMCVariationalStrategy be able to optimize the graph kernel parameters?

gpleiss commented 3 years ago

What is the specific form for k_G(I, I')? If it's a lookup table (which is essentially what IndexKernel is), then my suggestion above will work. If it has an actual functional form, then you probably would need to implement your own variational strategy, similar to LMCVariationalStrategy.
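For context, a small illustration (my own, not from the thread) of why IndexKernel is "essentially a lookup table": its inter-task covariance is a freely parameterized matrix indexed by task, with no functional dependence on anything like a graph:

```python
import torch
import gpytorch

num_tasks = 4
task_kernel = gpytorch.kernels.IndexKernel(num_tasks=num_tasks, rank=2)

# The learned inter-task covariance is covar_factor @ covar_factor.T + diag(var):
# a free (num_tasks x num_tasks) table looked up by task index, not a function of the tasks.
B = task_kernel.covar_factor @ task_kernel.covar_factor.T + torch.diag(task_kernel.var)
print(B.shape)  # torch.Size([4, 4])
```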

mochar commented 3 years ago

@gpleiss (should I @ or is that annoying? not sure how this works) Yes, it has a functional form. It essentially uses the scaled graph Laplacian as the covariance matrix (equation 12 in the paper). Any pointers on how I would go about implementing my own variational strategy to use a custom kernel? Would it be as simple as re-implementing the MultitaskKernel forward method within MyVariationalStrategy.__call__, but using the scaled graph Laplacian rather than self.task_covar_module.covar_matrix?

gpleiss commented 3 years ago

@ is fine 😄

I'm not sure that what we have right now in GPyTorch will work for you, and I'm also wondering whether an SVGP strategy is the right call. Our multi-task variational methods are written assuming that you will have ~5-20 tasks or latent dimensions. They will be inefficient if you have |V| of them (here I'm assuming |V| is in the 100s or 1000s).

However, I think it'll take some significant thought to get an SVGP-style model to work in this case. Your inducing points will need to live in the R^d \times V space (where d is the dimensionality of an input x). Since V is a discrete space, you'd probably want one set of inducing points for each node, which I imagine would get very computationally expensive if you have lots of nodes.

wjmaddox commented 3 years ago

So the strategy proposed in the Matérn graph kernel paper is to do an interdomain inducing point strategy, where the inducing points are the eigenvalues/eigenvectors of the covariance applied to the adjacency matrix of the graph. I can try to think about how to implement something like that, but in general I don't really know how to do interdomain inducing strategies in gpytorch right now.

@mochar Would the graph be small enough to store in memory? If so, then we could probably put something up that doesn't sub-sample / e.g. have inducing points across the graph.

mochar commented 3 years ago

@gpleiss I have run the multi-task SVGP-LVM on ~1000 tasks and it seems to work fine. While messing around with the model I use an LCM or train only one batch of inducing points to make it go faster, but the results seem ok.

Regarding your earlier suggestion to use a custom variational strategy: from looking at the code of the VariationalStrategy implementations, I gathered that its job is to transform the batch of multivariate normals into one multitask multivariate normal. Would it not be possible to use the scaled graph Laplacian (the covariance matrix of the tasks) here, and make the parameters by which we scale the Laplacian trainable, similar to the LMC coefficients in LMCVariationalStrategy? I wouldn't know how to use the task covariances to transform the multivariate normals into a multitask multivariate normal - perhaps you have an idea?

@wjmaddox I am not sure that's correct for the section I copied. It explains that the input/inducing points are in R^d, so the data kernel can be e.g. the RBF kernel. It is the task kernel that uses the graph Laplacian - not directly, though: you take the (truncated) eigenvectors and eigenvalues, which you then use to reconstruct the (approximate) Laplacian. The reason for this is that the scaling should be applied to the eigenvalues. If you do not optimize the parameters, you can just use the (approximate) Laplacian as the task kernel covariance. And yes, the graph (i.e. the eigenvectors and eigenvalues of the graph Laplacian) is definitely small enough to store in memory.
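To make that concrete, here is a small sketch (my own illustration; the Matérn-style spectral scaling Φ(λ) = σ²(2ν/κ² + λ)^(−ν) and the parameter names are assumptions that should be checked against equation 12 in the paper) of reconstructing the approximate task covariance from truncated eigenpairs of the graph Laplacian, with the scaling applied to the eigenvalues only:

```python
import torch


def scaled_laplacian_task_covar(eigvals, eigvecs, kappa, nu, sigma2=1.0):
    """Approximate inter-task covariance from truncated Laplacian eigenpairs.

    eigvals: (k,)   smallest k eigenvalues of the graph Laplacian
    eigvecs: (T, k) corresponding eigenvectors (T = number of tasks/nodes)
    kappa, nu, sigma2 would be trainable in practice.
    """
    # Assumed Matérn-style scaling of the eigenvalues (cf. eq. 12 in the paper)
    phi = sigma2 * (2.0 * nu / kappa**2 + eigvals) ** (-nu)
    # Reconstruct the (approximate) T x T task covariance: U diag(phi) U^T
    return eigvecs @ torch.diag(phi) @ eigvecs.T


# Toy usage on a 3-node path graph (hypothetical numbers)
L = torch.tensor([[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]])
eigvals, eigvecs = torch.linalg.eigh(L)
K_T = scaled_laplacian_task_covar(eigvals[:2], eigvecs[:, :2], kappa=1.0, nu=1.5)
```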

I have also implemented the graph kernel as explained in the main paper, i.e. where the data points X are graph nodes. I can share my implementation when I have the time.

gpleiss commented 2 years ago

Would it not be possible to use the scaled graph laplacian (the covariance matrix of the tasks) here, and make the parameters by which we scale the laplacian trainable, similar to the LMC coefficients in LMCVariationalStrategy?

This is the basic idea, yes. Note that, with LMCVariationalStrategy, the inter-task covariance matrix is defined by lmc_coefficients @ lmc_coefficients.T. So similarly, you would want to replace lmc_coefficients with the Cholesky/square root of the graph Laplacian.
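A minimal sketch of that substitution (my own illustration, not an existing gpytorch API; the module name and the Matérn-style eigenvalue scaling are assumptions): a trainable factor whose outer product is the scaled-Laplacian task covariance, which could stand in for lmc_coefficients inside a custom strategy modeled on LMCVariationalStrategy:

```python
import torch


class GraphTaskRoot(torch.nn.Module):
    """Trainable square root of the task covariance U diag(phi(lambda)) U^T.

    Returns a (num_tasks x k) factor R with R @ R.T equal to the scaled
    (approximate) graph Laplacian covariance -- the quantity that would
    replace lmc_coefficients @ lmc_coefficients.T in an LMC-style strategy.
    """

    def __init__(self, eigvals, eigvecs):
        super().__init__()
        self.register_buffer("eigvals", eigvals)  # (k,) fixed Laplacian eigenvalues
        self.register_buffer("eigvecs", eigvecs)  # (T, k) fixed eigenvectors
        # Trainable kernel parameters (softplus keeps them positive)
        self.raw_kappa = torch.nn.Parameter(torch.zeros(()))
        self.raw_nu = torch.nn.Parameter(torch.zeros(()))

    def forward(self):
        kappa = torch.nn.functional.softplus(self.raw_kappa)
        nu = torch.nn.functional.softplus(self.raw_nu)
        # Assumed Matérn-style scaling of the eigenvalues (cf. eq. 12 in the paper)
        phi = (2.0 * nu / kappa**2 + self.eigvals) ** (-nu)
        # Square-root factor: R = U diag(sqrt(phi)), so R @ R.T = U diag(phi) U^T
        return self.eigvecs * phi.sqrt().unsqueeze(0)
```

In a custom __call__, this (num_tasks x k) factor would mix k latent GPs into num_tasks outputs in the place where LMCVariationalStrategy uses its learned lmc_coefficients (transposed to match that parameter's (num_latents x num_tasks) layout); the Cholesky of the full T x T covariance, as suggested above, is the special case where all T latents are kept.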

mochar commented 2 years ago

@gpleiss Regarding your suggestion to replace lmc_coefficients with the Cholesky of the graph Laplacian: the Cholesky will be TxT (T = number of tasks), so doesn't that mean I have no choice but to fit T latent GPs as well? Also, since it's lower triangular, wouldn't that increasingly deactivate more latent GPs, given that the posterior mean of a task is a linear combination of the latent GPs, with the values in the Cholesky rows as the weights?

gpleiss commented 2 years ago

@mochar sorry for the slow reply!

The Cholesky will be TxT (T = number of tasks), so doesn't that mean I have no choice but to fit T latent GPs as well?

Yes

Also, since it's lower triangular, wouldn't that increasingly deactivate more latent GPs, given that the posterior mean of a task is a linear combination of the latent GPs, with the values in the Cholesky rows as the weights?

Yes. You could also use the matrix square root of the graph Laplacian instead, which would make things not lower triangular. In practice, however, I've found that lower triangular matrices often accelerate convergence.
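For illustration (my own example, not from the thread): both factors reproduce the same task covariance, but the Cholesky factor is lower triangular (task i only mixes the first i latent GPs) while the symmetric square root spreads weight across all of them:

```python
import torch

# A small SPD task covariance (hypothetical numbers)
K = torch.tensor([[2.0, 0.5, 0.2],
                  [0.5, 1.5, 0.3],
                  [0.2, 0.3, 1.0]])

# Lower-triangular factor: task i only mixes the first i latent GPs
L_chol = torch.linalg.cholesky(K)

# Symmetric square root via the eigendecomposition: dense mixing weights
eigvals, eigvecs = torch.linalg.eigh(K)
K_sqrt = eigvecs @ torch.diag(eigvals.sqrt()) @ eigvecs.T

# Both satisfy F @ F.T = K
assert torch.allclose(L_chol @ L_chol.T, K, atol=1e-6)
assert torch.allclose(K_sqrt @ K_sqrt.T, K, atol=1e-6)
```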