cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch

[Feature Request] Using Custom Lazy Tensors #777

Closed g-benton closed 5 years ago

g-benton commented 5 years ago

🚀 Feature Request

I don't know if this is already possible, but I haven't been able to get it to work yet. I would like to be able to write a custom lazy tensor class with its own inv_quad_logdet and have it be used as the covariance matrix for likelihood(model(train_x)) calls.

Motivation

Given the structure of my data, I know that the covariance matrix produced by likelihood(model(train_x)) decomposes into a Kronecker matrix plus a scaled identity matrix (the noise term), which allows fast computation of inverse quadratic forms and log determinants (see Appendix Section 3 of http://www.cs.cmu.edu/~andrewgw/manet.pdf). The code for a kernel wrapper class and a lazy tensor that computes fast log determinants is fairly straightforward, but I'm unsure how to hook into the GPyTorch internals so that the custom inv_quad_logdet method is used.

This is pretty closely related to https://github.com/cornellius-gp/gpytorch/issues/725, so maybe a similar solution could be used.

jacobrgardner commented 5 years ago

@g-benton I'm not sure what you'd like here beyond what already exists in the package. A Kronecker matrix plus a scaled identity matrix can be represented by an AddedDiagLazyTensor(KroneckerProductLazyTensor, DiagLazyTensor) structure, which will use CG for fast inference.
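
For concreteness, here is a small sketch of that structure; the sizes and the noise value are invented for the example, and it assumes the gpytorch.lazy classes named above plus the lazify helper:

    import torch
    from gpytorch.lazy import AddedDiagLazyTensor, DiagLazyTensor, KroneckerProductLazyTensor, lazify

    n1, n2 = 100, 50
    A, B = torch.randn(n1, 5), torch.randn(n2, 5)
    K1 = lazify(A @ A.t() + 1e-3 * torch.eye(n1))  # small PSD factor
    K2 = lazify(B @ B.t() + 1e-3 * torch.eye(n2))  # small PSD factor
    kron = KroneckerProductLazyTensor(K1, K2)      # (n1*n2) x (n1*n2), never formed densely
    covar = AddedDiagLazyTensor(kron, DiagLazyTensor(0.01 * torch.ones(n1 * n2)))

    # CG-based routines work through this structure without evaluating the full matrix
    rhs = torch.randn(n1 * n2, 1)
    inv_quad, logdet = covar.inv_quad_logdet(inv_quad_rhs=rhs, logdet=True)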

The Kronecker-specific methods in #725 are slightly faster on the CPU for relatively small matrices, but it's not clear that they are much faster on the GPU. What is currently implemented should be pretty fast on the GPU, and is much faster than the naive O(m^2 n^2) approach.

If you still want to write a new LazyTensor, the necessary steps are pretty well documented in the LazyTensor docs. You don't specifically need to override inv_quad_logdet, although you can if you'd like. Either way, once you've written a new LazyTensor, all you need to do to use it in GPyTorch is write a kernel that returns one.
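
A minimal sketch of that pattern follows; the kernel name is hypothetical, and wrapping dense base-kernel evaluations with lazify is just one way to build the factors (a fully custom LazyTensor could be returned in place of the KroneckerProductLazyTensor):

    from gpytorch.kernels import Kernel
    from gpytorch.lazy import KroneckerProductLazyTensor, lazify

    class KroneckerGridKernel(Kernel):
        """Hypothetical kernel whose forward() returns a lazy, Kronecker-structured covariance."""

        def __init__(self, base_kernel, grid_factors, **kwargs):
            super().__init__(**kwargs)
            self.base_kernel = base_kernel
            self.grid_factors = grid_factors  # one 1-d tensor of grid points per dimension

        def forward(self, x1, x2, diag=False, **params):
            # assumes x1 == x2 == the full product grid, as in the GridKernel setting
            factors = [
                lazify(self.base_kernel(g.unsqueeze(-1), g.unsqueeze(-1)).evaluate())
                for g in self.grid_factors
            ]
            res = KroneckerProductLazyTensor(*factors)
            return res.diag() if diag else res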

andrewgordonwilson commented 5 years ago

In the pure Kronecker case, without missing data, the fastest way to do inference and learning requires only eigendecompositions of the constituent matrices and a few Kronecker MVMs, rather than CG. This would be useful functionality for spatiotemporal regression problems, which are often exactly on a Cartesian product grid.
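
For reference, the identities behind this, written for two factors with eigendecompositions K_d = Q_d \Lambda_d Q_d^\top:

    (K_1 \otimes K_2 + \sigma^2 I)^{-1} y
        = (Q_1 \otimes Q_2) \left( \Lambda_1 \otimes \Lambda_2 + \sigma^2 I \right)^{-1} (Q_1 \otimes Q_2)^\top y
    \log\det\left( K_1 \otimes K_2 + \sigma^2 I \right)
        = \sum_{i,j} \log\left( \lambda^{(1)}_i \lambda^{(2)}_j + \sigma^2 \right)

so only the small constituent matrices are ever eigendecomposed, and the remaining cost is a handful of Kronecker matrix-vector multiplies.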

g-benton commented 5 years ago

Here's a small chunk of code that hopefully further explains why we would want a custom inv_quad_logdet function (which I believe requires having a custom lazy tensor returned by likelihood(model(train_x))).

With a grid_size of 100 everything is fine, but when you increase grid_size to 1000 the custom method is dramatically faster.

I'm splitting up the Kronecker and noise components to make it work here, but in practice we would want to use a custom lazy tensor that held both the Kronecker and noise components.

import torch
import math
import gpytorch

# Build a 1-d grid per dimension; train_x is the full Cartesian product of the
# grid columns (grid_size ** 2 points), so grid_size = 1000 gives 1,000,000 points.
grid_bounds = [(0, 1), (0, 1)]
grid_size = 1000
grid = torch.zeros(grid_size, len(grid_bounds))
for i in range(len(grid_bounds)):
    grid_diff = float(grid_bounds[i][1] - grid_bounds[i][0]) / (grid_size - 2)
    grid[:, i] = torch.linspace(grid_bounds[i][0] - grid_diff, grid_bounds[i][1] + grid_diff, grid_size)

train_x = gpytorch.utils.grid.create_data_from_grid(grid)
train_y = torch.sin((train_x[:, 0] + train_x[:, 1]) * (2 * math.pi)) + torch.randn_like(train_x[:, 0]).mul(0.01)

from gpytorch.means import ConstantMean
from gpytorch.kernels import RBFKernel, GridKernel
from gpytorch.distributions import MultivariateNormal

class GPRegressionModel(gpytorch.models.ExactGP):
    def __init__(self, grid, train_x, train_y, likelihood):
        super(GPRegressionModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = ConstantMean()
        self.base_covar_module = RBFKernel()
        self.covar_module = GridKernel(self.base_covar_module, grid=grid)

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return MultivariateNormal(mean_x, covar_x)

lh = gpytorch.likelihoods.GaussianLikelihood()
model = GPRegressionModel(grid, train_x, train_y, lh)

# The standard method works for smaller sizes of data

cov = lh(model(train_x)).lazy_covariance_matrix
cov.inv_quad_logdet(logdet=True)

# Using a custom method we can run this on up to 1,000,000 data points

def logdet(kron, noise):
    # log|K + noise * I| where K is a Kronecker product: the eigenvalues of the
    # Kronecker product are the pairwise products of the factors' eigenvalues
    sub_eigs = []
    for lt in kron.lazy_tensors:
        # eig() returns (eigenvalues, eigenvectors); [:, 0] takes the real parts
        sub_eigs.append(lt.evaluate().eig()[0][:, 0].unsqueeze(-1))

    eigs = sub_eigs[0].matmul(sub_eigs[1].t())  # outer product of eigenvalues
    return torch.log(eigs + noise).sum()

noise = lh.noise.data
kron = model.covar_module.forward(train_x, train_x)
print(logdet(kron, noise))

jacobrgardner commented 5 years ago

Yeah, so this issue just boils down to the fact that it is faster to run CG on the individual submatrices than to use the full Kronecker product's matmul algorithm on the full matrix.

In general you can do even better than eigendecomposing the submatrices, because you can run CG on the individual submatrices. @mshvartsman's snippet from #725 computes inv_quad_logdet for KroneckerProductLazyTensor using the inv_matmul and logdet calls of the individual submatrices:

    def inv_quad_logdet(self, inv_quad_rhs=None, logdet=False, reduce_inv_quad=True):

        nrows, ncols = inv_quad_rhs.shape

        nfactors = len(self.lazy_tensors)

        logdets = torch.zeros(nfactors)
        tensor_shapes = [q.shape[0] for q in self.lazy_tensors]
        multipliers = torch.Tensor([torch.prod(torch.Tensor([ts for i, ts in enumerate(tensor_shapes) if i!=j]))
                                for j in range(len(tensor_shapes))])

        y = inv_quad_rhs.clone()
        for i, q in enumerate(self.lazy_tensors):
            n = q.shape[0]
            # ideally we could just get inv_matmul and logdet from the one call instead of
            # two here, reusing solves the same way we do in usual InvQuadLogDet
            y = q.inv_matmul(y.contiguous().view(n, -1))
            logdets[i] = q.logdet()
            y = y.view(n, nrows//n, -1).permute(1, 0, 2)

        logdet_term = torch.sum(logdets * multipliers)

        inv_quad_term = torch.trace(inv_quad_rhs.t() @ y.contiguous().view(nrows, ncols))

        return inv_quad_term, logdet_term

This could replace the existing KroneckerProductLazyTensor.inv_quad_logdet once it is modified to support batch-mode operations (and to skip the logdet computation when it isn't asked for).
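
For concreteness, a sketch of those two guards applied to the snippet above; batch mode is still not handled here, and reduce_inv_quad is ignored just as in the original:

    def inv_quad_logdet(self, inv_quad_rhs=None, logdet=False, reduce_inv_quad=True):
        # only compute the logdet term when it is requested
        logdet_term = torch.zeros([], dtype=self.dtype, device=self.device)
        if logdet:
            tensor_shapes = [q.shape[0] for q in self.lazy_tensors]
            total_size = 1
            for n in tensor_shapes:
                total_size *= n
            # multiplier for each factor = product of the other factors' sizes
            logdet_term = sum(q.logdet() * (total_size // n) for q, n in zip(self.lazy_tensors, tensor_shapes))

        # only compute the quadratic term when a right-hand side is supplied
        inv_quad_term = torch.zeros([], dtype=self.dtype, device=self.device)
        if inv_quad_rhs is not None:
            nrows, ncols = inv_quad_rhs.shape
            y = inv_quad_rhs.clone()
            for q in self.lazy_tensors:
                n = q.shape[0]
                y = q.inv_matmul(y.contiguous().view(n, -1))
                y = y.view(n, nrows // n, -1).permute(1, 0, 2)
            inv_quad_term = torch.trace(inv_quad_rhs.t() @ y.contiguous().view(nrows, ncols))

        return inv_quad_term, logdet_term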

g-benton commented 5 years ago

So I think where this differs is that we want to compute the log determinant of K + \sigma^2 I (not just of K), where K is a Kronecker product. So even if KroneckerProductLazyTensor.inv_quad_logdet is updated per issue #725, that won't be exploited when calls like likelihood(model(train_x)) are made, correct? Because the covariance is then stored as an AddedDiagLazyTensor?

jacobrgardner commented 5 years ago

@g-benton Right, okay I agree that in the exact GP setting you might get more efficiency out of something like a KroneckerPlusDiagLazyTensor.

To return to the original question for this issue, what specific problems are you running into when writing a new LazyTensor? If I understand you correctly, you have an extended LazyTensor and kernel, and the only issue you are having is that likelihood(model(x)) is wrapping your lazy tensor in an AddedDiagLazyTensor(YourLazyTensor, DiagLazyTensor(noise)).

Here's what a potentially decent solution would be:

  1. Have your kernel return a KroneckerPlusDiagLazyTensor (or whatever) with diag = None initially.
  2. In your lazy tensor, override LazyTensor.__add__ (https://github.com/cornellius-gp/gpytorch/blob/8cc223b6c6a108de21b1c8aa9336cd1c7c2d4779/gpytorch/lazy/lazy_tensor.py#L1581) so that adding a DiagLazyTensor to your KroneckerPlusDiagLazyTensor returns a new KroneckerPlusDiagLazyTensor instead of an AddedDiagLazyTensor; a sketch of both steps follows this list.
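
A rough, untested sketch of what such a class could look like; the class name comes from the suggestion above, and everything else (exactly two Kronecker factors, an approximately constant noise diagonal for the fast logdet, falling back to CG for the quadratic term) is an assumption of this example:

    import torch
    from gpytorch.lazy import DiagLazyTensor, LazyTensor

    class KroneckerPlusDiagLazyTensor(LazyTensor):
        """Represents K + D, with K a KroneckerProductLazyTensor and D diagonal."""

        def __init__(self, kron_lt, diag_lt=None):
            if diag_lt is None:
                # step 1: the kernel can return this type with an (initially zero) diagonal
                diag_lt = DiagLazyTensor(torch.zeros(kron_lt.size(-1), dtype=kron_lt.dtype, device=kron_lt.device))
            super().__init__(kron_lt, diag_lt)
            self.kron_lt = kron_lt
            self.diag_lt = diag_lt

        def _matmul(self, rhs):
            return self.kron_lt.matmul(rhs) + self.diag_lt.matmul(rhs)

        def _size(self):
            return self.kron_lt.size()

        def _transpose_nonbatch(self):
            return self  # symmetric

        def __add__(self, other):
            # step 2: absorb the likelihood's noise diagonal instead of letting
            # GPyTorch wrap everything in an AddedDiagLazyTensor
            if isinstance(other, DiagLazyTensor):
                new_diag = DiagLazyTensor(self.diag_lt.diag() + other.diag())
                return KroneckerPlusDiagLazyTensor(self.kron_lt, new_diag)
            return super().__add__(other)

        def inv_quad_logdet(self, inv_quad_rhs=None, logdet=False, reduce_inv_quad=True):
            # fast logdet from the Kronecker eigenvalues (as in the earlier snippet);
            # assumes exactly two factors and an approximately constant diagonal
            logdet_term = torch.zeros([], dtype=self.dtype, device=self.device)
            if logdet:
                noise = self.diag_lt.diag().mean()
                sub_eigs = [lt.evaluate().symeig()[0] for lt in self.kron_lt.lazy_tensors]
                eigs = sub_eigs[0].unsqueeze(-1) * sub_eigs[1].unsqueeze(0)
                logdet_term = torch.log(eigs + noise).sum()

            inv_quad_term = torch.zeros([], dtype=self.dtype, device=self.device)
            if inv_quad_rhs is not None:
                # fall back to the generic CG solve for the quadratic term
                solves = self.inv_matmul(inv_quad_rhs)
                inv_quad_term = (inv_quad_rhs * solves).sum(-2)
                if reduce_inv_quad:
                    inv_quad_term = inv_quad_term.sum(-1)

            return inv_quad_term, logdet_term

The kernel from step 1 would then return a KroneckerPlusDiagLazyTensor built around a KroneckerProductLazyTensor with no diagonal, and the noise added by the likelihood would flow in through __add__.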

The only other consideration is ScaleKernel. In the very short term, you may also have to handle what ConstantMulLazyTensor currently does if you use a scale kernel, e.g., instead of using a ScaleKernel, handle the constant in your own kernel and LT. However (cc @gpleiss), we should probably try to get away with having ConstantMulLazyTensor use the base LT's inv_quad_logdet. If we did this, then you'd be able to wrap your custom kernel with a ScaleKernel.
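
One way to "handle the constant in your own kernel", sketched under the same assumptions as the kernel example earlier in the thread (self.outputscale is a hypothetical parameter of your own kernel, not ScaleKernel's): scale a single dense factor before wrapping it, so no ConstantMulLazyTensor is ever created.

    # inside the hypothetical kernel's forward(), instead of wrapping with ScaleKernel
    dense_factors = [self.base_kernel(g.unsqueeze(-1), g.unsqueeze(-1)).evaluate()
                     for g in self.grid_factors]
    dense_factors[0] = self.outputscale * dense_factors[0]  # fold the scale into one factor only
    covar = KroneckerProductLazyTensor(*[lazify(f) for f in dense_factors])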

g-benton commented 5 years ago

@jacobrgardner This looks exactly like what I was looking for - I don't think scaling will be an issue for our use case so hopefully that's not a problem. Thank you!