cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch

[Bug?] Much larger variance with MultiTask kernel compared to Independent Model List #1834

Open ishank-juneja opened 2 years ago

ishank-juneja commented 2 years ago

🐛 Bug

On training a MultiTask-kernel-based model and a collection of independent models tied together in an IndependentModelList on the same dataset, I see variance magnitudes that differ by orders of magnitude. It is unclear why this is the case, since the model parameters common to the two learned models (the MultiTask model MTmodel and the Independent Model List model) seem to be quite similar.

To reproduce

Code snippet to reproduce

import torch
import gpytorch
import numpy as np

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel(ard_num_dims=4))

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(MultitaskGPModel, self).__init__(train_x, train_y, likelihood)
        # https://docs.gpytorch.ai/en/stable/means.html#multitaskmean
        self.mean_module = gpytorch.means.MultitaskMean(
            gpytorch.means.ConstantMean(), num_tasks=2
        )
        # Composition of index kernel and RBF kernel
        self.covar_module = gpytorch.kernels.MultitaskKernel(
            gpytorch.kernels.RBFKernel(ard_num_dims=4), num_tasks=2, rank=2
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        # https://docs.gpytorch.ai/en/stable/distributions.html#multitaskmultivariatenormal
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x, interleaved=False)

# Number of train samples
nsamps = 1000
# Fix seed for reproducibility
np.random.seed(10)
# Joint input space tensor (u_t, x_t) to hold all inputs and trajectories
train_x = torch.tensor(np.random.uniform(low=-1.0, high=1.0, size=(nsamps, 4))).float()

# Generate output samples
# A and B matrices
A = torch.tensor([[1., 0.],
                  [0., 1.]])
B = torch.tensor([[-0.2, 0.1],
                 [0.15, 0.15]])
# Output states starting time index 1, no observation noise
train_y = torch.zeros(nsamps, 2)
# Obtain the output states $(x_{t+1, 1}, x_{t+1, 2})$
for i in range(nsamps):
    # Get the output next state
    x_next = torch.matmul(A, train_x[i, 2:4]) + torch.matmul(B, train_x[i, 0:2])
    # No observation noise added
    train_y[i, :] = x_next

# dataset = torch.cat([train_x, train_y], dim=1)

likelihood1 = gpytorch.likelihoods.GaussianLikelihood()
model1 = ExactGPModel(train_x, train_y[:, 0], likelihood1)

likelihood2 = gpytorch.likelihoods.GaussianLikelihood()
model2 = ExactGPModel(train_x, train_y[:, 1], likelihood2)

# Collect the sub-models in an IndependentMultiOutputGP, and the respective likelihoods in a MultiOutputLikelihood
model = gpytorch.models.IndependentModelList(model1, model2)
likelihood = gpytorch.likelihoods.LikelihoodList(model1.likelihood, model2.likelihood)

mll = gpytorch.mlls.SumMarginalLogLikelihood(likelihood, model)

# Perform Ind. Model List Training
training_iterations = 50
# Find optimal model hyper-parameters
model.train()
likelihood.train()
# Use the Adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes all submodel and all likelihood parameters
], lr=0.1)
print("Training Ind. Model List\n- - - - - - - - - - ")
for i in range(training_iterations):
    optimizer.zero_grad()
    output = model(*model.train_inputs)
    loss = -mll(output, model.train_targets)
    loss.backward()
    if (i + 1) % 5 == 0:
        print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))
    optimizer.step()
print("- - - - - - - - - - ")

# MultitaskGaussianLikelihood allows modelling a full 2x2 noise covariance
MTlikelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2)
MTmodel = MultitaskGPModel(train_x, train_y, MTlikelihood)

training_iterations = 50

# Find optimal MTmodel hyperparameters
MTmodel.train()
MTlikelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': MTmodel.parameters()},  # Includes MultitaskGaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(MTlikelihood, MTmodel)
print("Training MT Model\n- - - - - - - - - - ")
for i in range(training_iterations):
    optimizer.zero_grad()
    output = MTmodel(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    if (i + 1) % 5 == 0:
        print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))
    optimizer.step()
print("- - - - - - - - - - ")

# View the parameters (and others specific to MT)-
# (1) Learned value of observation-noise covariance
# (2) Learned constant mean prior
# (3) Learned kernel scale parameter (\sigma)
# (4) Learned kernel length scale (\ell)

# Output 1 y_{a}
print("- - - - - - - - - \nModel 1a\n- - - - - - - - - ")
print("Learned Noise Covariance")
print(model.models[0].likelihood.noise_covar.noise)
print("Learned constant mean for the prior")
print(model.models[0].mean_module.constant)
print("Learned kernel scale (variance of kernel sigma)")
print(model.models[0].covar_module.outputscale)
print("Learned kernel length scales (one for each input) \ell")
print(model.models[0].covar_module.base_kernel.lengthscale)

# Output 2 y_{b}
print("- - - - - - - - - \nModel 1b\n- - - - - - - - - ")
print("Learned Noise Covariance")
print(model.models[1].likelihood.noise_covar.noise)
print("Learned constant mean for the prior")
print(model.models[1].mean_module.constant)
print("Learned kernel scale (variance of kernel sigma)")
print(model.models[1].covar_module.outputscale)
print("Learned kernel length scales (one for each input) \ell")
print(model.models[1].covar_module.base_kernel.lengthscale)

# MT Model
print("- - - - - - - - - \nModel 2 (MultiTask=MT)\n- - - - - - - - - ")
print("Learned Noise Covariance")
print(MTmodel.likelihood.noise)
print("Learned constant mean for the prior comp. 1")
print(MTmodel.mean_module.base_means[0].constant)
print("Learned constant mean for the prior comp. 2")
print(MTmodel.mean_module.base_means[1].constant)
print("Learned static Index Kernel/Fixed Covariance K_{TT} matrix")
print(MTmodel.covar_module.task_covar_module.covar_factor)
print("Learned kernel length scales (one for each input) \ell")
print(MTmodel.covar_module.data_covar_module.lengthscale)

# Set models into eval mode
model.eval()
MTmodel.eval()
likelihood.eval()
MTlikelihood.eval()

# Shift this distance away from a train data point
shift = 0.05

# Train data-point
ux1 = train_x[1:2, :]
ux2 = train_x[1:2, :] + shift
# Performing inference on the training data points themselves
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    # Get distributions of type multivariate-normal
    prediction_dist1 = likelihood(*model(ux1, ux1))
    prediction_dist2 = likelihood(*model(ux2, ux2))
    # Get distributions of type multitask multivariate normal
    prediction_dist3 = MTlikelihood(MTmodel(ux1))
    prediction_dist4 = MTlikelihood(MTmodel(ux2))

print("Indp Model List Mean and Variance on a Train Point")
print('mean: ', prediction_dist1[0].mean.detach().numpy(), prediction_dist1[1].mean.detach().numpy())
print('vars: ', prediction_dist1[0].covariance_matrix.detach().numpy(), prediction_dist1[1].covariance_matrix.detach().numpy())
print('------')
print("Indp Model List Mean and Variance Nearby a Train Point")
print('mean: ', prediction_dist2[0].mean.detach().numpy(), prediction_dist2[1].mean.detach().numpy())
print('vars: ', prediction_dist2[0].covariance_matrix.detach().numpy(), prediction_dist2[1].covariance_matrix.detach().numpy())
print('------')
print("MT-Model Mean and Variance on a Train Point")
print('mean: ', prediction_dist3.mean.detach().numpy())
print('vars:\n', prediction_dist3.covariance_matrix.detach().numpy())
print('------')
print("MT-Model Mean and Variance Nearby a Train Point")
print('mean: ', prediction_dist4.mean.detach().numpy())
print('vars:\n', prediction_dist4.covariance_matrix.detach().numpy())
print('------')
print("Actual Data Point (True Label)")
print(train_y[1:2, :])

Stack trace/error message

Training Ind. Model List
- - - - - - - - - - 
/home/ishank/Desktop/Gaussian-Process-Dynamics/venv/lib/python3.8/site-packages/gpytorch/utils/linear_cg.py:266: UserWarning: An output with one or more elements was resized since it had shape [11], which does not match the required output shape [1, 11].This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at  ../aten/src/ATen/native/Resize.cpp:23.)
  _jit_linear_cg_updates_no_precond(
Iter 5/50 - Loss: 0.652
Iter 10/50 - Loss: 0.430
Iter 15/50 - Loss: 0.200
Iter 20/50 - Loss: -0.039
Iter 25/50 - Loss: -0.280
Iter 30/50 - Loss: -0.533
Iter 35/50 - Loss: -0.793
Iter 40/50 - Loss: -1.047
Iter 45/50 - Loss: -1.304
Iter 50/50 - Loss: -1.555
- - - - - - - - - - 
Training MT Model
- - - - - - - - - - 
Iter 5/50 - Loss: 0.991
Iter 10/50 - Loss: 0.768
Iter 15/50 - Loss: 0.540
Iter 20/50 - Loss: 0.301
Iter 25/50 - Loss: 0.055
Iter 30/50 - Loss: -0.197
Iter 35/50 - Loss: -0.454
Iter 40/50 - Loss: -0.711
Iter 45/50 - Loss: -0.968
Iter 50/50 - Loss: -1.224
- - - - - - - - - - 
- - - - - - - - - 
Model 1a
- - - - - - - - - 
Learned Noise Covariance
tensor([0.0055], grad_fn=<AddBackward0>)
Learned constant mean for the prior
Parameter containing:
tensor([-0.0137], requires_grad=True)
Learned kernel scale (variance of kernel sigma)
tensor(0.3450, grad_fn=<SoftplusBackward0>)
Learned kernel length scales (one for each input) \ell
tensor([[3.4328, 3.5159, 1.5461, 3.3662]], grad_fn=<SoftplusBackward0>)
- - - - - - - - - 
Model 1b
- - - - - - - - - 
Learned Noise Covariance
tensor([0.0055], grad_fn=<AddBackward0>)
Learned constant mean for the prior
Parameter containing:
tensor([-0.0089], requires_grad=True)
Learned kernel scale (variance of kernel sigma)
tensor(0.3588, grad_fn=<SoftplusBackward0>)
Learned kernel length scales (one for each input) \ell
tensor([[3.3585, 3.3732, 3.5119, 1.6963]], grad_fn=<SoftplusBackward0>)
- - - - - - - - - 
Model 2 (MultiTask=MT)
- - - - - - - - - 
Learned Noise Covariance
tensor([0.0055], grad_fn=<AddBackward0>)
Learned constant mean for the prior comp. 1
Parameter containing:
tensor([0.0083], requires_grad=True)
Learned constant mean for the prior comp. 2
Parameter containing:
tensor([-0.0142], requires_grad=True)
Learned static Index Kernel/Fixed Covariance K_{TT} matrix
Parameter containing:
tensor([[-0.1277,  0.5546],
        [ 0.6061,  0.2162]], requires_grad=True)
Learned kernel length scales (one for each input) \ell
tensor([[3.3520, 3.3918, 2.7205, 2.7693]], grad_fn=<SoftplusBackward0>)
Indp Model List Mean and Variance on a Train Point
mean:  [-0.66263324] [0.43995908]
vars:  [[0.00555626]] [[0.00556101]]
------
Indp Model List Mean and Variance Nearby a Train Point
mean:  [-0.61773294] [0.5056697]
vars:  [[0.00555528]] [[0.00555663]]
------
MT-Model Mean and Variance on a Train Point
mean:  [[-0.69353205  0.44850466]]
vars:
 [[0.56116694 0.04254041]
 [0.04254041 0.6265246 ]]
------
MT-Model Mean and Variance Nearby a Train Point
mean:  [[-0.6492302   0.51464087]]
vars:
 [[0.56116694 0.04254041]
 [0.04254041 0.6265246 ]]
------
Actual Data Point (True Label)
tensor([[-0.6583,  0.4381]])

Expected Behavior

The covariance matrix of the posterior obtained from the MultiTask kernel model is strangely frozen at [[0.56116694, 0.04254041], [0.04254041, 0.6265246]] for both the train data point and a shifted version of it.

I see two problems with the covariance matrix obtained from the MultiTask version.

  1. The diagonal entries, which are the variances, are quite large: much larger than the corresponding variances obtained from the Independent Model List. This is also strange to me because the K_{TT} index kernel matrix associated with the MultiTask kernel has entries smaller than 1.0 (as it should, since my understanding is that it is a static matrix of correlation coefficients), and the length-scale and noise-variance values (in the range of 1.5-3.5 and around 0.0056, respectively) are very similar for the MT model and the Ind. Model List. Since the final covariance is the composition of K_{TT} and an RBF kernel (written out just after this list), I would have expected the variances from the two models to have the same order of magnitude.
  2. They remain completely unchanged between these two points (and on other test points, for that matter).
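
For reference, the composition in question, written out (a sketch based on my reading of the MultitaskKernel docs, where k_{RBF} is the data kernel and K_{TT} the index/task covariance):

k((x, i), (x', j)) = k_{RBF}(x, x') \cdot K_{TT}[i, j], so that the full joint covariance is K = K_{RBF}(X, X) \otimes K_{TT}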


Additional context

Colab Notebook Version- https://colab.research.google.com/drive/1OalLncVeGtNHh-DqjnkScfy46uTtNud_?usp=sharing

wjmaddox commented 2 years ago

This is actually not quite a bug, but seems to be caused by your usage of fast_pred_var in the inference step, which isn't strictly necessary here.

If I remove that flag (as well as removing the effect of the likelihood), then I get sensible outputs:

ux1 = train_x[1:2, :]
ux2 = train_x[1:2, :] + shift
# Performing inference on the training data points themselves
with torch.no_grad():
    # Get distributions of type multivariate-normal
    # prediction_dist1 = likelihood(*model(ux1, ux1))
    # prediction_dist2 = likelihood(*model(ux2, ux2))
    prediction_dist1 = model(ux1, ux1)
    prediction_dist2 = model(ux2, ux2)
    # Get distributions of type multitask multivariate normal
#     prediction_dist3 = MTlikelihood(MTmodel(ux1))
#     prediction_dist4 = MTlikelihood(MTmodel(ux2))
    prediction_dist3 = MTmodel(ux1)
    prediction_dist4 = MTmodel(ux2)
MT-Model Mean and Variance on a Train Point
mean:  [[-0.6976632  0.4489181]]
vars:
 [[1.15394592e-04 1.92224979e-06]
 [1.93342566e-06 1.18136406e-04]]
------
MT-Model Mean and Variance Nearby a Train Point
mean:  [[-0.65350515  0.5149511 ]]
vars:
 [[1.10507011e-04 1.90734863e-06]
 [1.90734863e-06 1.13129616e-04]]
------
Actual Data Point (True Label)
tensor([[-0.6583,  0.4381]])

If you're wondering why exactly this fixes it, here's what I was able to trace down:

Numerically, what was happening is that the second term in the posterior covariance was dropping to zero due to something happening in the Lanczos decomposition, i.e. (K_{x_*, X} \otimes K_{TT}) (K_{X, X} \otimes K_{TT} + \Sigma)^{-1} (K_{x_*, X} \otimes K_{TT})^\top \approx 0.

Then, the predictive variance was determined entirely by the first (prior) term and the inter-task covariance matrix, and that is exactly what the reported predictive variances were matching.
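
Written out, the posterior covariance at a test point x_* has the standard two-term form

\Sigma_* = K_{x_*, x_*} \otimes K_{TT} - (K_{x_*, X} \otimes K_{TT}) (K_{X, X} \otimes K_{TT} + \Sigma)^{-1} (K_{x_*, X} \otimes K_{TT})^\top

so once the second term collapses, what remains is the prior term, whose scale is set by K_{TT}.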

I'll double-check fast_pred_var to see why these predictive variances were so far off and whether there's a bug there, but in the meantime it's probably best not to use fast_pred_var if you don't have that many test points.

ishank-juneja commented 2 years ago

I agree that fast_pred_var need not be used here, but what would be the justification for not using the likelihood? Don't we need the likelihood to get the right posterior distribution over the GP's prediction?

wjmaddox commented 2 years ago

That was mostly a tool for debugging what's going on, you ought to be able to use either here.

In general, it depends on whether you want the posterior over the latent function given the data, or the posterior over the observed function values given the data. They'll differ only by the size of the variance (for Gaussian observations).
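
A minimal self-contained sketch of the distinction (a toy 1-D model; all names here are illustrative, not from the script above):

import torch
import gpytorch

# Toy data and model purely to illustrate latent vs. observed posteriors.
train_x = torch.linspace(0, 1, 20)
train_y = torch.sin(6 * train_x)

class TinyGP(gpytorch.models.ExactGP):
    def __init__(self, x, y, likelihood):
        super().__init__(x, y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

lik = gpytorch.likelihoods.GaussianLikelihood()
m = TinyGP(train_x, train_y, lik)
m.eval()
lik.eval()

x_test = torch.linspace(0, 1, 5)
with torch.no_grad():
    latent = m(x_test)         # p(f* | data): posterior over the latent function
    observed = lik(m(x_test))  # p(y* | data): posterior over noisy observations
    # Means coincide; variances differ by the learned noise variance:
    print(observed.variance - latent.variance)  # each entry ~= lik.noise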

ishank-juneja commented 2 years ago

I am not sure if #864 is related, but I thought it was worth mentioning here since it too has to do with incorrect variances arising from the use of gpytorch.settings.fast_pred_var().

ishank-juneja commented 2 years ago

> That was mostly a tool for debugging what's going on, you ought to be able to use either here.
>
> In general, it depends on whether you want the posterior over the latent function given the data, or the posterior over the observed function values given the data. They'll differ only by the size of the variance (for Gaussian observations).

I understand this now, assuming that by "function values" you mean observations y = f(x) + eps.

wjmaddox commented 2 years ago

Yes, that's the correct understanding.

ishank-juneja commented 2 years ago

@wjmaddox, thank you for your kind help so far. I wanted to follow up about the MultiTask kernel issues I have been experiencing: I need to model correlation between the two (or, later, more) outputs, and the MultiTask kernel seems like the way to do that.

Here is an updated version of the colab notebook I shared earlier. I have now removed the fast_pred_var flag, and, as you pointed out in your initial reply on this thread, the variances of the distribution of the latent function given the data (i.e. without the likelihood) are quite similar for both the MultiTask (MT) kernel and the Independent Model List (IML).

However, I am still not sure whether I am using and understanding the MultiTask kernel correctly, because the following facts about the trained models confuse me (here model is the IML and MTmodel uses the MT kernel):

print("- - - - - - - - - \nModel 1a (IML)\n- - - - - - - - - ")
print("Learned Noise Covariance")
print(model.models[0].likelihood.noise_covar.noise)
- - - - - - - - - 
Model 1a
- - - - - - - - - 
Learned Noise Covariance
tensor([0.0055], grad_fn=<AddBackward0>)

print("- - - - - - - - - \nModel 1b (IML)\n- - - - - - - - - ")
print("Learned Noise Covariance")
print(model.models[1].likelihood.noise_covar.noise)
- - - - - - - - - 
Model 1b
- - - - - - - - - 
Learned Noise Covariance
tensor([0.0055], grad_fn=<AddBackward0>)

print("- - - - - - - - - \nModel 2 (MultiTask=MT)\n- - - - - - - - - ")
print("Learned Noise Covariance")
- - - - - - - - - 
Model 2 (MultiTask=MT)
- - - - - - - - - 
Learned Noise Covariance
tensor([0.0054], grad_fn=<AddBackward0>)

Why are the learned noise variances (\sigma_n^2) for the IML and MT-kernel model likelihoods so similar ([0.0055, 0.0055] and 0.0054, respectively), yet the effect of passing the model predictions (i.e. the latent function) through the two likelihoods (the GaussianLikelihood list for the IML and the MultitaskGaussianLikelihood for the MT-kernel model) so different?
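
For what it's worth, here is a minimal way to compare what each likelihood actually contributes (a sketch assuming the likelihood / MTlikelihood objects from the script above, that LikelihoodList exposes its members as likelihood.likelihoods, and that the default MultitaskGaussianLikelihood(rank=0) learns per-task task_noises in addition to the global noise, which is my reading of the API):

# Per-output noises of the independent likelihood list
print([lik.noise for lik in likelihood.likelihoods])
# Global noise of the multitask likelihood
print(MTlikelihood.noise)
# Per-task noises of the multitask likelihood (present with the default rank=0)
print(MTlikelihood.task_noises)
# If the effective MT per-task noise is roughly noise + task_noises[t], then
# comparing only MTlikelihood.noise against the IML noises understates what
# the MT likelihood actually adds to the latent covariance.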