cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch
MIT License

Deep Kernel Transfer Regression with spectral kernel [Error] #1745

Closed ZohrehAdabi closed 3 years ago

ZohrehAdabi commented 3 years ago

Hi,

I'm running the DKT code for few-shot regression on the QMUL dataset [DKT code]. With the RBF kernel the code runs, but with the spectral kernel I get this error at self.model(z):

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ghiasi/anaconda3/lib/python3.8/site-packages/gpytorch/models/exact_gp.py", line 256, in __call__
    raise RuntimeError("You must train on the training inputs!")
RuntimeError: You must train on the training inputs!

Is this problem related to a gpytorch update? Thanks in advance.

wjmaddox commented 3 years ago

This is because you're trying to train the model (aka using the marginal log likelihood) without putting the model in train mode.

Try calling model.train() in your code.

ZohrehAdabi commented 3 years ago

I set the model to train mode, but the error still occurs. In the first iteration, this is the value of z:

tensor([[0.0229, 0.0217, 0.0258,  ..., 0.0096, 0.0091, 0.0084],
        [0.0206, 0.0279, 0.0259,  ..., 0.0043, 0.0091, 0.0085],
        [0.0222, 0.0269, 0.0273,  ..., 0.0000, 0.0060, 0.0090],
        ...,
        [0.0227, 0.0229, 0.0208,  ..., 0.0046, 0.0093, 0.0043],
        [0.0280, 0.0261, 0.0273,  ..., 0.0122, 0.0043, 0.0055],
        [0.0285, 0.0246, 0.0249,  ..., 0.0046, 0.0056, 0.0045]],
       device='cuda:0', grad_fn=<ViewBackward>)

but the kernel is

self.model.covar_module(z)
tensor([[inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]],
       device='cuda:0', grad_fn=<ProdBackward1>)

Therefore loss = -self.mll(predictions, self.model.train_targets) becomes nan, then in the next iteration z is nan and the "RuntimeError: You must train on the training inputs!" appears. Why is the kernel value inf?
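
One plausible reading of the second-iteration failure, sketched with plain PyTorch: NaN never compares equal to NaN, so an element-wise equality check between the stored training inputs and a NaN-filled z fails even when both were built from the same values. The sketch assumes the train-mode check behaves like torch.equal. The helper check_finite is hypothetical and only illustrates catching the problem in the iteration where it first appears.

```python
import torch

# A feature tensor that already contains a NaN, as z does in the second iteration.
z = torch.tensor([[0.02, float("nan")], [0.03, 0.01]])
stored = z.clone()  # stands in for the copy kept as training inputs

# NaN != NaN, so the comparison fails although both tensors hold the same data.
same = torch.equal(stored, z)
print(same)


def check_finite(name, tensor):
    """Raise as soon as a tensor contains NaN or inf (hypothetical helper)."""
    if not torch.isfinite(tensor).all():
        raise FloatingPointError(f"{name} contains NaN or inf")


# Guarding z, the kernel matrix, and the loss this way reports the first bad value
# directly instead of a misleading error one iteration later.
caught = False
try:
    check_finite("z", z)
except FloatingPointError:
    caught = True
print(caught)
```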

wjmaddox commented 3 years ago

Can you post a full example?

ZohrehAdabi commented 3 years ago

I get the same error in this example when I use feature_extractor(inputs) rather than a random z.

import torch
import torch.nn as nn
import torch.nn.functional as F
import gpytorch
class Conv3(nn.Module):
    def __init__(self):
        super(Conv3, self).__init__()
        self.layer1 = nn.Conv2d(3, 36, 3,stride=2,dilation=2)
        self.layer2 = nn.Conv2d(36,36, 3,stride=2,dilation=2)
        self.layer3 = nn.Conv2d(36,36, 3,stride=2,dilation=2)

    def return_clones(self):
        layer1_w = self.layer1.weight.data.clone().detach()
        layer2_w = self.layer2.weight.data.clone().detach()
        layer3_w = self.layer3.weight.data.clone().detach()
        return [layer1_w, layer2_w, layer3_w]

    def assign_clones(self, weights_list):
        self.layer1.weight.data.copy_(weights_list[0])
        self.layer2.weight.data.copy_(weights_list[1])
        self.layer3.weight.data.copy_(weights_list[2])

    def forward(self, x):

        out = F.relu(self.layer1(x))
        out = F.relu(self.layer2(out))
        out = F.relu(self.layer3(out))
        out = out.view(out.size(0), -1)
        return out

class ExactGPLayer(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, kernel='linear'):
        super(ExactGPLayer, self).__init__(train_x, train_y, likelihood)
        self.mean_module  = gpytorch.means.ConstantMean()

        ## RBF kernel
        if(kernel=='rbf' or kernel=='RBF'):
            self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
        ## Spectral kernel
        elif(kernel=='spectral'):
            self.covar_module = gpytorch.kernels.SpectralMixtureKernel(num_mixtures=4, ard_num_dims=2916)
        else:
            raise ValueError("[ERROR] the kernel '" + str(kernel) + "' is not supported for regression, use 'rbf' or 'spectral'.")

    def forward(self, x):
        mean_x  = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

class DKT(nn.Module):
    def __init__(self, backbone):
        super(DKT, self).__init__()
        ## GP parameters
        self.feature_extractor = backbone.cuda()
        self.get_model_likelihood_mll() #Init model, likelihood, and mll

    def get_model_likelihood_mll(self, train_x=None, train_y=None):
        if(train_x is None): train_x=torch.ones(19, 2916).cuda()
        if(train_y is None): train_y=torch.ones(19).cuda()

        likelihood = gpytorch.likelihoods.GaussianLikelihood()
        model = ExactGPLayer(train_x=train_x, train_y=train_y, likelihood=likelihood, kernel='spectral')

        self.model      = model.cuda()
        self.likelihood = likelihood.cuda()
        self.mll        = gpytorch.mlls.ExactMarginalLogLikelihood(self.likelihood, self.model).cuda()
        self.mse        = nn.MSELoss()

        return self.model, self.likelihood, self.mll

    def train_loop(self, optimizer):

        self.model.train()
        self.feature_extractor.train()
        self.likelihood.train()
        for i in range(5):
            inputs=torch.rand(19, 3, 100, 100).cuda()
            labels=torch.rand(19).cuda()
            z = self.feature_extractor(inputs)
            self.model.set_train_data(inputs=z, targets=labels)
            predictions = self.model(z)
            loss = -self.mll(predictions, self.model.train_targets)

            optimizer.zero_grad()  # clear gradients accumulated in the previous step
            loss.backward()
            optimizer.step()

model = DKT(Conv3())
optimizer = torch.optim.Adam([{'params': model.model.parameters(), 'lr': 0.001},
                        {'params': model.feature_extractor.parameters(), 'lr': 0.001}])
for e in range(5):
    model.train_loop(optimizer)

wjmaddox commented 3 years ago

Can you share more details about your system setup?

I was getting Cholesky / NaN errors until I added the following line to your GP model class. It also worked fine for me with the RBF kernel.

class ExactGPLayer(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, kernel='linear'):
        super(ExactGPLayer, self).__init__(train_x, train_y, likelihood)
        self.mean_module  = gpytorch.means.ConstantMean()

        ## RBF kernel
        if(kernel=='rbf' or kernel=='RBF'):
            self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
        ## Spectral kernel
        elif(kernel=='spectral'):
            self.covar_module = gpytorch.kernels.SpectralMixtureKernel(num_mixtures=4, ard_num_dims=2916)
            # NEW LINE ADDED FOR STABILITY
            self.covar_module.initialize_from_data_empspect(train_x, train_y)
        else:
            raise ValueError("[ERROR] the kernel '" + str(kernel) + "' is not supported for regression, use 'rbf' or 'spectral'.")

    def forward(self, x):
        mean_x  = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

ZohrehAdabi commented 3 years ago

I run the code on Linux with Python 3.8.5, gpytorch 1.5.0, and PyTorch 1.9.0. Thank you very much. I added the line you suggested and now it runs. Is the issue just related to the initialization of the kernel?

wjmaddox commented 3 years ago

Probably so. Initialization of SM kernels plays a pretty huge role in performance, which is why that method exists.

ZohrehAdabi commented 3 years ago

Thanks for the note about SM kernel and your quick response.