cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch
MIT License

Deep Gaussian Process for classification #1015

Open anh-tong opened 4 years ago

anh-tong commented 4 years ago

Hi,

I wonder if there are any examples for Deep GP for classification.

I built a model following the regression example. However, on classification tasks it seems the model cannot be learned. There are known pathologies with deep GPs that have many layers (https://arxiv.org/abs/1402.5836), but in this case the number of layers is just 2.

Here is the example code. https://colab.research.google.com/drive/1bVcFLVcMdOQm2fo0AiwR-kZOQQoSllm3

Thanks.

anh-tong commented 4 years ago

Actually, even a single-layer variational GP only reaches about 10% accuracy on MNIST (i.e., chance level).

https://colab.research.google.com/drive/1jxCUKIgZHbNtkE5qttcgmnQv7O-bya4R

Do you have any idea what's wrong with this code? Is this because of SoftmaxLikelihood?
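
For reference, a minimal sketch of how SoftmaxLikelihood is typically wired up with a deep GP objective in GPyTorch (num_gp_features, num_train, and model are placeholder names, not anything from the notebook):

from gpytorch.likelihoods import SoftmaxLikelihood
from gpytorch.mlls import DeepApproximateMLL, VariationalELBO

# One GP output per feature; the likelihood mixes these into the 10 class logits.
likelihood = SoftmaxLikelihood(num_features=num_gp_features, num_classes=10)

# For deep GPs the ELBO is wrapped in DeepApproximateMLL so that the
# per-layer sampling is handled correctly.
mll = DeepApproximateMLL(VariationalELBO(likelihood, model, num_data=num_train))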

gpleiss commented 4 years ago

This is not a bug in GPyTorch. The problem is that it is difficult to optimize the inducing point locations for a 784-dimensional function - especially when they are randomly initialized.

Try initializing the inducing points of the single-layer SVGP (or the first layer of the deep GP) to be some training data samples, or run k-means clustering with num_inducing clusters and use the cluster centroids as the inducing point initialization. When I do this on your example, the loss goes down rapidly after ~2 epochs.
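
For the k-means variant, a rough sketch (assuming scikit-learn is available and train_x holds the flattened training inputs):

import torch
from sklearn.cluster import KMeans

num_inducing = 128
# Cluster the training inputs and use the centroids as the initial inducing points.
kmeans = KMeans(n_clusters=num_inducing, n_init=10).fit(train_x.cpu().numpy())
init_inducing_points = torch.as_tensor(
    kmeans.cluster_centers_, dtype=train_x.dtype, device=train_x.device
)
# Pass init_inducing_points to the model / first layer instead of torch.randn(...).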

JanSochman commented 4 years ago

@anh-tong have you succeeded in getting your model to train? I arrived at exactly the same situation today, and even after adding k-means initialisation I wasn't able to make the model converge... Would you mind sharing how exactly you do the initialisation?

gpleiss commented 4 years ago

@JanSochman just try initializing the inducing points to a random subset of data first. That'll get you most of the way there.
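
For example (a sketch; train_x is assumed to be the flattened training inputs):

import torch

num_inducing = 128
# Pick a random subset of the training inputs as the initial inducing points.
perm = torch.randperm(train_x.size(0))[:num_inducing]
init_inducing_points = train_x[perm].clone()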

JanSochman commented 4 years ago

That was my first try before k-means :) I passed them through Normal_DeepGP into the first layer and initialised the inducing points there. It gives me 11% accuracy, stagnating at that value for 10 epochs.

This is the affected part:

import torch
from gpytorch.models.deep_gps import DeepGPLayer

class HiddenLayer(DeepGPLayer):

    def __init__(self, input_dims, output_dims, num_inducing=128, init_inducing_points=None):
        if output_dims is None:
            # Final layer with scalar output: no batch dimension.
            inducing_points = torch.randn(num_inducing, input_dims)
            if torch.cuda.is_available():
                inducing_points = inducing_points.cuda()
            batch_shape = torch.Size([])
        else:
            if init_inducing_points is None:
                inducing_points = torch.randn(output_dims, num_inducing, input_dims)
            else:
                # <-- HERE: note that .expand() returns a view sharing storage across
                # output_dims; since this tensor later becomes a learnable parameter,
                # .expand(...).clone() (or .repeat()) is likely safer.
                inducing_points = init_inducing_points.expand(output_dims, -1, -1)
            if torch.cuda.is_available():
                inducing_points = inducing_points.cuda()
            batch_shape = torch.Size([output_dims])
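
For reference, the rest of __init__ presumably follows the GPyTorch deep GP tutorial, where the inducing points above are registered as a learnable parameter of the variational strategy (a sketch; the mean/kernel choices are illustrative):

        # Sketch of the remaining __init__ (imports of CholeskyVariationalDistribution,
        # VariationalStrategy, ConstantMean, ScaleKernel, RBFKernel assumed at module level).
        variational_distribution = CholeskyVariationalDistribution(
            num_inducing_points=num_inducing, batch_shape=batch_shape
        )
        variational_strategy = VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy, input_dims, output_dims)
        self.mean_module = ConstantMean(batch_shape=batch_shape)
        self.covar_module = ScaleKernel(
            RBFKernel(batch_shape=batch_shape, ard_num_dims=input_dims),
            batch_shape=batch_shape,
        )
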
anh-tong commented 4 years ago

@anh-tong have you succeeded in getting your model to train? I arrived at exactly the same situation today, and even after adding k-means initialisation I wasn't able to make the model converge... Would you mind sharing how exactly you do the initialisation?

I haven't gotten any improvement with k-means inducing point initialisation yet. I'll definitely come back to this when I have some time. Please let me know if you get a working model.

JanSochman commented 4 years ago

@gpleiss Can we re-open this issue? It does not seem to be solved yet. Neither of us was able to implement your advice. Could you please add a more detailed description of your solution? Thanks a lot!

gpleiss commented 4 years ago

Neither of us was able to implement your advice.

With respect to deep GPs or with respect to single-layer GPs? I can get a single-layer GP to train.

The initialization strategies that I proposed are designed more for single-layer GPs, not deep GPs. Do you know if anyone has gotten deep GPs to train on MNIST? My guess is that a model using an RBF, Matern, or most other standard kernels is not going to be easy to learn on MNIST.

anh-tong commented 4 years ago

I am debugging the problem with the RBF kernel. It seems like the lengthscales of the RBF kernel are not being learned: the raw_lengthscale parameter has a gradient close to 0. For example,

print(model.covar_module.base_kernel.raw_lengthscale.grad)
print(model.covar_module.raw_outputscale.grad)

The output is

tensor([[-3.0943e-23]], device='cuda:0')
tensor(0.2421, device='cuda:0')

gpleiss commented 4 years ago

@anh-tong - if you were to initialize the lengthscale to some extreme value (e.g. 10^2 or 10^{-2}) do you see similar gradients?
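
For reference, the lengthscale can be set directly before training (a sketch using the same module path as the prints above):

# Initialize the RBF lengthscale to an extreme value and re-check the gradient
# of raw_lengthscale after the next backward pass.
model.covar_module.base_kernel.initialize(lengthscale=1e2)
# or, via the property setter: model.covar_module.base_kernel.lengthscale = 1e-2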

Herding commented 4 years ago

@anh-tong - if you were to initialize the lengthscale to some extreme value (e.g. 10^2 or 10^{-2}) do you see similar gradients?

Does __call__() in the class affect how the program executes? Can you replace __call__() with forward()?

gpleiss commented 4 years ago

You need to use __call__(). It does additional processing.
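
For example (a small illustration; layer and x are placeholders):

output = layer(x)            # goes through __call__, which does the extra processing
# output = layer.forward(x)  # skips that processing; avoid calling forward() directly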

joseduc10 commented 3 years ago

I have encountered issues similar to the OP's. When I train a DeepGP for classification with an RBF kernel, the lengthscales do not change at all.

gpleiss commented 3 years ago

@joseduc10 can you please provide a code example?