anh-tong opened this issue 4 years ago
Actually, a single-layer variational GP gets only 10% accuracy on MNIST (chance level).
https://colab.research.google.com/drive/1jxCUKIgZHbNtkE5qttcgmnQv7O-bya4R
Do you have any idea what's wrong with this code?
Is this because of SoftmaxLikelihood?
This is not a bug with GPyTorch. The problem is that it is difficult to optimize the inducing point locations for a 784 dimensional function - especially when they are randomly initialized.
Try initializing the inducing points of the single-layer SVGP (or the first layer of the deep GP) to some training data samples, or run k-means clustering with num_inducing clusters and use the cluster centroids as inducing-point initializations. When I do this on your example, the loss goes down rapidly after ~2 epochs.
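A minimal sketch of both options, assuming train_x is an (N, 784) tensor of flattened MNIST inputs; the helper name is illustrative, not part of GPyTorch:

import torch
from sklearn.cluster import KMeans

def make_inducing_points(train_x, num_inducing=128, use_kmeans=True):
    if use_kmeans:
        # k-means centroids over (a subset of) the training inputs
        km = KMeans(n_clusters=num_inducing, n_init=10)
        km.fit(train_x[:10000].cpu().numpy())
        return torch.as_tensor(km.cluster_centers_, dtype=train_x.dtype)
    # otherwise: a random subset of the training inputs
    idx = torch.randperm(train_x.size(0))[:num_inducing]
    return train_x[idx].clone()

The returned tensor is then passed as the inducing_points argument when building the VariationalStrategy (or the first deep GP layer).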
@anh-tong have you succeeded in getting your model to train? I ran into exactly the same situation today, and even after adding k-means initialisation I wasn't able to make the model converge... Would you mind sharing how exactly you do the initialisation?
@JanSochman just try initializing the inducing points to a random subset of the data first. That'll get you most of the way there.
That was my first try before trying k-means :) I passed them through Normal_DeepGP into the first layer and initialised the inducing points there. That gives me 11% accuracy, stagnating at that value for 10 epochs.
This is the affected part:
class HiddenLayer(DeepGPLayer):
    def __init__(self, input_dims, output_dims, num_inducing=128, init_inducing_points=None):
        if output_dims is None:
            # single-output layer: inducing points carry no batch dimension
            inducing_points = torch.randn(num_inducing, input_dims)
            if torch.cuda.is_available():
                inducing_points = inducing_points.cuda()
            batch_shape = torch.Size([])
        else:
            if init_inducing_points is None:
                # random initialization, one set of inducing points per output dimension
                inducing_points = torch.randn(output_dims, num_inducing, input_dims)
            else:
                # broadcast the provided (num_inducing, input_dims) initialization
                # across all output dimensions
                inducing_points = init_inducing_points.expand(output_dims, -1, -1)  # <-- HERE
            if torch.cuda.is_available():
                inducing_points = inducing_points.cuda()
            batch_shape = torch.Size([output_dims])
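One thing worth double-checking in the marked line: expand returns a view in which every output dimension shares the same underlying storage, so the per-dimension inducing points stay tied to one another. A sketch of an alternative that gives each output dimension its own copy, assuming init_inducing_points has shape (num_inducing, input_dims):

# repeat() materialises independent copies, one per output dimension,
# instead of the shared view produced by expand()
inducing_points = init_inducing_points.unsqueeze(0).repeat(output_dims, 1, 1)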
@anh-tong have you succeeded in getting your model to train? I ran into exactly the same situation today, and even after adding k-means initialisation I wasn't able to make the model converge... Would you mind sharing how exactly you do the initialisation?
I haven't got any improvement with k-means inducing-point initialisation yet. I will definitely come back to this if I have some time. Please let me know if you get a working model.
@gpleiss Can we re-open this issue? It does not seem to be solved yet. Neither of us was able to implement your advice. Could you please add a more detailed description of your solution? Thanks a lot!
Neither of us was able to implement your advice.
With respect to deep GPs or with respect to single-layer GPs? I can get a single-layer GP to train.
The initialization strategies that I proposed are designed more for single-layer GPs, not deep GPs. Do you know if someone has gotten deep GPs to train on MNIST? My guess is that an RBF/Matern (or most other kernels) on raw MNIST pixels is not going to be an easy model to learn.
I am debugging the problem with the RBF kernel. It seems like the lengthscales of the RBF kernel are not learned. The parameter raw_lengthscale has a gradient close to 0. For example,
print(model.covar_module.base_kernel.raw_lengthscale.grad)
print(model.covar_module.raw_outputscale.grad)
The output is
tensor([[-3.0943e-23]], device='cuda:0')
tensor(0.2421, device='cuda:0')
@anh-tong - if you were to initialize the lengthscale to some extreme value (e.g. 10^2 or 10^{-2}) do you see similar gradients?
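A quick way to try that, assuming model, mll, and a batch (train_x, train_y) come from the existing training loop and the kernel is a ScaleKernel wrapping an RBFKernel:

# set the lengthscale to an extreme value via the constrained setter,
# then recompute the gradient on raw_lengthscale
model.covar_module.base_kernel.lengthscale = 1e2   # or 1e-2
model.zero_grad()
loss = -mll(model(train_x), train_y)
loss.backward()
print(model.covar_module.base_kernel.raw_lengthscale.grad)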
Does __call__() in the class affect how the program executes? Can __call__() be replaced with forward()?
You need to use __call__(). It does additional processing beyond forward(): it draws samples from the previous layer's output distribution and applies the variational strategy, so you should not call forward() directly.
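A minimal illustration of the difference with a toy single-output DeepGPLayer (the class, names, and sizes here are made up for the example):

import torch
import gpytorch
from gpytorch.models.deep_gps import DeepGPLayer
from gpytorch.variational import CholeskyVariationalDistribution, VariationalStrategy

class ToyLayer(DeepGPLayer):
    def __init__(self, input_dims, num_inducing=16):
        inducing_points = torch.randn(num_inducing, input_dims)
        variational_distribution = CholeskyVariationalDistribution(num_inducing)
        variational_strategy = VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy, input_dims, None)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        # forward() only builds the prior distribution at x
        return gpytorch.distributions.MultivariateNormal(self.mean_module(x), self.covar_module(x))

layer = ToyLayer(input_dims=2)
x = torch.randn(10, 2)
posterior = layer(x)       # __call__: runs the variational strategy (approximate posterior)
prior = layer.forward(x)   # bypasses it entirely, so don't call forward() directly in training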
I have encountered issues similar to the OP's. When I train a DeepGP for classification with an RBF kernel, the lengthscales do not change at all.
@joseduc10 can you please provide a code example?
Hi,
I wonder if there are any examples of a Deep GP for classification.
I built code following the regression example. However, on classification tasks it seems like the model cannot be learned. There are known issues with Deep GPs with very many layers (https://arxiv.org/abs/1402.5836), but in this case the number of layers is just 2.
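For reference, the classification setup differs from the regression tutorial mainly in the likelihood and objective; a rough sketch of that wiring, assuming a DeepGP model whose final layer has output_dims equal to the number of classes (model, train_loader, optimizer and num_data are placeholders, not code from the notebook):

import gpytorch
from gpytorch.likelihoods import SoftmaxLikelihood
from gpytorch.mlls import DeepApproximateMLL, VariationalELBO

num_classes = 10
likelihood = SoftmaxLikelihood(num_features=num_classes, num_classes=num_classes, mixing_weights=False)
mll = DeepApproximateMLL(VariationalELBO(likelihood, model, num_data=num_data))

for x_batch, y_batch in train_loader:
    optimizer.zero_grad()
    with gpytorch.settings.num_likelihood_samples(8):
        output = model(x_batch)        # MultitaskMultivariateNormal over num_classes latent functions
        loss = -mll(output, y_batch)   # y_batch holds integer class labels
    loss.backward()
    optimizer.step()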
Here is the example code. https://colab.research.google.com/drive/1bVcFLVcMdOQm2fo0AiwR-kZOQQoSllm3
Thanks.