palak-purohit opened this issue 2 years ago
The rough answer to all of these issues is the non-convexity of the loss function (e.g. the negative MLL) with respect to these parameters, which is why L-BFGS-B performs better than SGD here. Quasi-Newton methods tend to perform better on some types of non-convex problems.
The local minimum visible in the profile plot you displayed is why SGD converges to something sub-optimal.
To improve the optimization results, I would try a constant learning rate with momentum, or an adaptive method like Adam (which is used in most of the tutorials). I suspect that some kind of adaptivity could help you push past the local minimum.
From an optimization perspective, with n = 1024, you're using CG and Lanczos (well, mostly CG) to estimate the gradients of these parameters. There's a known but slight bias in the gradients when using CG there; see section 3.1 of this paper and this paper. To improve the optimization results while still using SGD, you can try
a) changing the maximum cholesky size, see https://docs.gpytorch.ai/en/stable/settings.html#gpytorch.settings.max_cholesky_size
b) decreasing the CG tolerance to something like 0.01 or less, see https://docs.gpytorch.ai/en/stable/settings.html#gpytorch.settings.cg_tolerance.
Either of those will probably produce a smoother plot for the lengthscale parameter, especially the Cholesky option.
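A sketch of how those two settings are applied as context managers around the training loop (the 2048 threshold and the 0.01 tolerance are just example values):

```python
import gpytorch

# (a) raise the Cholesky threshold so n = 1024 gets exact solves instead of CG
with gpytorch.settings.max_cholesky_size(2048):
    ...  # run the training loop here

# (b) or keep CG but tighten its tolerance
with gpytorch.settings.cg_tolerance(0.01):
    ...  # run the training loop here
```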
We created a synthetic 1-D dataset. The code for generating it is shown below. Thus, the true parameters of the dataset are lengthscale = 0.5, noise variance = 1, and outputscale = 4.
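The original generation snippet didn't carry over here, but a minimal sketch with those true parameters, assuming an RBF kernel and inputs on [0, 1] (both assumptions, not the original code), might look like:

```python
import numpy as np

def make_dataset(n=1024, lengthscale=0.5, outputscale=4.0, noise_var=1.0, seed=0):
    """Draw y ~ N(0, outputscale * K_rbf + noise_var * I) at n inputs."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)  # input range is an assumption
    sq_dists = (x[:, None] - x[None, :]) ** 2
    K = outputscale * np.exp(-0.5 * sq_dists / lengthscale**2) + noise_var * np.eye(n)
    # sample via the Cholesky factor (small jitter for numerical stability)
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))
    y = L @ rng.standard_normal(n)
    return x, y

x, y = make_dataset()
```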
We then trained an exact GP model in GPyTorch. During training, `requires_grad` for the lengthscale and mean constant was set to `False` (that is, they were fixed to their true values and were not trainable). The outputscale and noise variance were initialized to 5 and 3, respectively.
Training using a 1st-order optimizer (SGD, with the learning rate decayed each iteration as lr = lr0 / k):
This is how the convergence of parameters with each iteration looks:
It can be seen that although the noise variance converges to its true value (= 1), the outputscale does not.
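For concreteness, here is a from-scratch sketch of that first-order setup: an exact negative MLL with softplus-constrained outputscale and noise, trained with SGD under a 1/k learning-rate decay. The n = 128 grid on [0, 1], lr0 = 0.01, and all names here are illustrative assumptions, not the original code:

```python
import math
import torch

torch.manual_seed(0)
n = 128  # assumed; the issue used n = 1024
x = torch.linspace(0.0, 1.0, n, dtype=torch.float64)
K_unit = torch.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.5**2)  # RBF, lengthscale 0.5
eye = torch.eye(n, dtype=torch.float64)
y = torch.linalg.cholesky(4.0 * K_unit + eye) @ torch.randn(n, dtype=torch.float64)

# raw parameters mapped through softplus to stay positive; init outputscale = 5, noise = 3
raw = torch.tensor(
    [math.log(math.exp(5.0) - 1.0), math.log(math.exp(3.0) - 1.0)],
    dtype=torch.float64, requires_grad=True,
)

def neg_mll():
    out, noise = torch.nn.functional.softplus(raw)
    K = out * K_unit + noise * eye
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(y.unsqueeze(-1), L).squeeze(-1)
    # 0.5 * y^T K^{-1} y + 0.5 * log|K| + 0.5 * n * log(2 pi)
    return 0.5 * (y @ alpha) + torch.log(torch.diagonal(L)).sum() + 0.5 * n * math.log(2 * math.pi)

lr0 = 0.01
opt = torch.optim.SGD([raw], lr=lr0)
losses = []
for k in range(1, 201):
    opt.zero_grad()
    loss = neg_mll()
    loss.backward()
    for group in opt.param_groups:
        group["lr"] = lr0 / k  # per-iteration decay, as described above
    opt.step()
    losses.append(loss.item())
```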
Training using a 2nd-order optimizer (LBFGS):
This is how the convergence of parameters with each iteration looks:
Both hyperparameters converge to approximately their true values.
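The second-order run can be sketched the same way, using `torch.optim.LBFGS` with a gradient-recomputing closure. Again, the grid size, learning rate, and parameter names are illustrative assumptions, not the original code:

```python
import math
import torch

torch.manual_seed(0)
n = 128  # assumed grid
x = torch.linspace(0.0, 1.0, n, dtype=torch.float64)
K_unit = torch.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.5**2)  # RBF, lengthscale 0.5
eye = torch.eye(n, dtype=torch.float64)
y = torch.linalg.cholesky(4.0 * K_unit + eye) @ torch.randn(n, dtype=torch.float64)

# softplus-constrained raw parameters; init outputscale = 5, noise = 3
raw = torch.tensor(
    [math.log(math.exp(5.0) - 1.0), math.log(math.exp(3.0) - 1.0)],
    dtype=torch.float64, requires_grad=True,
)

def neg_mll():
    out, noise = torch.nn.functional.softplus(raw)
    K = out * K_unit + noise * eye
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(y.unsqueeze(-1), L).squeeze(-1)
    return 0.5 * (y @ alpha) + torch.log(torch.diagonal(L)).sum() + 0.5 * n * math.log(2 * math.pi)

loss_before = neg_mll().item()
opt = torch.optim.LBFGS([raw], lr=1.0, max_iter=100, line_search_fn="strong_wolfe")

def closure():
    # LBFGS re-evaluates the loss and gradient multiple times per step
    opt.zero_grad()
    loss = neg_mll()
    loss.backward()
    return loss

opt.step(closure)
loss_after = neg_mll().item()
outputscale, noise_var = torch.nn.functional.softplus(raw).tolist()
```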
The plot of the negative marginal log-likelihood against each parameter's value (keeping the other parameters fixed at their true values) looks like this:
![](https://user-images.githubusercontent.com/70364627/161385684-175f3a03-cc44-4cba-9b18-0ffc29f57be1.png)
Note that this plot is independent of the optimizer, since no training is involved, just the computation of the loss.
While the lengthscale and noise variance have global minima at their respective true values, this is not the case for the outputscale.
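A profile like that can be traced by sweeping one hyperparameter while pinning the others at their true values. A small NumPy sketch of the outputscale sweep follows; the data is regenerated with an assumed n = 128 grid on [0, 1], so the exact curve will differ from the plot above:

```python
import numpy as np

# regenerate synthetic data (assumed grid; the real dataset used n = 1024)
rng = np.random.default_rng(0)
n = 128
x = np.linspace(0.0, 1.0, n)
K_unit = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.5**2)  # RBF, lengthscale 0.5
y = np.linalg.cholesky(4.0 * K_unit + np.eye(n)) @ rng.standard_normal(n)

def nll(outputscale, noise_var=1.0):
    """Exact negative MLL with lengthscale and mean fixed at their true values."""
    K = outputscale * K_unit + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(K, y)
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * n * np.log(2 * np.pi)

# one point per candidate outputscale, everything else held fixed
scales = np.linspace(0.5, 10.0, 50)
profile = np.array([nll(s) for s in scales])
```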