Convergence of parameters on a synthetic dataset

We created a synthetic 1-D dataset that can be described as the following:

We consider n =1, 024, xi i.i.d. ∼ N (0, 25) and yn ∼ N (0, σf*2 Kf,n + σe*2 In). Kf,n is an RBF kernel matrix with known lengthscale l = 0.5.

The code for generating the dataset is shown below:

def create_dataset(r_seed=1,N=1024):
    torch.manual_seed(0)
    x_dist = Normal(torch.tensor([0.0]), torch.tensor([5.0]))
    X = x_dist.sample((N,))
    K = ScaleKernel(RBFKernel())
    K.base_kernel.lengthscale = 0.5
    K.outputscale = 4
    K_ = K(X,X) + (1**2)*torch.eye(len(X))
    dist = MultivariateNormal(torch.zeros((1024,)), K_.evaluate())
    torch.manual_seed(r_seed)
    Y = dist.sample()
    return X, Y
create_dataset(9)

Thus, the true parameters of the created dataset are lengthscale = 0.5, noise variance = 1 and output scale = 4.

We then trained an EGP model in GPytorch. During the training process, the .grad of length scale and mean constant were set to false (that is, they were fixed to their true values and were not trainable). The outputscale and noise variance are initialized with values 5 and 3, respectively.

Training using 1st order optimizer (SGD, lr decayed in each iteration using lr = lr/k):
```
lr = 9
optimizer = torch.optim.SGD(model.parameters(),lr=lr)
for i in range(50): 
output = model(X.to(device))
loss = -mll(output, Y.to(device))
optimizer.zero_grad()
loss.backward()
optimizer.step()
for group in optimizer.param_groups:
group['lr'] = lr/(i+1)
```
This is how the convergence of parameters with each iteration looks:

It can be seen that although the noise variance converges to its true value (=1), the output scale does not.

Training using 2nd order optimizer (LBFGS):

optimizer = torch.optim.LBFGS(model.parameters(),lr=0.1)
def closure():
output = model(X.to(device))
loss = -mll(output, Y.to(device))
optimizer.zero_grad()
loss.backward()
return loss
for i in range(50): 
optimizer.step(closure)

This is how the convergence of parameters with each iteration looks:

Both the hyperparameters converge to the approximate true value.

The plot of negative marginal log-likelihood vs parameter values (keeping other parameters fixed to their true values) looks like this:

Note that this plot is independent of the optimizer since there is no training involved- just the computation of loss.

While the lengthscale and noise variance have global minimas at their respective true values, this is not the case for the output scale.

Why is the output scale not learnt in the case of first-order optimization even in the simple case of normally distributed data with a known lengthscale?
Why does second-order optimization result in better convergence? Is that expected?
What is the reason behind the absence of a global minimum at the true value in the loss vs output scale plot? (On using other frameworks, a smooth plot with a minimum at the true value for all 3 parameters was obtained)

The rough answer to all of these issues is non-convexity of these parameters with respect to the loss function (e.g. the MLL) which is why lbfgs-b performs better than SGD here. Quasi-newton-like methods tend to perform better at some types of non-convex problems.

The presence of the local minima on the profile plot that you displayed is why SGD converges to something sub-optimal.

To improve the optimization results, I would try a constant learning rate with momentum or something like Adam (which is used in most of the tutorial). I suspect that some kind of adaptivity could help you push beyond the local minima.

From an optimization perspective, with n = 1024, you're using CG and lanczos (well mostly CG) to estimate the gradients of these parameters. There's a known but slight bias in the gradients when using CG there, see section 3.1 of this paper and this paper. To improve the optimization results while still using SGD, you can try a) changing the maximum cholesky size, see https://docs.gpytorch.ai/en/stable/settings.html#gpytorch.settings.max_cholesky_size b) decreasing the cg tolerance to something like 0.01 or less, see https://docs.gpytorch.ai/en/stable/settings.html#gpytorch.settings.cg_tolerance.

Either of those will probably produce a more smooth plot for the lengthscale parameter, especially the cholesky one.

cornellius-gp / gpytorch

Convergence of parameters on a synthetic dataset #1962