cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch
MIT License

How to implement multiple restarts in GPytorch - ML-II training #1091

Open vr308 opened 4 years ago

vr308 commented 4 years ago

It is recommended that ML-II training (basically optimisation) be accompanied by multiple restarts (picking the run with the highest marginal likelihood) to avoid bad local optima. For some kernels, like the spectral mixture kernel, the local optima problem is catastrophic.

I couldn't find an example in the docs which shows how to set the number of restarts explicitly. I've concluded that one has to do it manually by running the snippet below repeatedly?

import torch
import gpytorch

# Find optimal model hyperparameters
# (model, likelihood, train_x and train_y are assumed to be defined as in the tutorial)
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

training_iter = 100
for i in range(training_iter):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iter, loss.item()))
    optimizer.step()
gpleiss commented 4 years ago

That is correct. For the spectral mixture kernel, different seeds should get you different restarts.
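
For example (a minimal sketch, reusing the tutorial's SpectralMixtureGPModel, train_x and train_y), you can re-seed before building each model so the hyperparameters start from different random values, and keep the run with the smallest final negative MLL:

import torch
import gpytorch

n_restarts, training_iter = 5, 100
best_loss, best_model, best_likelihood = float('inf'), None, None

for seed in range(n_restarts):
    torch.manual_seed(seed)  # a different seed gives a different random initialisation
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model = SpectralMixtureGPModel(train_x, train_y, likelihood, 2)
    model.train()
    likelihood.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    for i in range(training_iter):
        optimizer.zero_grad()
        loss = -mll(model(train_x), train_y)
        loss.backward()
        optimizer.step()
    # keep the restart with the smallest final negative MLL (i.e. highest marginal likelihood)
    if loss.item() < best_loss:
        best_loss, best_model, best_likelihood = loss.item(), model, likelihood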

vr308 commented 4 years ago

thanks.. another thing about spectral mixture kernels: I noticed that the mixture weights don't add up to 1 (either at initialisation or in the final trained estimates).. don't they have to, for the underlying spectral density of the kernel to be a valid density?

vr308 commented 4 years ago

On the issue of multiple restarts: I adapted the tutorial code to run the training part multiple times, preserve the trained model for each run and pick the one which converged to the lowest negative LML.

Perform ML-II with multiple restarts

import numpy as np

print('Performing ML-II with multiple restarts')

n_restarts = 5
num_steps = 700
num_mixtures = 2

loss_traces = np.empty(shape=(n_restarts, num_steps))
means_traces = np.empty(shape=(n_restarts, num_steps, num_mixtures))
scales_traces = np.empty(shape=(n_restarts, num_steps, num_mixtures))
weights_traces = np.empty(shape=(n_restarts, num_steps, num_mixtures))
noise_traces = np.empty(shape=(n_restarts, num_steps))
trained_models = []

for i in range(n_restarts):

    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model_ml = SpectralMixtureGPModel(train_x, train_y, likelihood, num_mixtures)

    model_ml.train()
    likelihood.train()

    # Use the adam optimizer
    optimizer = torch.optim.Adam(model_ml.parameters(), lr=0.01)
    # "Loss" for GPs - the negative marginal log likelihood
    mll_ml = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model_ml)

    loss_trace, means_trace, scales_trace, weights_trace, noise_trace = [], [], [], [], []

    for j in range(num_steps):
        optimizer.zero_grad()
        output = model_ml(train_x)
        loss = -mll_ml(output, train_y)
        # store plain floats/arrays, not graph-attached tensors
        loss_trace.append(loss.item())
        means_trace.append(model_ml.covar_module.mixture_means.detach().flatten().numpy())
        weights_trace.append(model_ml.covar_module.mixture_weights.detach().flatten().numpy())
        scales_trace.append(model_ml.covar_module.mixture_scales.detach().flatten().numpy())
        noise_trace.append(model_ml.likelihood.noise.item())
        loss.backward()
        if j % 100 == 0:
            print('Iter %d/%d - Loss: %.3f' % (j + 1, num_steps, loss.item()))
        optimizer.step()

    loss_traces[i] = np.array(loss_trace)
    means_traces[i] = np.array(means_trace)
    scales_traces[i] = np.array(scales_trace)
    weights_traces[i] = np.array(weights_trace)
    noise_traces[i] = np.array(noise_trace)

    trained_models.append((model_ml, likelihood))

# The stored losses are negative LMLs, so the smallest final loss corresponds
# to the restart with the highest marginal likelihood
converged_nlml = [trace[-1] for trace in loss_traces]
best_idx = np.argmin(converged_nlml)
final_model, final_likelihood = trained_models[best_idx]

q1: I have the feeling I am not doing this the right way... q2: Is there a better way to extract the trace of the means, scales and weights from the kernel at every step?

jacobrgardner commented 4 years ago

The weights don't really have to sum to 1. If you'd like, you can think of having them not sum to 1 as just having normalized weights with an outputscale built in equal to the sum of the weights. E.g., replace $w_i$ with $w_i / \sum_{j=1}^{Q} w_j$, and then $k_{\mathrm{scaled}}(\tau) = \left(\sum_{j=1}^{Q} w_j\right) k_{\mathrm{normalized\,sm}}(\tau)$.
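
As a quick numeric illustration of that equivalence (plain arithmetic, nothing gpytorch-specific):

import torch

w = torch.tensor([0.7, 0.2, 0.6])   # un-normalised mixture weights (Q = 3)
outputscale = w.sum()               # built-in outputscale = sum of the weights
w_normalised = w / outputscale      # these now sum to 1, i.e. a valid mixing density
assert torch.allclose(outputscale * w_normalised, w)
# so k_scaled(tau) = (sum_j w_j) * k_normalised_sm(tau) reproduces the original kernel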

I wouldn't necessarily keep all of the models around. I might do:

model_states = []
losses = torch.zeros(n_restarts, num_steps)
...
model_states.append({param_name: param.detach().clone() for param_name, param in model_ml.state_dict().items()})  # clone so later optimizer steps don't overwrite the snapshot
losses[i, j] = loss.item()

Then:

best_final_idx = torch.argmin(losses[:, -1])
best_model_params = model_states[best_final_idx]
model = CreateNewModel()
model.load_state_dict(best_model_params)

Note that model_states also has a trace of the covar module hypers by definition. You could get them using a dictionary comprehension, something like:

hyper_trace_dict_i = {param_name: param for param_name, param in model_states[i].items() if 'covar_module' in param_name}
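
A sketch of how those pieces could fit around the restart loop above. Where exactly the snapshot is taken is an assumption here: once per restart, after training; append inside the inner loop instead if you want a per-step trace.

model_states = []
losses = torch.zeros(n_restarts, num_steps)

for i in range(n_restarts):
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model_ml = SpectralMixtureGPModel(train_x, train_y, likelihood, 2)
    model_ml.train()
    likelihood.train()
    optimizer = torch.optim.Adam(model_ml.parameters(), lr=0.01)
    mll_ml = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model_ml)

    for j in range(num_steps):
        optimizer.zero_grad()
        loss = -mll_ml(model_ml(train_x), train_y)
        loss.backward()
        optimizer.step()
        losses[i, j] = loss.item()

    # snapshot the hyperparameters of this restart (clone so nothing is overwritten later)
    model_states.append({name: p.detach().clone() for name, p in model_ml.state_dict().items()})

# smallest final negative MLL = highest marginal likelihood
best_final_idx = torch.argmin(losses[:, -1]).item()
best_model = SpectralMixtureGPModel(train_x, train_y, gpytorch.likelihoods.GaussianLikelihood(), 2)
best_model.load_state_dict(model_states[best_final_idx])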
vr308 commented 4 years ago

thanks for that! ...just noticed in the code that the initialise-from-data method uses (std(y) / Q) for the weights... looking at your argument above, shouldn't it be (var(y) / Q), since it's the variances that add together and not the standard deviations?

#  Mixture weights should be roughly the stdv of the y values divided by the number of mixtures
        self.raw_mixture_weights.data.fill_(train_y.std() / self.num_mixtures)
gpleiss commented 4 years ago

@vr308 our initialization code follows @andrewgordonwilson's initialization guidelines:

The weights are related to the signal variance of the data. If we look at k(0,0) we get the sum of the weights. This value is also the signal variance of our distribution over functions. We therefore initialize the weights as the standard deviation of the data divided by the number of components (assuming that the noise standard deviation is negligible compared to the signal). All the weights are initialized to the same value here.
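
For reference, a small runnable sketch of that initialization (assuming a reasonably recent gpytorch version; initialize_from_data is the helper that applies these guidelines, and the fill quoted above targets the raw weights):

import torch
import gpytorch

train_x = torch.linspace(0, 1, 100)
train_y = torch.sin(train_x * 6.28) + 0.1 * torch.randn(100)

Q = 4
kernel = gpytorch.kernels.SpectralMixtureKernel(num_mixtures=Q)
kernel.initialize_from_data(train_x, train_y)

# every raw weight starts at std(y) / Q, so their sum is roughly the signal std
print(kernel.raw_mixture_weights.flatten())
print(train_y.std() / Q)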

frgsimpson commented 4 years ago

I think @vr308 is correct, and that there's an error in the guidelines. Note that the current prescription is not dimensionally consistent: if y is in units of M, then y.std() is also in units of M, but the weights are in units of M^2.

gpleiss commented 4 years ago

Sorry for the slow response! I would buy that argument, and we'd be open to a PR to change that!

andrewgordonwilson commented 4 years ago

It depends on the parametrization. Those guidelines were assuming the weights are 'w' (rather than 'w^2') and the kernel is parametrized as $\sum_{j=1}^{Q} w_j^2 \, (k_{\mathrm{sm}})_j$. In general I've found it more numerically stable to work with parameters rather than squared parameters (e.g., length-scale rather than length-scale squared, noise std rather than noise variance, etc.). I've also found this tends to be less prone to bugs on the user side, as the non-squared hyperparameters are often more interpretable.
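
Spelling that out as a quick units check (following the arguments above): under the squared parametrization, $k(0) = \sum_{j=1}^{Q} w_j^2$ carries units of $y^2$, so each $w_j$ carries units of $y$ and setting $w_j = \mathrm{std}(y)/Q$ is dimensionally consistent. If instead the weights enter linearly, $k(0) = \sum_{j=1}^{Q} w_j$ still has units of $y^2$, so the weights themselves must be variance-like, which is the inconsistency raised above.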