cornellius-gp / gpytorch


[Docs] save and load various GP models (vanilla GP, varGP and sparse GP) #1243


RodrigoAVargasHdz commented 4 years ago

📚 Documentation/Examples

Hi,

I've been trying to use variational GPs (varGP) and sparse GPs (spGP) for a data set that is semi-large, approximately 20,000 training points. I've been having some problems: a GP trained in sklearn with only 3,000 points gives a better result than a varGP with over 1,000 inducing points; all methods are tested with the same RBF kernel. The only difference is that I normalize the data set for the GPyTorch models but not for the sklearn one.

I was wondering if you could guide me on my implementation of these three methods in GPyTorch. My code is based entirely on the online documentation; see the code examples below.

---------------------------------------------------------------------------------------

Vanilla GP

class GPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(GPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel(ard_num_dims=51))

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return MultivariateNormal(mean_x, covar_x)

initialize GP model

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPModel(train_x, train_y, likelihood)

optimizer

optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.01)

loss function

mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
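
The training loop itself isn't shown here; for context, a minimal sketch of the standard exact-GP loop from the docs (n_iter is a placeholder), which is where the loss used below comes from:

model.train()
likelihood.train()
for i in range(n_iter):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)  # negative marginal log likelihood
    loss.backward()
    optimizer.step()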

save best model during training

gp_file --> file name

loss0 --> lowest known -mll value

if loss.item() < loss0:
    loss0 = loss.item()
    state_dict = model.state_dict()
    torch.save(state_dict, gp_file)

load best model

state_dict_file = torch.load(gp_file)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPModel(train_x, train_y, likelihood)
model.load_state_dict(state_dict_file)  

prediction

preds = likelihood(model(x))
mean = preds.mean.cpu() 
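
For reference, the prediction pattern in the docs also puts the model and likelihood into eval mode first; a minimal sketch, where x is the tensor of test inputs:

model.eval()
likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    preds = likelihood(model(x))
    mean = preds.mean.cpu()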

---------------------------------------------------------------------------------------

Variational GP

class varGPModel(ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = CholeskyVariationalDistribution(inducing_points.size(0))
        variational_strategy = VariationalStrategy(self, inducing_points, variational_distribution, learn_inducing_locations=True)
        super(varGPModel, self).__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel(ard_num_dims=51))

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

initialize model

n_inducing_points --> number of inducing points

i0 = torch.randint(0, train_x.size(0), (n_inducing_points,))
inducing_points = train_x[i0]
model = varGPModel(inducing_points=inducing_points)
likelihood = gpytorch.likelihoods.GaussianLikelihood()        

optimizer

optimizer = torch.optim.Adam([
    {'params': model.parameters()},
    {'params': likelihood.parameters()},
], lr=0.01)

loss function

mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0))
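
For a ~20,000-point data set the ELBO is normally optimized over minibatches; a minimal sketch of such a loop, assuming a DataLoader over the training tensors (batch_size and n_epochs are placeholders):

from torch.utils.data import TensorDataset, DataLoader

train_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=1024, shuffle=True)

model.train()
likelihood.train()
for epoch in range(n_epochs):
    for x_batch, y_batch in train_loader:
        optimizer.zero_grad()
        output = model(x_batch)
        loss = -mll(output, y_batch)  # negative ELBO on this minibatch
        loss.backward()
        optimizer.step()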

save best model during training

gp_file --> file name

loss0 --> lowest known -mll value

if loss.item() < loss0:
    loss0 = loss.item()
    torch.save(model.state_dict(), gp_file)
    inducing_points = model.variational_strategy.inducing_points.detach()

load best model

state_dict = torch.load(gp_file)
model = varGPModel(inducing_points=inducing_points)
model.load_state_dict(state_dict)   

prediction

preds = model(x_batch)
means = torch.cat([means, preds.mean.cpu()])
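
The means tensor above is assumed to be initialized before the loop; a minimal sketch of the full batched prediction, assuming a test_loader DataLoader over the test set:

model.eval()
likelihood.eval()
means = torch.tensor([0.])
with torch.no_grad():
    for x_batch, y_batch in test_loader:
        preds = likelihood(model(x_batch))
        means = torch.cat([means, preds.mean.cpu()])
means = means[1:]  # drop the dummy initial element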

---------------------------------------------------------------------------------------

sparse GP

class sparseGPModel(gpytorch.models.ExactGP):
    def __init__(self, x_inducing_points, train_x, train_y, likelihood):
        super(sparseGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.base_covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel(ard_num_dims=51))
        self.covar_module = gpytorch.kernels.InducingPointKernel(self.base_covar_module, inducing_points=x_inducing_points, likelihood=likelihood)

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return MultivariateNormal(mean_x, covar_x)

initialize model

n_inducing_points --> number of inducing points

i0 = torch.randint(0, train_x.size(0), (n_inducing_points,))
inducing_points = train_x[i0]
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = sparseGPModel(inducing_points, train_x, train_y, likelihood)

optimizer

optimizer = torch.optim.Adam([
    {'params': model.parameters()},
], lr=0.01)

loss function

mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

save best model during training

gp_file --> file name

loss0 --> lowest known -mll value

if loss.item() < loss0:
    loss0 = loss.item()
    state_dict = model.state_dict()
    torch.save(state_dict, gp_file)
    inducing_points = model.covar_module.inducing_points.detach()

load best model

state_dict = torch.load(gp_file)
model = sparseGPModel(inducing_points, train_x, train_y, likelihood)
model.load_state_dict(state_dict)

prediction

preds = model(x_batch)
means = torch.cat([means, preds.mean.cpu()])

jacobrgardner commented 4 years ago

@RodrigoAVargasHdz This all looks fine to me for the most part. Note that for variational GPs, you'll also need to save the likelihood's state separately, since the likelihood doesn't live on the model. Also, you don't really need to store the inducing points separately or anything -- those are parameters on the model as well, so they already exist in the model's state dict.
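
A minimal sketch of that, with placeholder file names and assuming the 51-dimensional inputs from the code above:

# saving
torch.save(model.state_dict(), 'vargp_model_state.pth')
torch.save(likelihood.state_dict(), 'vargp_likelihood_state.pth')

# loading: any inducing-point tensor of the right shape works as an initializer,
# since load_state_dict overwrites it with the saved values
model = varGPModel(inducing_points=torch.zeros(n_inducing_points, 51))
model.load_state_dict(torch.load('vargp_model_state.pth'))
likelihood = gpytorch.likelihoods.GaussianLikelihood()
likelihood.load_state_dict(torch.load('vargp_likelihood_state.pth'))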

If this doesn't answer your questions, feel free to reopen!

jacobrgardner commented 4 years ago

Sorry, I see you were also having performance issues. sklearn does not use variational inference -- are you having trouble matching exact GP performance as well?

RodrigoAVargasHdz commented 4 years ago

Hi Jake, thanks for getting back to me.

I know sklearn does not use VI for GPs; however, I compared both vanilla GPs, GPyTorch vs sklearn, and I still get a better result with the sklearn one. I uploaded my code for the various GPs I am using, i.e., sklearn GP, and GPyTorch vanilla, sparse, and variational.

https://github.com/RodrigoAVargasHdz/GPytorch_PES.git

Thanks for all your help :)


RodrigoAVargasHdz commented 4 years ago

When I combine two Matern kernels or two RBF kernels, the kernels' parameters after training are exactly the same.
I defined my kernel as ScaleKernel(MaternKernel(nu=2.5, ard_num_dims=d)) + ScaleKernel(MaternKernel(nu=2.5, ard_num_dims=d)).

Also, as I mentioned before, for a test set I cannot reach the accuracy of a GP model trained in sklearn. I tried increasing the precision of the model to double and running longer optimizations, but the difference between the two models is still quite large.
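
(Possibly relevant: the two terms in that sum start from identical default hyperparameters, so they can receive identical gradients and simply track each other during training. A hedged sketch of breaking that symmetry by initializing the two terms differently; the values are purely illustrative:)

d = train_x.size(1)
k1 = gpytorch.kernels.ScaleKernel(gpytorch.kernels.MaternKernel(nu=2.5, ard_num_dims=d))
k2 = gpytorch.kernels.ScaleKernel(gpytorch.kernels.MaternKernel(nu=2.5, ard_num_dims=d))
k1.base_kernel.lengthscale = 0.5  # illustrative initial lengthscales
k2.base_kernel.lengthscale = 2.0
covar_module = k1 + k2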

jacobrgardner commented 4 years ago

@RodrigoAVargasHdz So, there was a lot of code in your repo and rather than combing through it to find the bug I just coded up a working version on your dataset myself:

Notebook_for_Rodrigo.ipynb.txt

The difference in error I get in the notebook is pretty small -- 0.0040 vs 0.0060 -- and the test NLLs are extremely close. The small difference is most likely explained by one of two sources:

  1. Your sklearn model as coded doesn't learn a likelihood noise (and uses a constant 1e-10 jitter, equivalent to a likelihood noise of 1e-10), while GPyTorch by default has a noise_constraint=GreaterThan(1e-4) in GaussianLikelihood, which we default to primarily for numerical stability reasons. Note that you can actually see the numerical stability issues in sklearn that result from using such a tiny added diagonal: I had to clamp the predictive standard deviations at a positive value because some were <= 0.

  2. sklearn takes roughly 30x longer to train, in part because it's using BFGS, which is much less approximate than Adam, so you might get ε-better hyperparameters if you use PyTorch's L-BFGS for training (a rough sketch of both tweaks follows below).
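
A rough sketch of both tweaks, assuming the exact-GP setup from the original post and purely illustrative settings:

# 1. Relax the default noise constraint so the learned noise can approach sklearn's ~1e-10 jitter
likelihood = gpytorch.likelihoods.GaussianLikelihood(
    noise_constraint=gpytorch.constraints.GreaterThan(1e-10)
)
model = GPModel(train_x, train_y, likelihood)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

# 2. Optimize the hyperparameters with PyTorch's L-BFGS instead of Adam
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1, max_iter=20)

model.train()
likelihood.train()

def closure():
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    return loss

for i in range(50):
    optimizer.step(closure)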