cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch

[Docs] Get started with DeepGPs regression #1080

Open Hephaistos96 opened 4 years ago

Hephaistos96 commented 4 years ago

Good morning,

Context & Issue

I am currently in the process of discovering DeepGPs as part of a school project.

I have already tried deepgp, a library created by the Sheffield ML group. It generates pretty good and consistent results, but its computation time gets particularly long with training sets containing only a few thousand points (code scalability is a major constraint of my project).

As a consequence, I have decided to explore new approaches, in particular GPyTorch Deep GPs. I followed the implementation tutorial available on the documentation website and indeed got a significant improvement in time performance. However, I haven't been able to get the same prediction accuracy as with the other library.

Here is an illustration of my issue (full code at the end of the post): I designed a function containing multiple "steps", randomly picked a few points in the interval [-2; 2], and ran both deepgp and GPyTorch Deep GPs to generate a mean prediction on the same interval. As you can see, GPyTorch produces a result that is a lot noisier:

GPyTorch

[figure: GPyTorch mean prediction on the step function]

deepgp

[figure: deepgp mean prediction on the step function]

I am not sure what is causing this issue. Since GPyTorch Deep GPs let you control a lot of different parameters (deep GP layer structure, number of inducing points, number of samples, batch size, etc.) that deepgp manages on its own, I thought my GPyTorch DeepGP might be parameterized incorrectly. I have tried many different parameter combinations, but nothing really worked.
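For reference, these are the settings used for the run shown above (full values in the code at the end of the post):

# Hyperparameters of the GPyTorch run shown above
# (two GP layers: one hidden layer of width 1, plus the output layer)
num_output_dims = 1    # output dimensions of the hidden layer
num_inducing = 20      # inducing points per layer (deepgp was also run with num_inducing=20)
num_epochs = 200
num_samples = 150      # likelihood samples per minibatch
batch_size = 25
# optimizer: Adam with lr=0.01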

I would therefore appreciate your opinion on the potential source of my issue.

Question

Am I right in thinking that my model parameters aren't set well, or do you think my issue has another explanation? If it is the former, do you know where I could find documentation about the influence of each parameter and how to choose them as wisely as possible?

Additional notes and questions

My project actually consists of 3D regressions on training sets that look like this:

[figure: capture of a 3D training set]

I have tried to run GPyTorch Deep GPs on this kind of example, and the result is a lot worse than what I get in the 2D case: I get a constant prediction over the whole prediction surface. On simpler 3D functions (x² + y², for instance) and with a training set consisting of regularly spaced points, I do get a good result though.
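Roughly, this is how I plug the 3D data into the same pipeline (a minimal sketch; xs, ys and zs are placeholder names for my sample coordinates, and DeepGP and the hyperparameters are the ones defined in the code below):

# 3D case: two input dimensions (x, y), one output (z).
# xs, ys, zs are hypothetical 1-D NumPy arrays holding the sample coordinates.
X3d = np.column_stack([xs, ys]).astype('float32')   # shape (n_samples, 2)
train_x = torch.from_numpy(X3d)
train_y = torch.from_numpy(zs.astype('float32'))

train_dataset = TensorDataset(train_x, train_y)
model = DeepGP(train_x.shape, num_output_dims, num_inducing)
# the training loop is identical to the 2D case below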

Code (2D example)

import numpy as np
import plotly.graph_objs as go

import gpytorch

import torch
import tqdm
from torch.nn import Linear
from gpytorch.means import ConstantMean, LinearMean
from gpytorch.kernels import RBFKernel, ScaleKernel
from gpytorch.variational import VariationalStrategy, CholeskyVariationalDistribution
from gpytorch.distributions import MultivariateNormal
from gpytorch.models import ApproximateGP, GP
from gpytorch.mlls import VariationalELBO, AddedLossTerm
from gpytorch.likelihoods import GaussianLikelihood
from torch.utils.data import TensorDataset, DataLoader
from gpytorch.models.deep_gps import DeepGPLayer, DeepGP
from gpytorch.mlls import DeepApproximateMLL

def latentFunction(x):
    if x < -1.5:
        return 0
    if x < -1:
        return 1
    if x < 0:
        return 0
    if x < 0.5:
        return -1
    if x < 1:
        return 0
    if x < 1.5:
        return 1
    return 0

def plotResults(x_samples, y_samples, x_rep, mean, var):
    y_std = np.sqrt(var)
    yt_upper = mean + 2 * y_std
    yt_lower = mean - 2 * y_std

    trace_low = go.Scatter(
        x=x_rep,
        y=yt_lower.flatten(),
        marker=dict(color="#444"),
        line=dict(width=0),
        mode='lines',
        name='lower boundary'
    )

    trace_up = go.Scatter(
        x=x_rep,
        y=yt_upper.flatten(),
        mode='lines',
        name='upper boundary',
        marker=dict(color="#444"),
        line=dict(width=0),
        fillcolor='rgba(68, 68, 68, 0.3)',
        fill='tonexty'
    )

    fig = go.Figure(data=[
        go.Scatter(x=x_samples, y=y_samples, mode='markers', marker=dict(size=6, color= '#e74c3c'), name="samples"),
        go.Scatter(x=x_rep, y=mean, mode='lines', line_color="#16a085", name="mean"),
        trace_low,
        trace_up,
    ])
    fig.update_layout(showlegend=True, title="Results")
    fig.show()

n_samples = 150
noise = 0.03

x_samples = np.random.uniform(-2, 2, n_samples)
y_samples = np.array([latentFunction(x) for x in x_samples]) + np.random.normal(0, noise, n_samples)

n_rep = 200
x_rep = np.linspace(-2.2, 2.2, n_rep)
y_rep = np.array([latentFunction(x) for x in x_rep])

fig = go.Figure(data=[
    go.Scatter(x=x_samples, y=y_samples, mode='markers', marker=dict(size=6, color= '#e74c3c'), name="samples"),
    go.Scatter(x=x_rep, y=y_rep, mode='lines', line_color="#2980b9", name="latent function"),
])

fig.update_layout(showlegend=True, title="Synthetic example")
fig.show()

deepgp module

import GPy      # used by the initialization helper below
import deepgp   # Sheffield ML group's deep GP library

def initialize(self, noise_factor=0.01, linear_factor=1):
    """Helper function for deep model initialization."""
    self.obslayer.likelihood.variance = self.Y.var()*noise_factor #nice initialization
    for layer in self.layers:
        if type(layer.X) is GPy.core.parameterization.variational.NormalPosterior:
            if layer.kern.ARD:
                var = layer.X.mean.var(0)
            else:
                var = layer.X.mean.var()
        else:
            if layer.kern.ARD:
                var = layer.X.var(0)
            else:
                var = layer.X.var()

        # Average 0.5 upcrossings in four standard deviations. 
        layer.kern.lengthscale = linear_factor*np.sqrt(layer.kern.input_dim)*2*4*np.sqrt(var)/(2*np.pi)

# Bind the new method to the Deep GP object.
deepgp.DeepGP.initialize = initialize

def staged_optimize(self, iters=(1000,1000,30000), messages=(False, False, True)):
    """Optimize with parameters constrained and then with parameters released"""
    for layer in self.layers:
        # Fix the scale of each of the covariance functions.
        layer.kern.variance.fix(warning=False)
        layer.kern.lengthscale.fix(warning=False)

        # Fix the variance of the noise in each layer.
        layer.likelihood.variance.fix(warning=False)

    self.optimize(messages=messages[0],max_iters=iters[0])

    for layer in self.layers:
        layer.kern.lengthscale.constrain_positive(warning=False)
    self.obslayer.kern.variance.constrain_positive(warning=False)

    self.optimize(messages=messages[1],max_iters=iters[1])

    for layer in self.layers:
        layer.kern.variance.constrain_positive(warning=False)
        layer.likelihood.variance.constrain_positive(warning=False)
    self.optimize(messages=messages[2],max_iters=iters[2])

# Bind the new method to the Deep GP object.
deepgp.DeepGP.staged_optimize = staged_optimize

X_training = x_samples.reshape((n_samples, 1))
Y_training = y_samples.reshape((n_samples, 1))

layers = [Y_training.shape[1], 1, 1, 1, X_training.shape[1]]
inits = ['PCA'] * (len(layers) - 1)
kernels = []
for i in layers[1:]:
    kernels += [GPy.kern.RBF(i)]

m = deepgp.DeepGP(
    layers, Y=Y_training, X=X_training,
    inits=inits,
    kernels=kernels,
    num_inducing=20, back_constraint=False
)

# takes 1 minute
m.initialize()
m.staged_optimize(messages=(True,True,True))

mean,var = m.predict(x_rep.reshape(n_rep, 1))
plotResults(x_samples, y_samples, x_rep, mean.flatten(), var)

GPyTorch Deep GPs

from gpytorch.models.deep_gps import DeepGPLayer, DeepGP

class ToyDeepGPHiddenLayer(DeepGPLayer):
    def __init__(self, input_dims, output_dims, num_inducing=20, mean_type='constant'):
        if output_dims is None:
            inducing_points = torch.randn(num_inducing, input_dims)
            batch_shape = torch.Size([])
        else:
            inducing_points = torch.randn(output_dims, num_inducing, input_dims)
            batch_shape = torch.Size([output_dims])

        variational_distribution = CholeskyVariationalDistribution(
            num_inducing_points=num_inducing,
            batch_shape=batch_shape
        )

        variational_strategy = VariationalStrategy(
            self,
            inducing_points,
            variational_distribution,
            learn_inducing_locations=True
        )

        super(ToyDeepGPHiddenLayer, self).__init__(variational_strategy, input_dims, output_dims)

        if mean_type == 'constant':
            self.mean_module = ConstantMean(batch_shape=batch_shape)
        else:
            self.mean_module = LinearMean(input_dims)
        self.covar_module = ScaleKernel(
            RBFKernel(batch_shape=batch_shape, ard_num_dims=input_dims),
            batch_shape=batch_shape, ard_num_dims=None
        )

        self.linear_layer = Linear(input_dims, 1)

    def forward(self, x):
        mean_x = self.mean_module(x) # self.linear_layer(x).squeeze(-1)
        covar_x = self.covar_module(x)
        return MultivariateNormal(mean_x, covar_x)

class DeepGP(DeepGP):

    def __init__(self, train_x_shape, num_output_dims, num_inducing):
        hidden_layer = ToyDeepGPHiddenLayer(
            input_dims=train_x_shape[-1],
            output_dims=num_output_dims,
            mean_type='linear',
            num_inducing=num_inducing
        )

        last_layer = ToyDeepGPHiddenLayer(
            input_dims=hidden_layer.output_dims,
            output_dims=None,
            mean_type='constant',
            num_inducing=num_inducing
        )

        super().__init__()

        self.hidden_layer = hidden_layer
        self.last_layer = last_layer
        self.likelihood = GaussianLikelihood()

    def forward(self, inputs):
        hidden_rep1 = self.hidden_layer(inputs)
        output = self.last_layer(hidden_rep1)
        return output

    def predict(self, test_loader):
        with torch.no_grad():
            mus = []
            variances = []
            lls = []
            for x_batch, y_batch in test_loader:
                output = self(x_batch)
                preds = self.likelihood(output)
                mus.append(preds.mean)
                variances.append(preds.variance)
                lls.append(self.likelihood.log_marginal(y_batch, output))

        return torch.cat(mus, dim=-1), torch.cat(variances, dim=-1), torch.cat(lls, dim=-1)

num_output_dims = 1
num_inducing = 20
num_epochs = 200
num_samples = 150
batch_size = 25

train_x = torch.from_numpy(x_samples.reshape((n_samples, 1)).astype('float32'))
train_y = torch.from_numpy(y_samples.astype('float32'))

train_dataset = TensorDataset(train_x, train_y)
model = DeepGP(train_x.shape, num_output_dims, num_inducing)

# training
optimizer = torch.optim.Adam([{'params': model.parameters()}], lr=0.01)
mll = DeepApproximateMLL(VariationalELBO(model.likelihood, model, train_x.shape[-2]))
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
epochs_iter = tqdm.tqdm_notebook(range(num_epochs), desc="Epoch")

for i in epochs_iter:
    minibatch_iter = tqdm.tqdm_notebook(train_loader, desc="Minibatch", leave=False)
    for x_batch, y_batch in minibatch_iter:
        with gpytorch.settings.num_likelihood_samples(num_samples):
            optimizer.zero_grad()
            output = model(x_batch)
            loss = -mll(output, y_batch)
            loss.backward()
            optimizer.step()
            minibatch_iter.set_postfix(loss=loss.item())

# predict
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    gaussianDistribution = model(torch.from_numpy(x_rep.reshape((n_rep, 1)).astype('float32')))
    mean = gaussianDistribution.mean.mean(0).numpy()
    var = gaussianDistribution.variance.mean(0).numpy()

plotResults(x_samples, y_samples, x_rep, mean, var)
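For completeness, the predict helper defined above can be used instead of the manual call (a minimal sketch; I reuse y_rep as targets only so the log-likelihood term has something to evaluate):

# Alternative: batched prediction through the predict helper defined on the model.
model.eval()
test_x = torch.from_numpy(x_rep.reshape((n_rep, 1)).astype('float32'))
test_y = torch.from_numpy(y_rep.astype('float32'))
test_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=batch_size)

predictive_means, predictive_variances, test_lls = model.predict(test_loader)
mean = predictive_means.mean(0).numpy()        # average over likelihood samples
var = predictive_variances.mean(0).numpy()
plotResults(x_samples, y_samples, x_rep, mean, var)
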
jacobrgardner commented 4 years ago

@Hephaistos96 I'm like 99% confident this is just a mismatch between optimizer hyperparameter settings. Looking at the ones you use, a batch size of 25 is pretty small, so that might be the culprit? Maybe you can find the hypers that are sort of hidden by the .optimize call?
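For example (rough, untested sketch), after the staged optimization you could print the deepgp model to see where the hypers actually ended up:

# Inspect the hyperparameters the GPy/deepgp optimizer converged to.
print(m)  # GPy models print a table of all parameters and their constraints
for i, layer in enumerate(m.layers):
    print(i, layer.kern.lengthscale, layer.kern.variance, layer.likelihood.variance)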

In my experience, 150 samples is pretty crazily high (e.g., you probably don't need your predictive distributions to be represented, in practice, by a mixture of 150 Gaussians). It's probably better on real data to swap that out for just a few more inducing points.
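Something like this as a starting point (rough sketch; I haven't tuned these numbers for your problem):

# Rough sketch: bigger minibatches, more inducing points, far fewer likelihood samples.
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
model = DeepGP(train_x.shape, num_output_dims=1, num_inducing=128)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
mll = DeepApproximateMLL(VariationalELBO(model.likelihood, model, train_x.shape[-2]))

for epoch in range(num_epochs):
    for x_batch, y_batch in train_loader:
        with gpytorch.settings.num_likelihood_samples(10):
            optimizer.zero_grad()
            loss = -mll(model(x_batch), y_batch)
            loss.backward()
            optimizer.step()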