cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch
MIT License

Variance in repeated mll evaluations #377

Closed: bletham closed this issue 5 years ago

bletham commented 5 years ago

We're seeing a rather large amount of variance in the MLL value of an ExactGP across repeated evaluations, even while holding the model and data fixed. Here is a repro case in which the MLL changes by up to 30% across evaluations:

from collections import OrderedDict

import torch

from gpytorch.distributions import MultivariateNormal
from gpytorch.kernels import MaternKernel, ScaleKernel
from gpytorch.likelihoods import Likelihood, GaussianLikelihood
from gpytorch.means import ConstantMean
from gpytorch.mlls.exact_marginal_log_likelihood import ExactMarginalLogLikelihood
from gpytorch.models import ExactGP
from gpytorch.priors import GammaPrior

# Data

X = torch.Tensor([
    [-0.0843,  7.2533],
    [ 0.5766,  0.1072],
    [ 0.7004,  0.1931],
    [ 0.6442,  0.6576],
    [ 0.9669,  0.9565],
    [ 0.2082,  0.7459],
    [ 0.8764,  0.3689],
    [ 0.0766,  0.6958],
    [ 0.4030,  0.3839],
    [ 0.8547,  0.7535],
    [ 0.6673,  1.0000],
    [ 1.0000,  0.9685],
    [ 0.0520,  0.0000],
    [ 1.0000,  1.0000],
    [ 1.0000,  0.0000],
])

Y = torch.Tensor([
    -1.8684,  0.7750,  0.3887,  0.0143, -0.9510,  0.8281, -0.2027,  1.1532,
    0.8408, -0.5322, -0.3719, -1.0300,  2.1166, -1.0564, -0.1041,
])

# Model

class SingleTaskGP(ExactGP):
    """
    Class implementing a single task exact GP using relatively strong priors on
    the Kernel hyperparameters, which work best when covariates are normalized
    to the unit cube and outcomes are standardized (zero mean, unit variance).
    """

    def __init__(
        self, train_X: torch.Tensor, train_Y: torch.Tensor, likelihood: Likelihood
    ) -> None:
        super().__init__(train_X, train_Y, likelihood)
        self.mean_module = ConstantMean()
        self.covar_module = ScaleKernel(
            MaternKernel(
                nu=2.5,
                ard_num_dims=2,
                log_lengthscale_prior=GammaPrior(2.0, 5.0, log_transform=True),
            ),
            log_outputscale_prior=GammaPrior(1.1, 0.05, log_transform=True),
        )

    def forward(self, x: torch.Tensor) -> MultivariateNormal:
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return MultivariateNormal(mean_x, covar_x)

likelihood = GaussianLikelihood()
model = SingleTaskGP(X, Y.view(-1), likelihood)

# Set model state
sd = OrderedDict([
    ('likelihood.log_noise', torch.Tensor([[-6.9565]])),
    ('mean_module.constant', torch.Tensor([[-0.1901]])),
    ('covar_module.log_outputscale', torch.Tensor([-0.4566])),
    ('covar_module.base_kernel.log_lengthscale', torch.Tensor([[[-0.8539,  0.3291]]])),
    ('covar_module.base_kernel.log_lengthscale_prior.concentration', torch.Tensor([2.])[0]),
    ('covar_module.base_kernel.log_lengthscale_prior.rate', torch.Tensor([5.])[0]),
    ('covar_module.log_outputscale_prior.concentration', torch.Tensor([1.1000])[0]),
    ('covar_module.log_outputscale_prior.rate', torch.Tensor([0.0500])[0]),
])
model.load_state_dict(sd)

# Evaluate mll
mll = ExactMarginalLogLikelihood(likelihood, model)
output = mll.model(*mll.model.train_inputs)

for i in range(10):
    print(mll(output, mll.model.train_targets).item())

The output is:

-1.0269606113433838
-1.044615387916565
-0.9483365416526794
-1.0176225900650024
-1.2577258348464966
-0.922021210193634
-1.1497862339019775
-0.922558605670929
-1.3168827295303345
-0.8927210569381714

With this particular model, when the parameters are left initialized at 0 (i.e., the state dict load is commented out), the MLL is -1.7. The variance above is therefore comparable to the resolution at which we'd like to optimize, which makes it challenging to use a non-stochastic optimizer.

I'm guessing this variance comes from the approximations used in estimating the MLL? Is there a parameter we can use to tune the fidelity of those approximations to improve the optimization?

gpleiss commented 5 years ago

We estimate the log determinant term in the MLL with stochastic trace estimation. This is the primary source of randomness in the MLL computation, and why it's high variance.
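
For intuition, here is a minimal, self-contained sketch of a Hutchinson-style stochastic trace estimator (a toy illustration, not GPyTorch's actual code path): the trace of a matrix is estimated by averaging z^T A z over random probe vectors z, and the run-to-run spread of that average is exactly the kind of variance shown above.

import torch

# A symmetric positive-definite toy matrix standing in for the kernel matrix.
A = torch.randn(15, 15)
A = A @ A.t() + 15 * torch.eye(15)

def hutchinson_trace(A, num_probes):
    # Gaussian probes z satisfy E[z^T A z] = tr(A), so the mean over probes is unbiased.
    z = torch.randn(A.shape[0], num_probes)
    return (z * (A @ z)).sum(dim=0).mean()

print(torch.trace(A).item())                                           # exact trace
print([round(hutchinson_trace(A, 10).item(), 2) for _ in range(3)])    # noisy estimates
print([round(hutchinson_trace(A, 1000).item(), 2) for _ in range(3)])  # much tighter estimates

More probe vectors means a tighter estimate, which is what the num_trace_samples setting mentioned below controls.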

There are two knobs that you can use to decrease the variance: the context managers gpytorch.settings.max_lanczos_quadrature_iterations and gpytorch.settings.num_trace_samples. E.g.

with gpytorch.settings.max_lanczos_quadrature_iterations(32), gpytorch.settings.num_trace_samples(128):
    mll = ExactMarginalLogLikelihood(likelihood, model)
    loss = mll(output, mll.model.train_targets)  # the settings must be active when the MLL is evaluated

In general, unless you are using the actual MLL values for something (e.g. for Gibbs sampling), I would only adjust the num_trace_samples context manager. The max_lanczos_quadrature_iterations will affect the actual MLL number that you see, but has no effect on the MLL gradients. num_trace_samples is the only thing that controls the variance of the gradients.

Also it is worth noting that (assuming CG runs to convergence) the gradients returned by our function are unbiased estimators of the true gradient.
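
As a rough way to see both of those points empirically, here is a sketch (not code from this thread) that reuses the model, mll, and likelihood objects from the repro above, along with the likelihood.log_noise parameter from its state dict, and measures the spread of a hyperparameter gradient under two num_trace_samples settings:

import gpytorch
import torch

def log_noise_grad_std(num_samples, reps=10):
    # Std of the log_noise gradient across repeated MLL evaluations.
    grads = []
    with gpytorch.settings.num_trace_samples(num_samples):
        for _ in range(reps):
            model.zero_grad()
            out = model(*model.train_inputs)
            loss = -mll(out, model.train_targets)
            loss.backward()
            grads.append(likelihood.log_noise.grad.clone())
    return torch.stack(grads).std(dim=0)

print(log_noise_grad_std(10))   # noisier gradient estimates
print(log_noise_grad_std(100))  # tighter gradient estimates (unbiased in both cases)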

jacobrgardner commented 5 years ago

@bletham The variance that you see will be higher when you have worse-conditioned matrices. In this case, with a log_noise of roughly -7 you have a very small diagonal component, which makes things challenging.

To answer your question succinctly: the simplest knob to tune, by far, is the size of the preconditioner, via gpytorch.settings.max_preconditioner_size. For example:

# The first part of your notebook goes here
...
import gpytorch
with gpytorch.settings.max_preconditioner_size(15):
    mlls = []
    for i in range(10):
        mll_i = mll(output, mll.model.train_targets).item()
        print(mll_i)
        mlls.append(mll_i)
print(torch.tensor(mlls).std()) # Outputs 0.0055

Two things to note:

  1. The standard deviation with default parameters is 0.11, which is about a 10% deviation, so 30% deviations should be relatively rare.
  2. The gradients we get are much lower variance than the MLL values, because the log det gradient term only involves a linear solve, and we are much better at linear solves than at log dets (see the small illustration after this list).
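
To unpack note 2 a little (a dense-matrix toy illustration, not GPyTorch internals): the gradient of the log det term reduces to a trace of solves via d/dtheta log det K(theta) = tr(K^{-1} dK/dtheta), so once you can solve linear systems with K, you can get the gradient.

import torch

# Toy check of d/dtheta log det K(theta) = tr(K^{-1} dK/dtheta),
# using K(theta) = A + theta * I so that dK/dtheta = I and the trace is tr(K^{-1}).
A = torch.randn(5, 5)
A = A @ A.t() + 5 * torch.eye(5)   # symmetric positive definite
theta = torch.tensor(0.5, requires_grad=True)

K = A + theta * torch.eye(5)
torch.logdet(K).backward()

K_inv = torch.inverse(A + 0.5 * torch.eye(5))
print(theta.grad.item(), torch.trace(K_inv).item())  # the two numbers agree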

Just for completeness, there are a number of other parameters that control the various numerical trade-offs. gpytorch.settings.max_cg_iterations and gpytorch.settings.max_lanczos_quadrature_iterations make solves and log dets more accurate respectively, while gpytorch.settings.num_trace_samples reduces variance:

import gpytorch
with gpytorch.settings.num_trace_samples(50):
    mlls = []
    for i in range(10): 
        mll_i = mll(output, mll.model.train_targets).item()
        print(mll_i)
        mlls.append(mll_i)
print(torch.tensor(mlls).std()) # Outputs 0.04

jacobrgardner commented 5 years ago

Hah, it seems that Geoff and I replied concurrently :-)

Balandat commented 5 years ago

Yeah so regarding values vs gradients: If we're using L-BFGS to optimize this then the actual function value does matter, as it's used extensively in the optimization to perform line search.

An interesting question, then, is whether there is a way of progressively increasing the approximation quality as we get closer to the optimum. Maybe using a callback from the solver to set that value?
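
By hand, that schedule could look something like the sketch below (hypothetical, not an existing GPyTorch feature; it reuses model and mll from the repro, and the sample counts and preconditioner size are arbitrary): re-enter the settings context with higher fidelity between rounds of L-BFGS.

import gpytorch
import torch

optimizer = torch.optim.LBFGS(model.parameters(), max_iter=20)

def closure():
    optimizer.zero_grad()
    out = model(*model.train_inputs)
    loss = -mll(out, model.train_targets)
    loss.backward()
    return loss

# Coarse-to-fine: later rounds use more trace samples and a larger preconditioner.
for num_samples in [10, 30, 100]:
    with gpytorch.settings.num_trace_samples(num_samples), \
            gpytorch.settings.max_preconditioner_size(15):
        optimizer.step(closure)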

bletham commented 5 years ago

@jacobrgardner Thanks, the max preconditioner size does seem to help a lot.

I thought there was a noise nugget added to the diagonal to keep things well conditioned independently of likelihood.log_noise. I take it that isn't the case, so we should make sure the noise stays a bit away from 0?

jacobrgardner commented 5 years ago

We add jitter for variational inference so that there's a diagonal component at all (which is a requirement for our preconditioning strategy... for now :-)), but we don't add jitter for exact GPs, because that would effectively enforce a minimum noise level.

I suspect that when using a softplus transform you'll never end up with noise levels on the order of exp(-9) anyway, but a SmoothedBoxPrior on the noise would also prevent this.
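
For example, something along these lines (a sketch only: SmoothedBoxPrior is a real gpytorch prior, but the log_noise_prior keyword and the bounds below are illustrative assumptions modeled on the kernel priors in the repro, not verified against this gpytorch version):

from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.priors import SmoothedBoxPrior

# Keep log_noise roughly inside [-5, 2], i.e. noise between ~exp(-5) and ~exp(2).
# Keyword name and bounds are illustrative assumptions, not verified API.
likelihood = GaussianLikelihood(
    log_noise_prior=SmoothedBoxPrior(-5.0, 2.0, sigma=0.1)
)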

bletham commented 5 years ago

Great, thanks.