We estimate the log determinant term in the MLL with stochastic trace estimation. This is the primary source of randomness in the MLL computation, and is why it has high variance.
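In case it's helpful, here is a minimal sketch of what that estimator looks like (the toy matrix, the probe count, and the dense matrix logarithm are all just for illustration; GPyTorch evaluates the quadratic forms with Lanczos quadrature rather than forming log(K)):

import torch

# log det K = tr(log K) ~= (1/S) * sum_s z_s^T log(K) z_s for random probe vectors z_s
torch.manual_seed(0)
n, num_probes = 100, 10
A = torch.randn(n, n)
K = A @ A.t() + n * torch.eye(n)  # a well-conditioned PSD "kernel" matrix

# Dense matrix log via an eigendecomposition (illustration only)
evals, evecs = torch.linalg.eigh(K)
logK = evecs @ torch.diag(evals.log()) @ evecs.t()

# Rademacher probe vectors; more probes (cf. num_trace_samples) -> lower variance
z = torch.randint(0, 2, (n, num_probes), dtype=torch.float) * 2 - 1
estimate = (z * (logK @ z)).sum(dim=0).mean()
print(estimate.item(), torch.logdet(K).item())  # stochastic estimate vs. exact log det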
There are two knobs that you can use to decrease the variance: the context managers gpytorch.settings.max_lanczos_quadrature_iterations and gpytorch.settings.num_trace_samples. E.g.:
with gpytorch.settings.max_lanczos_quadrature_iterations(32), gpytorch.settings.num_trace_samples(128):
    mll = ExactMarginalLogLikelihood(likelihood, model)
In general, unless you are using the actual MLL values for something (e.g. for Gibbs sampling), I would only adjust the num_trace_samples context manager. The max_lanczos_quadrature_iterations setting will affect the actual MLL number that you see, but has no effect on the MLL gradients; num_trace_samples is the only thing that controls the variance of the gradients.
Also it is worth noting that (assuming CG runs to convergence) the gradients returned by our function are unbiased estimators of the true gradient.
A few other thoughts:
Sometimes, even when locking down the seeds, you'll get different results due to floating point errors. I'm assuming that using softplus for the non-negative transformation rather than exp (#370) will reduce how much these errors affect the final numerical results.
In general, one reason for high variance is that conjugate gradients might not be converging all the way. This is probably not the case for this dataset, since you don't have that much data (the default number of CG iterations is 20, which is more than the size of your dataset). However, for larger datasets, you can use the context manager gpytorch.settings.max_cg_iterations(200) to increase the number of CG iterations, or gpytorch.settings.max_preconditioner_size(10) to increase the size of the preconditioner (which makes for much better solves).
@bletham The variance that you'll get will be higher when you have worse-conditioned matrices. In this case, with a log_noise of -9 you have a very small diagonal component, which makes things challenging.
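As a quick illustration (with a toy RBF kernel matrix, not the model from this issue), shrinking the diagonal noise term sends the condition number way up:

import torch

torch.manual_seed(0)
x = torch.rand(50, 1)
K = torch.exp(-(x - x.t()).pow(2) / 0.1)  # an RBF-style kernel matrix on 50 random points
for log_noise in (-9.0, -2.0):
    Ky = K + torch.tensor(log_noise).exp() * torch.eye(50)
    print(log_noise, torch.linalg.cond(Ky).item())  # condition number of K + sigma^2 * I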
To actually succinctly answer your question, the simplest knob to tune by far is to just increase the size of the preconditioner with gpytorch.settings.max_preconditioner_size. For example:
# The first part of your notebook goes here
...
import torch
import gpytorch

with gpytorch.settings.max_preconditioner_size(15):
    mlls = []
    for i in range(10):
        mll_i = mll(output, mll.model.train_targets).item()
        print(mll_i)
        mlls.append(mll_i)
    print(torch.tensor(mlls).std())  # Outputs 0.0055
Two things to note:
Just for completeness, there are a number of other parameters that control the various numerical trade-offs. gpytorch.settings.max_cg_iterations and gpytorch.settings.max_lanczos_quadrature_iterations make solves and log dets more accurate, respectively, while gpytorch.settings.num_trace_samples reduces variance:
import torch
import gpytorch

with gpytorch.settings.num_trace_samples(50):
    mlls = []
    for i in range(10):
        mll_i = mll(output, mll.model.train_targets).item()
        print(mll_i)
        mlls.append(mll_i)
    print(torch.tensor(mlls).std())  # Outputs 0.04
Hah, it seems that Geoff and I replied concurrently :-)
Yeah, so regarding values vs. gradients: if we're using L-BFGS to optimize this, then the actual function value does matter, as it's used extensively in the optimization to perform line searches.
An interesting question then is, is there a way of progressively increasing the approximation quality as we're getting closer to the optimum? Maybe using a callback from the solver to set that value?
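One rough sketch of that idea (the toy data, model, L-BFGS setup, and coarse-to-fine schedule below are all made up for illustration; nothing like this ships with gpytorch) would be to tighten num_trace_samples between L-BFGS stages:

import torch
import gpytorch

# Toy data and model, purely for illustration
train_x = torch.linspace(0, 1, 100)
train_y = torch.sin(6 * train_x) + 0.1 * torch.randn(100)

class ToyGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ToyGP(train_x, train_y, likelihood)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
model.train()
likelihood.train()

optimizer = torch.optim.LBFGS(model.parameters(), max_iter=20)

# Coarse-to-fine schedule for the trace estimator (the particular numbers are arbitrary)
for num_samples in (10, 50, 200):
    def closure():
        optimizer.zero_grad()
        with gpytorch.settings.num_trace_samples(num_samples):
            output = model(train_x)
            loss = -mll(output, train_y)
            loss.backward()
        return loss
    optimizer.step(closure)
    print(num_samples, closure().item())

A callback-style version would do the same thing, just letting the solver decide when to move to the next stage (e.g. once the line search starts stalling).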
@jacobrgardner Thanks, the max preconditioner size does seem to help a lot.
I thought there was a noise nugget added to the diagonal to keep things well-conditioned independently of likelihood.log_noise. I take it that isn't the case, so we should make sure it stays a bit away from 0?
We add jitter for variational inference so that there's a diagonal component at all (which is a requirement for our preconditioning strategy... For now :-), but we don't add jitter for exact GPs because that effectively enforces a minimum noise level.
I suspect that when using softplus you'll never end up with noise levels on the order of exp(-9) anyways, but a smoothed box prior would also prevent this.
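For reference, something along these lines should keep the noise bounded away from zero; the specific bounds here are made up, and the exact keyword arguments may differ across gpytorch versions:

import gpytorch

# Prior and hard constraint that keep the noise from collapsing toward zero
likelihood = gpytorch.likelihoods.GaussianLikelihood(
    noise_prior=gpytorch.priors.SmoothedBoxPrior(1e-4, 1.0),
    noise_constraint=gpytorch.constraints.GreaterThan(1e-4),
)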
Great, thanks.
We're seeing a rather large amount of variance in the value of the MLL with an ExactGP, obtained while holding the model and data fixed. Here is a repro case in which we get changes in the MLL of up to 30% across evaluations:
The output is:
With this particular model, when the parameters are left initialized at 0 (comment out loading the state dict), the MLL is -1.7, so this variance is well above the tolerance with which we'd like to optimize and makes it challenging to use a non-stochastic optimizer.
I'm guessing this variance comes from the approximations being used in estimating the MLL? Is there a parameter we can use to tune the fidelity of those approximations to improve the optimization?