Closed · jimmyrisk closed this issue 1 year ago
A couple of thoughts:
All of the behavior that you report is somewhat expected. 5x slower seems like a bit more than what I've seen in the past, but you are using a composite kernel which generally slows down the performance of CG.
Once you scale up beyond n=1000, you will notice that CG starts becoming faster than Cholesky. See what happens when I bump up the data size in your experiment (and set the preconditioner size to 1000):
```
Data shape: torch.Size([7050, 3])
Time with fast computations off: 7.5905442237854
Cholesky loss (trial 1): -2.459498167037964
Time with fast computations on: 3.1095833778381348
Fast computations loss (trial 1): -2.460662841796875
Fast computations loss (trial 2): -2.4602770805358887
Fast computations loss (trial 3): -2.460109233856201
```
The difference between Cholesky and the CG-based approaches is due to 1) the stochasticity of the CG approach, and 2) the bias that's introduced by CG (see this paper). We find that these differences can be significant if you're running for very few CG iterations, but usually don't have a huge impact when you let CG run for a while.
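The stochasticity mentioned in 1) comes from estimating trace terms (e.g. in the log-determinant of the marginal log likelihood) with random probe vectors rather than computing them exactly. A minimal numpy sketch of Hutchinson's stochastic trace estimator, the basic idea behind this kind of randomness (illustrative only, not GPyTorch's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# A small symmetric PSD matrix standing in for a kernel matrix.
n = 50
X = rng.standard_normal((n, n))
A = X @ X.T / n + np.eye(n)

def hutchinson_trace(A, num_probes, rng):
    """Estimate tr(A) as the average of z^T A z over Rademacher probes z."""
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=A.shape[0])
        total += z @ A @ z
    return total / num_probes

exact = np.trace(A)
few = hutchinson_trace(A, 5, rng)      # noisy with only a few probes
many = hutchinson_trace(A, 5000, rng)  # concentrates around the exact trace
```

With few probes the estimate fluctuates from run to run, which is why repeated `mll` evaluations under the stochastic path give slightly different values; raising the number of samples tightens the estimate but does not remove the run-to-run variation entirely.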
The data that you are simulating seem to produce an ill-conditioned kernel matrix that does not respond well to the default pivoted Cholesky preconditioner. I think this could be due to the low dimensionality and the limited amount of observational noise.
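The conditioning point can be seen directly: the observational noise acts as a diagonal shift on the kernel matrix, so less noise means a worse-conditioned system and more CG iterations to reach a given tolerance. A rough numpy sketch, assuming a plain RBF kernel and unpreconditioned CG (not GPyTorch's preconditioned solver):

```python
import numpy as np

def rbf_kernel(x, lengthscale=1.0):
    """RBF Gram matrix for 1-D inputs."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def cg_iterations(A, b, tol=1e-6, max_iter=2000):
    """Plain conjugate gradients; returns iterations needed to reach tol."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for i in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            return i + 1
        p = r + (rs_new / rs) * p
        rs = rs_new
    return max_iter

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=200))
K = rbf_kernel(x)
y = rng.standard_normal(200)

# Smaller noise -> smaller diagonal shift -> worse conditioning -> more iterations.
iters_low_noise = cg_iterations(K + 1e-4 * np.eye(200), y)
iters_high_noise = cg_iterations(K + 1e-1 * np.eye(200), y)
```

The noise variance here plays the same role as the likelihood noise in the GP: it bounds the smallest eigenvalue of the system being solved, which is what CG's convergence rate depends on.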
All in all, the CG-based inference code is among the best-tested and best-maintained parts of our codebase, so I don't think this behavior represents a bug. However, I will contend that we should do the following:

- Raise the default `max_cholesky_size` to something much larger (e.g. n=2000 or n=4000), at which point CG starts to become much more advantageous.
- Adjust the `fast_computations` defaults to something a bit more objective.

@jacobrgardner thoughts?
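The proposal above amounts to routing solves by problem size: exact Cholesky below a threshold, iterative CG above it. A hypothetical sketch of that dispatch logic in plain numpy (the threshold name mirrors GPyTorch's `max_cholesky_size` setting; the routine itself is illustrative, not GPyTorch's code):

```python
import numpy as np

MAX_CHOLESKY_SIZE = 2000  # hypothetical raised default, per the suggestion above

def solve_spd(A, b, max_cholesky_size=MAX_CHOLESKY_SIZE):
    """Solve A x = b for SPD A: exact Cholesky below the size threshold,
    conjugate gradients above it."""
    n = A.shape[0]
    if n <= max_cholesky_size:
        L = np.linalg.cholesky(A)
        return np.linalg.solve(L.T, np.linalg.solve(L, b))
    # CG path for large n
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(10 * n):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < 1e-10:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
n = 100
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

x_chol = solve_spd(A, b, max_cholesky_size=2000)  # takes the Cholesky branch
x_cg = solve_spd(A, b, max_cholesky_size=10)      # forces the CG branch
```

On a well-conditioned system both branches agree to high precision; the debate in this thread is only about where the crossover threshold should sit.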
I've run into a similar issue. I'm seeing much faster predictions when `fast_computations` is turned off. Here's a minimal test case to reproduce the issue:
```python
import gpytorch
import torch
import time

# Test data
torch.manual_seed(0)
train_x = torch.randn(4000, 2)
train_y = torch.randn(4000)
test_x = torch.randn(5100, 2)

# Construct model
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

model = ExactGPModel(train_x, train_y, gpytorch.likelihoods.GaussianLikelihood())

# Train model
model.train()
model.likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(model.likelihood, model)
for i in range(50):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    optimizer.step()

# Time predictions with fast computations on (default)
model.eval()
start_time = time.time()
preds = model.likelihood(model(test_x))
print(time.time() - start_time)

# Time predictions with fast (CG-based) solves turned off
start_time = time.time()
with gpytorch.settings.fast_computations(solves=False):
    preds = model.likelihood(model(test_x))
print(time.time() - start_time)
```
With fast computations turned off, predictions are calculated approximately 15 times faster. Similarly to @jimmyrisk's example, this one uses a small number of dimensions (2), but a larger number of training points (4000).
Hmm okay maybe we should re-think some of our conditions for when we switch to using CG.
I'm going to open up another issue to document this.
Ahhh @laurence-kobold, part of the problem is that you are not using the `fast_pred_var` context manager (when fast solves are off). Using it makes the two times much more comparable, but we should still adjust our internal logic for when we use Cholesky versus when we use iterative methods.
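The reason a prediction-time cache matters is that the expensive training-covariance solve can be done once and reused for every test point, instead of being repeated per prediction call. A minimal numpy sketch of that caching idea for exact GP prediction (this illustrates the concept only; GPyTorch's `fast_pred_var` cache uses different machinery):

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """RBF cross-covariance for 1-D inputs."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(0)
train_x = np.sort(rng.uniform(0, 5, 200))
train_y = np.sin(train_x) + 0.1 * rng.standard_normal(200)
noise = 0.01  # likelihood noise variance

K = rbf(train_x, train_x) + noise * np.eye(200)

# Cache: one Cholesky factorization and one weight vector, computed once.
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, train_y))

def predict(test_x):
    """Posterior mean/variance reusing the cached factor L and weights alpha;
    no new factorization per call, only triangular solves."""
    Ks = rbf(train_x, test_x)
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = rbf(test_x, test_x).diagonal() - (v * v).sum(axis=0)
    return mean, var

mean, var = predict(np.linspace(0, 5, 50))
```

Without such a cache, each prediction batch pays for fresh solves against the training covariance, which is consistent with the large timing gap reported above.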
🐛 Bug
I am finding that including

`with gpytorch.settings.fast_computations(covar_root_decomposition=False, log_prob=False, solves=False):`

unexpectedly improves runtime by 5x (and produces a different MLL value). I will provide the full reproducible code at the bottom, but here is a rough explanation of what I am encountering. For reference, `train_x` is 1050x3 and `train_y` is 1050x1.

Normal Settings
With gpytorch.settings.fast_computations(covar_root_decomposition=False, log_prob=False, solves=False):
Differences in mll
Expected Behavior
As documented, `gpytorch.settings.fast_computations(covar_root_decomposition=False, log_prob=False, solves=False)` utilizes Cholesky decompositions, which in turn (I believe) are supposed to increase accuracy at the expense of increased runtime.

System information

Please complete the following information:
Additional context
The purpose of this simulation is to generate a training set akin to the one we use in our mortality modelling research (hence age, year, cohort), pick a plausible kernel, and then simulate synthetic mortality y's from the prior, after which we try to recover the plausible kernel by comparing likelihoods.
We noticed varying mll computations and tried a few fixes as documented here:
Eventually, I tried `gpytorch.settings.fast_computations(covar_root_decomposition=False, log_prob=False, solves=False)`, which, to my astonishment, not only ran much faster, but produced different mll results even when trying massive values for other settings (e.g. `with gpytorch.settings.num_trace_samples(1050)` and `with gpytorch.settings.max_preconditioner_size(1050)`). This was also tested with/without `torch.backends.cuda.matmul.allow_tf32 = True`, as recommended in https://github.com/cornellius-gp/gpytorch/issues/1960.

Questions
Which mll should be trusted: the one computed with Cholesky (i.e. `gpytorch.settings.fast_computations(covar_root_decomposition=False, log_prob=False, solves=False)`), or the one with normal settings (with large `trace_samples` and `preconditioner_size`)?

Full code for reproducibility
Sorry for the massive amount of code, but I knew this example specifically gives the error, so I tried to make it self-contained.
Thanks!