cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch
MIT License

[Bug] Diag() on MatmulLazyTensor using RootLazyTensors is expensive #847

Open wjmaddox opened 5 years ago

wjmaddox commented 5 years ago

🐛 Bug

To reproduce

Code snippet to reproduce

import time

import torch
import gpytorch

a = torch.randn(1761, 1761)
b = torch.randn(1761, 1761)

a_root = gpytorch.lazy.RootLazyTensor(a)
mm2 = gpytorch.lazy.MatmulLazyTensor(a_root, gpytorch.lazify(b))

start = time.time()
mm2.diag()
end = time.time() - start
# This takes 84.2 seconds.

((a @ a.t()) @ b).diag()
# This takes 0.2 seconds.

By comparison, using matrices explicitly takes about 0.2 seconds on a CPU.

Stack trace/error message

I've done my best to track this down: .diag() seems to call MatmulLazyTensor's _get_indices method, which expands out the matrices (see here). The call then goes into RootLazyTensor's _get_indices method, which expands the sizes to be quite large: e.g., right_tensor has shape torch.Size([1, 1761, 1761]) and left_tensor has shape torch.Size([1761, 1, 1761]).

See the source here
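To illustrate the blow-up (a toy sketch; the broadcast shapes are taken from above, but the exact internals are my inference):

import torch

# At n = 1761 the broadcasted intermediate below has n**3 ~ 5.5e9 elements
# (roughly 22 GB in float32), even though only the diagonal is wanted.
n = 4  # toy stand-in for 1761
left = torch.randn(n, 1, n)   # analogous to left_tensor: (1761, 1, 1761)
right = torch.randn(1, n, n)  # analogous to right_tensor: (1, 1761, 1761)
out = (left * right).sum(-1)  # materializes an (n, n, n) tensor first
print(out.shape)              # torch.Size([4, 4])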

Expected Behavior

I think MatmulLazyTensor could evaluate itself with matmuls against torch.eye, but it's not clear whether that would actually be faster, or what the right fix should be.
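For reference, the diagonal of a product can be computed without forming the product at all, via diag(AB)_i = sum_k A[i,k] B[k,i]. A minimal sketch of that identity (not a proposal for the lazy-tensor internals, just the dense math):

import torch

def matmul_diag(A, B):
    # diag(A @ B) in O(n^2) time/memory instead of the O(n^3) full product
    return (A * B.t()).sum(dim=-1)

A = torch.randn(1761, 1761, dtype=torch.float64)
B = torch.randn(1761, 1761, dtype=torch.float64)
assert torch.allclose(matmul_diag(A, B), (A @ B).diag())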


Additional context

See above.

wjmaddox commented 5 years ago

Attempting to decompose the matrix multiplication as

mm3 = gpytorch.lazy.MatmulLazyTensor(a_root.root, gpytorch.lazy.MatmulLazyTensor(a_root.root.t(), mm2.right_lazy_tensor))
mm3.diag()

seems to be similarly computationally intensive, suggesting the problem could be in the indexing?

jacobrgardner commented 5 years ago

The problem here is that MatmulLazyTensor falls back to the default diag behavior when its constituent tensors aren't NonLazyTensors, because in general evaluating lazy tensors can make things slower.

With that said, you are creating a full-rank RootLazyTensor, for which operations are unlikely to be faster than just creating a NonLazyTensor out of a @ a.t() -- is there a reason you are doing this? If not, I'd recommend making NonLazyTensors out of a @ a.t() and b, and then building a MatmulLazyTensor out of those two.
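Concretely, a sketch of that workaround, reusing a and b from the snippet in the issue body:

import torch
import gpytorch

a = torch.randn(1761, 1761)
b = torch.randn(1761, 1761)

# Wrap the evaluated product directly rather than a full-rank RootLazyTensor;
# with both operands non-lazy, diag() should avoid the slow indexing path.
aat = gpytorch.lazy.NonLazyTensor(a @ a.t())
mm = gpytorch.lazy.MatmulLazyTensor(aat, gpytorch.lazify(b))
mm.diag()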

wjmaddox commented 5 years ago

I noticed the issue while computing predictive variances in Bayesian linear regression (which expands into parameter space); the RootLazyTensor part seems to be a distraction from the underlying problem.

The actual code is pretty much like:

import torch
import gpytorch

# not actually a NonLazyTensor in the real code, but lazified for demonstration purposes
b = gpytorch.lazify(torch.randn(1000, 1761))

# build something that looks like I - SS'
a_root = gpytorch.lazy.RootLazyTensor(torch.randn(1761, 90))
a_sub_part = (-1 * a_root).add_jitter(1.)

# predictive covariance is b (I - SS') b'
covar = b @ (a_sub_part @ b.t())

# get the diagonal of the matrix
covar.diag()

Given that my actual number of features is considerably larger than 1761, it would be nice to not have to evaluate the p x p matrix if possible.

jacobrgardner commented 5 years ago

If you want to do solves with S'S + I, where S is n x d, you can do (S'S + I)^{-1} = I - S'(I_{n} + SS')^{-1}S, which lets you solve an n x n system instead of a d x d system. You can use identities like this for both the predictive mean and variance.
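A quick numerical check of that identity in plain torch (small, hypothetical sizes):

import torch

n, d = 50, 300
S = torch.randn(n, d, dtype=torch.float64)
I_n = torch.eye(n, dtype=torch.float64)
I_d = torch.eye(d, dtype=torch.float64)

# (S'S + I_d)^{-1} == I_d - S'(I_n + SS')^{-1} S  (Woodbury identity)
lhs = torch.inverse(I_d + S.t() @ S)
rhs = I_d - S.t() @ torch.inverse(I_n + S @ S.t()) @ S
assert torch.allclose(lhs, rhs)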

If both n and d are very large and roughly equal, then you are just generally in trouble / unlikely to get much faster speed out of linear regression than any other GP without using variational inference or other methods for stochastic optimization.

wjmaddox commented 5 years ago

Yes, I know that. In this particular situation, although d >> n, it is the case that computing SS'v is faster than S'Sv. Here S corresponds to the Jacobian matrix of a neural net (edit: accessed implicitly, of course).

A less memory-intensive solution is to split the covariance matrix:

covar2 = gpytorch.lazy.RootLazyTensor(b) + -1 * gpytorch.lazy.RootLazyTensor(b @ a_root.root)
covar2.diag()
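And a sanity check that the split form matches the dense computation (a sketch with hypothetical smaller shapes, b_mat and s standing in for b and a_root.root):

import torch
import gpytorch

b_mat = torch.randn(100, 200, dtype=torch.float64)
s = torch.randn(200, 30, dtype=torch.float64)

covar2 = gpytorch.lazy.RootLazyTensor(b_mat) + -1 * gpytorch.lazy.RootLazyTensor(b_mat @ s)

# dense reference: b (I - SS') b'
dense = b_mat @ (torch.eye(200, dtype=torch.float64) - s @ s.t()) @ b_mat.t()
assert torch.allclose(covar2.diag(), dense.diag())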