wjmaddox opened this issue 5 years ago
Attempting to decompose the matrix multiplication as

```python
mm3 = gpytorch.lazy.MatmulLazyTensor(a_root.root, gpytorch.lazy.MatmulLazyTensor(a_root.root.t(), mm2.right_lazy_tensor))
mm3.diag()
```

seems to be similarly computationally intensive, which suggests the problem could be with indexing?
The problem here is that it's using the default diag behavior when the tensors in a MatmulLazyTensor aren't NonLazyTensors, because in general evaluating them can make things slower.
With that said, you are creating a full-rank RootLazyTensor, for which operations are unlikely to be faster than just creating a NonLazyTensor out of `a @ a.t()` -- is there a reason you are doing this? In that case I'd probably recommend making a NonLazyTensor out of `a @ a.t()` and `b`, and then making a MatmulLazyTensor out of those two (see the sketch below).
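A minimal sketch of that suggestion, assuming `a` is the tall root matrix and `b` is the dense factor from the snippet further down (shapes illustrative):

```python
import torch
import gpytorch

a = torch.randn(1761, 90)
b = torch.randn(1000, 1761)

# wrap the dense product directly instead of building a full-rank RootLazyTensor
aat = gpytorch.lazy.NonLazyTensor(a @ a.t())
b_lazy = gpytorch.lazy.NonLazyTensor(b)
mm = gpytorch.lazy.MatmulLazyTensor(b_lazy, aat)  # b @ (a a'), kept lazy
```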
I noticed the issue while computing predictive variances in Bayesian linear regressions (which expand into parameter space), where the root lazy tensor part is something of a distraction. The actual code is pretty much like:
```python
import torch
import gpytorch

# not actually a NonLazyTensor in practice, but for demonstration purposes
b = gpytorch.lazify(torch.randn(1000, 1761))
# get something that looks like I - SS'
a_root = gpytorch.lazy.RootLazyTensor(torch.randn(1761, 90))
a_sub_part = (-1 * a_root).add_jitter(1.)
# predictive covariance is b (I - SS') b'
covar = b @ (a_sub_part @ b.t())
# get diagonal of the matrix
covar.diag()
```
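For reference, here is a dense-tensor sketch of what the snippet above computes; `S` and `B` are illustrative stand-ins for `a_root.root` and `b`:

```python
import torch

S = torch.randn(1761, 90)     # stands in for a_root.root
B = torch.randn(1000, 1761)   # stands in for b

# dense equivalent of b @ (a_sub_part @ b.t())
covar_dense = B @ (torch.eye(1761) - S @ S.t()) @ B.t()
covar_dense.diag()
```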
Given that my actual number of features is considerably larger than 1761, it would be nice to not have to evaluate the p x p matrix if possible.
If you want to do solves with `S'S + I`, where `S` is `n x d`, you can use `(S'S + I)^{-1} = I - S'(I_n + SS')^{-1}S`, which lets you solve an `n x n` system instead of a `d x d` system. You can use identities like this for both the predictive mean and variance.
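A quick numerical check of that identity in plain PyTorch, at small sizes:

```python
import torch

n, d = 5, 8
S = torch.randn(n, d)

lhs = torch.inverse(S.t() @ S + torch.eye(d))
rhs = torch.eye(d) - S.t() @ torch.inverse(torch.eye(n) + S @ S.t()) @ S
assert torch.allclose(lhs, rhs, atol=1e-5)
```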
If both `n` and `d` are very large and roughly equal, then you are just generally in trouble / unlikely to get much faster speed out of linear regression than any other GP without using variational inference or other methods for stochastic optimization.
Yes, I know that. In this particular situation, although `d >> n`, it is the case that `SS'v` is faster than `S'Sv`. Here `S` corresponds to the Jacobian matrix of a neural net (edit: obviously, I'm accessing it implicitly).
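For what it's worth, one way to realize `SS'v` implicitly with autograd looks like the sketch below; the network, data, and `v` are hypothetical placeholders, and the JVP uses the standard double-backward trick:

```python
import torch

net = torch.nn.Linear(10, 1)      # placeholder network
x = torch.randn(100, 10)          # n = 100 data points
v = torch.randn(100)              # v lives in output (n) space

params = tuple(net.parameters())
out = net(x).squeeze(-1)          # shape (n,)

# S' v: one vector-Jacobian product (a weighted backward pass)
stv = torch.autograd.grad(out, params, grad_outputs=v, retain_graph=True)

# S (S' v): a Jacobian-vector product via the double-backward trick
dummy = torch.zeros_like(out, requires_grad=True)
g = torch.autograd.grad(out, params, grad_outputs=dummy, create_graph=True)
sstv = torch.autograd.grad(g, dummy, grad_outputs=stv)[0]  # shape (n,)
```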
A less memory-intensive solution is to split the covariance matrix:

```python
covar2 = gpytorch.lazy.RootLazyTensor(b) + -1 * gpytorch.lazy.RootLazyTensor(b @ a_root.root)
covar2.diag()
```
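This works out cheaply because the diagonal of a RootLazyTensor reduces to row-wise squared norms, so nothing `p x p` is ever formed. A dense sketch of what `covar2.diag()` computes (names hypothetical):

```python
import torch

B = torch.randn(1000, 1761)   # stands in for b
S = torch.randn(1761, 90)     # stands in for a_root.root

# diag(B B') - diag((B S)(B S)') = diag(B (I - S S') B')
diag2 = (B ** 2).sum(-1) - ((B @ S) ** 2).sum(-1)
```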
🐛 Bug
To reproduce
Code snippet to reproduce
By comparison, using matrices explicitly takes about 0.2 seconds on a CPU.
Stack trace/error message
I've done my best to track this down: `.diag()` seems to call MatmulLazyTensor's `_get_indices` method, which expands out the matrices (see here). The call then goes into RootLazyTensor's `_get_indices` method, which expands the sizing to be quite large -- e.g. `right_tensor` has shape `torch.Size([1, 1761, 1761])` and `left_tensor` has shape `torch.Size([1761, 1, 1761])`. Presumably multiplying those elementwise broadcasts to on the order of 1761³ ≈ 5.5 billion elements, which would explain the cost. See the source here.
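A small shape-only demonstration of why that broadcast is expensive (last dimension shrunk for illustration; in the real case it is 1761 as well):

```python
import torch

left = torch.randn(1761, 1, 10)
right = torch.randn(1, 1761, 10)
print((left * right).shape)   # torch.Size([1761, 1761, 10])
```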
Expected Behavior
I think MatmulLazyTensor should evaluate itself with matmuls against `torch.eye`, but it's not clear whether that would actually be faster, so it's not obvious what the fix should be.
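If it helps, one matmul-style identity that avoids both `torch.eye` and the indexing path: for `L` of shape `n x k` and `R` of shape `k x n`, `diag(L @ R)` is just an elementwise product and a sum. A dense sketch:

```python
import torch

n, k = 1000, 90
L = torch.randn(n, k)
R = torch.randn(k, n)

diag_fast = (L * R.t()).sum(dim=-1)   # O(nk) memory, no n x n matrix
assert torch.allclose(diag_fast, (L @ R).diag(), atol=1e-4)
```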
System information
Please complete the following information:
Additional context
See above.