srinathdama opened this issue 3 years ago
A priori this isn't that surprising to me, although you do want to use conjugate gradients when using the grid kernel to be able to exploit the structure of the kernel itself.
In your example, you have 50^2 = 2500 data points, so without the grid kernel you need to take a Cholesky decomposition of a 2500 x 2500 matrix. If you instead use the grid kernel with conjugate gradients, you don't need to perform this decomposition, because the Kronecker structure of the kernel matrix is exploited (which is not the case when computing the Cholesky).
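Concretely, the structure CG exploits is that a Kronecker matrix-vector product never requires forming the full 2500 x 2500 matrix. Here is a small numpy sketch of that identity (illustrative only, not gpytorch internals):

```python
import numpy as np

# (A ⊗ B) vec(V) = vec(B V A^T): the matvec CG needs, done with two
# 50 x 50 matmuls instead of a dense 2500 x 2500 product.
rng = np.random.default_rng(0)
n = 50
K1 = rng.normal(size=(n, n))
K2 = rng.normal(size=(n, n))
v = rng.normal(size=n * n)

V = v.reshape(n, n, order="F")                   # v = vec(V), column-major
fast = (K2 @ V @ K1.T).reshape(-1, order="F")    # two small matmuls
slow = np.kron(K1, K2) @ v                       # dense Kronecker matmul
assert np.allclose(fast, slow)
```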
More specifically, why do you want to use a cholesky decomposition here?
The inverse of the gram matrix (K) can be computed without unpacking the Kronecker product, i.e. the grid-based implementation ideally only needs the SVDs of the two 50 x 50 matrices instead of one 2500 x 2500 matrix, thereby significantly decreasing the computational complexity [references: Wilson et al., Ch. 5 of Saatci's PhD thesis].
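For reference, here is a small numpy sketch of that direct solve (not gpytorch code; the RBF toy kernel and noise value are just illustrative): eigendecompose the two 50 x 50 factors and apply the Kronecker identities, never forming the 2500 x 2500 matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(size=(n, 1))
x2 = rng.uniform(size=(n, 1))
rbf = lambda x: np.exp(-0.5 * (x - x.T) ** 2)
K1, K2 = rbf(x1), rbf(x2)              # the two 50 x 50 Kronecker factors
noise = 0.1
y = rng.normal(size=n * n)

# Eigendecompose only the small factors; K1 ⊗ K2 + noise*I then has
# eigenvectors Q1 ⊗ Q2 and eigenvalues kron(l1, l2) + noise.
l1, Q1 = np.linalg.eigh(K1)
l2, Q2 = np.linalg.eigh(K2)

# Every (A ⊗ B) matvec is computed as vec(B V A^T) with two small matmuls.
Y = y.reshape(n, n, order="F")
T = Q2.T @ Y @ Q1                      # (Q1 ⊗ Q2)^T y
T = T / (np.outer(l2, l1) + noise)     # divide by the eigenvalues + noise
X = Q2 @ T @ Q1.T                      # apply (Q1 ⊗ Q2)
x = X.reshape(-1, order="F")

# Check against the dense 2500 x 2500 system.
K = np.kron(K1, K2) + noise * np.eye(n * n)
assert np.allclose(K @ x, y)
```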
When I was using conjugate gradients with the grid kernel on another data set that I have, I was observing convergence issues even after increasing the CG iterations/tolerance, so I thought of using another method that already exists in gpytorch for finding the inverse, namely Cholesky decomposition. Now I realize that in the papers I cited above, they find the inverse of K by using the SVDs of the individual matrices in the Kronecker product rather than a Cholesky decomposition. I am wondering whether this is the reason the Cholesky path is unable to exploit the Kronecker structure. Even then, I feel the grid kernel method should not take more time than the default kernel method.
I'll look into this some more; my original response might have been a bit incorrect. Looking at the code, GridKernel should return a KroneckerProductLazyTensor, which should use efficient Kronecker solves as you pointed out.
Okay, this reproduces and gives timings on my laptop of:

- ~2.4 s/it for cholesky + grid
- ~0.008 s/it for non-cholesky + grid
- ~1.4 s/it for cholesky + non-grid
- ~0.13 s/it for non-cholesky + non-grid
Interestingly enough, these correspond to the following settings:
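Roughly (a sketch; the grid vs. non-grid axis is just whether the model uses GridKernel, and I'm assuming the Cholesky axis is toggled via fast_computations / max_cholesky_size):

```python
import gpytorch

# "cholesky" rows: turn off the CG/Lanczos fast paths and raise the size
# threshold below which gpytorch is willing to fall back to a dense Cholesky.
with gpytorch.settings.fast_computations(covar_root_decomposition=False,
                                         log_prob=False, solves=False), \
     gpytorch.settings.max_cholesky_size(10_000):
    pass  # run the training step here

# "non-cholesky" rows: just the library defaults (CG/Lanczos-based solves).
# "grid" vs "non-grid" rows: whether the covariance module wraps the base
# kernel in gpytorch.kernels.GridKernel or uses the base kernel directly.
```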
I manually checked, and it turns out that non-cholesky + grid (ultimately the default settings) does actually cause the Kronecker solves to get called. The reason it's so much faster is that we exploit Kronecker algebra and only have to call symeig on two small matrices.
Forcing the maximum cholesky size to be very large overrides our Kronecker algebra internally -- @Balandat maybe we should make this clearer in the settings?
Kind of related, but note that currently some functions will run slow with fast_computations off. In particular, since KroneckerProductAddedDiagLazyTensor doesn't override _cholesky or inherit from a lazy tensor that does (like KroneckerProductLazyTensor), this'll cause problems with InvMatmul, which currently just explicitly calls lazy_tsr.cholesky and doesn't care that you've overridden root_decomposition.
So we'd get really slow behavior with fast_computations off when we go to compute predictions either way, I think.
I wonder if a solution here might be to refactor lazy tensors (or linear operators) to have _iterative_solve and _direct_solve, so that it's obvious and intuitive in all situations exactly what is happening. Then "fast computations" (which feels a bit preachy anyways) could be refactored into a setting that represents what it actually is: should we do solves using an iterative method or a direct method? If a direct method is chosen, we'll still always do it the best way we can.
Right now, I feel like there are a lot of gotchas, and different functions have different behaviors under different settings (e.g., inv_matmul could currently be slow even when inv_quad_logdet is fast).
@wjmaddox, thanks for checking the issue! I will be more cautious from now on when disabling fast_computations.
@jacobrgardner, I agree with your suggestion to refactor the code so that it is easy to tell whether a direct or an iterative method is being used. For the time being, I am using an existing grid-based direct method implemented with scipy.
Thanks again for making this cool software open source!
Problem
I am trying to use grid kernels instead of the default kernels to speed up GPR training. My understanding is that the grid kernel implementation exploits Kronecker (tensor) algebra and drastically reduces the computational complexity, because the Cholesky decomposition only needs to be applied to the individual matrices in the Kronecker product. I have disabled fast computations using the context manager so that Cholesky decomposition is used instead of conjugate gradients. I am observing that the time taken for each training step with the grid kernel is significantly higher than with the default kernel when fast computations are disabled. When I enable fast computations (the default settings), the computational time is lower with the grid kernel, as expected.
It would be helpful if someone could point me to how to use Cholesky decomposition and still get a speed-up while using the grid kernel.
Code to reproduce
The code below is taken from the Grid_GP_Regression tutorial. I am using the same data to compare the computational times.
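Roughly, the setup looks like the sketch below (a simplified version of the tutorial, with the train_GPR / chol_flag wrapper I use for timing; exact API details may differ across gpytorch versions):

```python
import math
import time
import torch
import gpytorch

# 50 x 50 grid => 2500 training points
grid_size = 50
grid = [torch.linspace(0, 1, grid_size), torch.linspace(0, 2, grid_size)]
train_x = gpytorch.utils.grid.create_data_from_grid(grid)
train_y = torch.sin(2 * math.pi * (train_x[:, 0] + train_x[:, 1]))
train_y = train_y + 0.01 * torch.randn_like(train_y)

class GridGPRegressionModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, use_grid=True):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        base_kernel = gpytorch.kernels.RBFKernel()
        if use_grid:
            self.covar_module = gpytorch.kernels.GridKernel(base_kernel, grid=grid)
        else:
            self.covar_module = base_kernel  # "default kernel" comparison

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

def train_GPR(chol_flag, use_grid=True, n_iter=10):
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model = GridGPRegressionModel(train_x, train_y, likelihood, use_grid)
    model.train()
    likelihood.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

    # chol_flag=True disables fast computations so Cholesky-based solves are used.
    fast = not chol_flag
    with gpytorch.settings.fast_computations(fast, fast, fast):
        for i in range(n_iter):
            start = time.time()
            optimizer.zero_grad()
            loss = -mll(model(train_x), train_y)
            loss.backward()
            optimizer.step()
            print(f"iter {i + 1}: loss = {loss.item():.3f}, "
                  f"{time.time() - start:.3f} s/it")

train_GPR(chol_flag=True)   # Cholesky path
train_GPR(chol_flag=False)  # default fast (CG) path
```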
Output with fast computations disabled (chol_flag = True when calling train_GPR):

Output with fast computations enabled (chol_flag = False when calling train_GPR):