cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch
MIT License

Inverse & Root for KroneckerLazyVar #103

Closed · ssydasheng closed this issue 6 years ago

ssydasheng commented 6 years ago

For a KroneckerLazyVariable, where you can easily get the exact inverse and root, why do you still go for conjugate gradients & Lanczos? These seem slower to me because the true size of the matrix is really big, so after a small number of iterations the solution might still be far from a good one.
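
A minimal sketch of the identities I have in mind, in plain PyTorch (illustrative helper names, not the gpytorch API):

```python
import torch

# (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}, and if A = L_A L_A^T and B = L_B L_B^T,
# then (L_A ⊗ L_B)(L_A ⊗ L_B)^T = A ⊗ B by the mixed-product property.

def kron_inverse(A, B):
    # Invert the small factors instead of the full (nm x nm) product.
    return torch.kron(torch.linalg.inv(A), torch.linalg.inv(B))

def kron_root(A, B):
    # Cholesky roots of the factors give an exact root of the product.
    return torch.kron(torch.linalg.cholesky(A), torch.linalg.cholesky(B))
```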

jacobrgardner commented 6 years ago

The primary reason is that a KroneckerLazyVariable can represent the Kronecker product of other LazyVariables, which we might not want to evaluate or invert explicitly. This is true, for example, when doing KISS-GP with Kronecker structure on multiple dimensions (https://github.com/cornellius-gp/gpytorch/blob/master/examples/kissgp_kronecker_product_regression.ipynb).
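
For reference, the MVM itself never needs the full product either; a sketch of the usual reshape trick, assuming dense square factors (gpytorch's LazyVariables would supply the per-factor matmuls instead):

```python
import torch

def kron_mvm(factors, v):
    # Compute (K_1 ⊗ ... ⊗ K_d) v without forming the full product:
    # reshape v into a d-way tensor and multiply each mode by its factor.
    sizes = [K.shape[0] for K in factors]
    x = v.reshape(sizes)
    for i, K in enumerate(factors):
        x = torch.movedim(x, i, -1)  # bring mode i to the last axis
        x = x @ K.T                  # multiply that mode by K
        x = torch.movedim(x, -1, i)
    return x.reshape(-1)
```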

It's true that if a KroneckerLazyVariable were over only NonLazyVariables, we could probably avoid CG and Lanczos.

Do you have an example use case where exploiting properties of the Kronecker product other than for MVMs would be useful? We briefly discussed it and are open to the idea; we just can't think of what we'd gain, or how we'd keep the structure of the sub lazy variables intact.

jacobrgardner commented 6 years ago

It may also just generally be hard to justify using inversion algorithms other than CG at this point, even when they are available. CG typically gives fairly good solves in a very small number of iterations, making it a compelling choice even when an exact inverse is possible. For example, even for exact GPs, k iterations of CG require O(kn^2) time, compared to the O(n^3) time of a Cholesky decomposition.

Even without preconditioning, k can often be very small, and it depends more on the conditioning of the matrix and the clustering of its eigenvalues than on the size of the matrix.
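
To make the access pattern concrete, here is a minimal unpreconditioned CG sketch, where `mv` is any black-box MVM (dense, Kronecker, or otherwise):

```python
import torch

def conjugate_gradients(mv, b, max_iter, tol=1e-10):
    # Solve K x = b using only matrix-vector products with K.
    # Each iteration costs one MVM: O(n^2) for dense K, less with structure.
    x = torch.zeros_like(b)
    r = b.clone()   # residual b - K x (x starts at zero)
    p = r.clone()   # search direction
    rs = r @ r
    for _ in range(max_iter):
        Kp = mv(p)
        alpha = rs / (p @ Kp)
        x = x + alpha * p
        r = r - alpha * Kp
        rs_new = r @ r
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```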

ssydasheng commented 6 years ago

I think the primary gain over MVM is speed. I tried a 10-dimensional grid with 10 points in each dimension, and the root decomposition is very, very slow. I guess that although MVMs are fast with Kronecker structure, the memory cost is still large on a multidimensional grid, which slows the computation down a lot. With a Kronecker root decomposition applied, however, this should finish almost instantly.
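
A rough illustration of what I mean, with toy stand-in factors (hypothetical, not the exact setup I benchmarked):

```python
import torch

# Ten 10 x 10 SPD factors stand in for the per-dimension grid kernels.
factors = [torch.eye(10) + 0.1 * torch.ones(10, 10) for _ in range(10)]
roots = [torch.linalg.cholesky(K) for K in factors]  # ten tiny Choleskys
# The root of the full product is the Kronecker product of these roots;
# it is never formed explicitly and can be applied factor by factor,
# whereas a Lanczos root decomposition iterates on the full 10^10 space.
```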

ssydasheng commented 6 years ago

Maybe I misunderstood the code, but why is there a problem when the lazy_vars are themselves LazyVariables? For KroneckerProductLazyVariable, an InverseLazyVariable or SquareLazyVariable should be implementable so that it is applied to each factor. At least CG and Lanczos should be applicable factor-wise, right? Roughly what I am imagining is sketched below.
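
A sketch with dense factors for illustration; the explicit `torch.linalg.inv` on each small factor could equally be a per-factor CG or Lanczos step:

```python
import torch

def kron_solve(factors, v):
    # Exact (K_1 ⊗ ... ⊗ K_d)^{-1} v, applied factor by factor: the
    # inverse of a Kronecker product is the Kronecker product of the
    # inverses, so each mode of v is solved against its own small factor.
    sizes = [K.shape[0] for K in factors]
    x = v.reshape(sizes)
    for i, K in enumerate(factors):
        x = torch.movedim(x, i, -1)
        x = x @ torch.linalg.inv(K).T  # or a per-factor iterative solve
        x = torch.movedim(x, -1, i)
    return x.reshape(-1)
```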

jacobrgardner commented 6 years ago

Do you have an example of how a new KroneckerLazyVariable would look? It would basically need to define a new inv_quad_log_det.
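
For what it's worth, the log-det half has a closed form over the factors; a hypothetical sketch of what such an inv_quad_log_det could compute (not current gpytorch code):

```python
import math
import torch

def kron_log_det(factors):
    # log|K_1 ⊗ ... ⊗ K_d| = sum_i (N / n_i) * log|K_i|, with N = prod n_i.
    sizes = [K.shape[0] for K in factors]
    N = math.prod(sizes)
    return sum((N // n) * torch.logdet(K) for K, n in zip(factors, sizes))
```

The inv_quad term y^T K^{-1} y would then come from a factor-wise solve like the one sketched above, so neither term ever touches the full matrix.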

The main cost with Kronecker structure is the exponential dependence on d of the regularly spaced grid; in 10 dimensions at 10 grid points per dimension, that's 10^10 inducing points. At the moment, the two implemented ways around this are the multiplicative grid interpolation kernel which exploits product structure (e.g. in the RBF kernel) and deep kernel learning, although we may implement SGPR as a lazy variable down the road.

It's not immediately obvious to me personally how we can get around the exponential scaling in a Kronecker lazy variable; however, if you believe you have a method, it could be extremely interesting.

ssydasheng commented 6 years ago

Ah, I missed the multiplicative grid interpolation kernel before. Thank you.