cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch
MIT License

Natural Gradient Optimizer #894

Closed jejjohnson closed 4 years ago

jejjohnson commented 4 years ago

Hello,

I have seen in the literature that GP algorithms which use inducing points and variational inference can sometimes converge slowly when trained with only standard optimizers such as stochastic gradient descent. This is apparently due to the difficulty of jointly optimizing the variational parameters alongside the likelihood and kernel hyperparameters. The authors of that paper suggest that a natural gradient optimizer, which takes gradient steps in the geometry of the variational distribution rather than plain Euclidean parameter space, has been shown to yield big improvements, especially in large-scale GP methods like SVGP and DeepGPs.
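For reference, below is a minimal sketch of the usual recipe: natural-gradient steps on the variational parameters combined with a standard optimizer (Adam here) for the kernel and likelihood hyperparameters. The names `gpytorch.optim.NGD` and `NaturalVariationalDistribution` follow the interface in recent GPyTorch releases, not necessarily what is on the experimental branch, so treat the exact signatures as an assumption.

```python
import torch
import gpytorch


class SVGPModel(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        # The natural parameterization is what lets NGD take
        # closed-form natural-gradient steps on q(u)
        variational_distribution = gpytorch.variational.NaturalVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution,
            learn_inducing_locations=True,
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )


# Toy 1D regression data, just to make the sketch runnable
train_x = torch.linspace(0, 1, 512).unsqueeze(-1)
train_y = torch.sin(6 * train_x.squeeze()) + 0.1 * torch.randn(512)

model = SVGPModel(inducing_points=train_x[:32])
likelihood = gpytorch.likelihoods.GaussianLikelihood()
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0))

# Natural gradients for the variational parameters only...
ngd = gpytorch.optim.NGD(
    model.variational_parameters(), num_data=train_y.size(0), lr=0.1
)
# ...and Adam for the kernel/likelihood hyperparameters
adam = torch.optim.Adam([
    {"params": model.hyperparameters()},
    {"params": likelihood.parameters()},
], lr=0.01)

model.train()
likelihood.train()
for _ in range(100):
    ngd.zero_grad()
    adam.zero_grad()
    loss = -mll(model(train_x), train_y)  # negative ELBO
    loss.backward()
    ngd.step()
    adam.step()
```

The key point is the split: the NGD steps only touch the natural parameters of q(u), Adam handles everything else, and both optimizers are stepped on the same ELBO loss each iteration.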

I've seen that a NatGrad optimizer has already been implemented in this branch of the GPyTorch library. I wanted to know what everyone's experience with it has been, and why it is not in the main branch. Perhaps GPyTorch's matrix-vector-multiplication-based methods don't suffer from the same convergence issues with joint optimization? Or is it just a matter of code coverage?

gpleiss commented 4 years ago

Hi @jejjohnson - the natural gradients branch was an experimental branch that we started a while ago. We are definitely interested in incorporating them into GPyTorch, though we'll probably have to make some significant updates to that branch :)

We have a number of updates to our variational models in the pipeline. I think this would be a useful thing to add to the list.