UCL-SML / Doubly-Stochastic-DGP

Deep Gaussian Processes with Doubly Stochastic Variational Inference
Apache License 2.0

Using natural gradient for the distribution of inducing variables in the inner layers #32

Open Hebbalali opened 5 years ago

Hebbalali commented 5 years ago

Dear Salimbeni,

In your demo you use the natural gradient to optimize the distribution of the inducing variables at the final layer. I thought it might be interesting to use the natural gradient for the distributions of the inducing variables in the inner layers as well. However, I always get an error in the Cholesky decomposition: "Cholesky decomposition was not successful. The input might not be valid." I never get this error when using the natural gradient only for the final layer. Did you encounter this problem? Thank you in advance.
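For reference, this is roughly the setup I mean: a minimal sketch in the style of the repo's demo, assuming GPflow 1.x (the `q_mu` / `q_sqrt` attributes are the ones on the layers in `doubly_stochastic_dgp/layers.py`; the gamma value and the optimizer calls are illustrative, not a tested recipe):

```python
# Sketch: natural gradients on the variational distribution q(u) of EVERY layer,
# with Adam handling the remaining (hyper)parameters. Assumes GPflow 1.x and a
# DGP model from this repo; `model` and `iterations` are defined elsewhere.
from gpflow.training import AdamOptimizer, NatGradOptimizer

gamma = 0.1  # nat grad step size

# Variational parameters of all layers, not only the final one.
ng_vars = [(layer.q_mu, layer.q_sqrt) for layer in model.layers]
for q_mu, q_sqrt in ng_vars:
    q_mu.trainable = False    # keep Adam away from the nat grad parameters
    q_sqrt.trainable = False

nat_opt = NatGradOptimizer(gamma)
adam_opt = AdamOptimizer(0.01)

# The demo builds the optimize ops once for speed; minimize(maxiter=1) keeps
# this sketch short.
for _ in range(iterations):
    nat_opt.minimize(model, var_list=ng_vars, maxiter=1)  # nat grad step on all q(u)
    adam_opt.minimize(model, maxiter=1)                   # Adam step on the rest
```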

hughsalimbeni commented 5 years ago

Yes, you can certainly do this, but you might end up taking overly large steps. I've played around with this myself a bit, and it's not totally straightforward to make it work without tuning. If you only use natural gradients for the final layer with a Gaussian likelihood, things are easy (e.g. gamma=0.05 will likely work). For non-Gaussian likelihoods or inner layers the method still works, but care is needed not to use too large a step size. If you see a Cholesky failed error it's probably for this reason. A simple thing to try is reducing the step size. Alternatively, some adaptive method can work well.
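To make that concrete, one possible arrangement is a smaller step for the inner layers than for the final layer. This is only a sketch; the gamma values are illustrative rather than tuned:

```python
# Sketch: separate, smaller nat grad step for the inner layers (GPflow 1.x style).
# The variational parameters should be marked non-trainable for Adam as in the
# sketch above; the gammas here are illustrative, not recommendations.
from gpflow.training import NatGradOptimizer

inner_vars = [(layer.q_mu, layer.q_sqrt) for layer in model.layers[:-1]]
final_vars = [(model.layers[-1].q_mu, model.layers[-1].q_sqrt)]

nat_inner = NatGradOptimizer(gamma=0.01)  # cautious step to avoid Cholesky failures
nat_final = NatGradOptimizer(gamma=0.05)  # the easy case with a Gaussian likelihood

for _ in range(iterations):
    nat_inner.minimize(model, var_list=inner_vars, maxiter=1)
    nat_final.minimize(model, var_list=final_vars, maxiter=1)
    # ...plus an Adam step on the hyperparameters, as in the demo.
```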

hughsalimbeni commented 5 years ago

An update: I've done some more experiments with this and have found that nat grads can work well if tuned correctly, but can sometimes converge to poor local optima. This is to be expected, I think, since the optimization of the inner layers is potentially highly multimodal, so momentum-based optimization might find a better optimum in practice.
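In other words, a configuration along the lines of the following sketch (nat grad only on the final layer, momentum-based Adam on everything else, including the inner-layer q(u); roughly what the demo does, with illustrative values):

```python
# Sketch: nat grad for the final layer only; Adam, a momentum-based optimizer,
# handles the inner-layer q(u) and the hyperparameters (GPflow 1.x style).
from gpflow.training import AdamOptimizer, NatGradOptimizer

final = model.layers[-1]
final.q_mu.trainable = False
final.q_sqrt.trainable = False

nat_opt = NatGradOptimizer(gamma=0.05)
adam_opt = AdamOptimizer(0.01)
for _ in range(iterations):
    nat_opt.minimize(model, var_list=[(final.q_mu, final.q_sqrt)], maxiter=1)
    adam_opt.minimize(model, maxiter=1)
```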

Hebbalali commented 5 years ago

I've run some more experiments, and the interesting observation is that when it is possible to use nat grads for the inner layers with a relatively high gamma (0.1), the results are better than with nat grad only in the last layer, in terms of the ELBO, the uncertainty estimates, and the predictions of the final model (my experiments were restricted to the two-layer case). However, when I'm restricted to a small gamma, the nat grad for the inner layers is very slow to converge, which is expected. My guess is that if we could keep a good gamma value for the nat grads in all the inner layers, we might get better results.

The issue with the Cholesky decomposition comes from the ill-conditioned updated nat2 parameter when we try to transform it back to the covariance parameter. I was wondering whether, instead of updating the nat2 parameter, we could update its square-root matrix L_nat2. This means the gradient must be taken with respect to L_nat2 rather than the second natural parameter, but it would constrain the updated nat2 to be positive definite. Do you think this could help avoid the problem?
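To illustrate what I mean, here is a tiny numpy example of the map between the second natural parameter and the covariance. The gradient values are made up; this is only to show how a large step can take nat2 out of the valid cone and break the Cholesky:

```python
# Toy illustration (numpy only): for a Gaussian q(u) = N(m, S) the second
# natural parameter is theta2 = -0.5 * inv(S), so S can only be recovered
# while theta2 stays negative definite. A large step breaks this.
import numpy as np

S = np.eye(2)                      # current covariance
theta2 = -0.5 * np.linalg.inv(S)   # current second natural parameter

grad = -np.eye(2)                  # made-up gradient direction for illustration

for gamma in [0.1, 1.0]:
    theta2_new = theta2 - gamma * grad          # nat-grad-style step on theta2
    S_new = -0.5 * np.linalg.inv(theta2_new)    # map back to the covariance
    try:
        np.linalg.cholesky(S_new)
        print("gamma=%.2f: Cholesky ok, covariance is positive definite" % gamma)
    except np.linalg.LinAlgError:
        print("gamma=%.2f: Cholesky failed, theta2 left the valid cone" % gamma)
```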

hughsalimbeni commented 5 years ago

The nat grad code allows the nat grad to be taken with respect to different parameterizations. The default uses the natural parameters, as this worked best in this paper, but that wasn't assessed for a deep GP. I've never tried different parameterizations for the deep GP but would be interested to know how well it works!
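For anyone who wants to experiment with this, the parameterization can be passed through the var_list. A sketch, assuming GPflow 1.x's XiSqrtMeanVar transform (the import location may differ between GPflow versions, and the gamma is illustrative):

```python
# Sketch: nat grads taken with respect to the (mean, sqrt-covariance)
# parameterization instead of the default natural parameterization.
# The import location of XiSqrtMeanVar may differ between GPflow versions.
from gpflow.training import NatGradOptimizer
from gpflow.training.natgrad_optimizer import XiSqrtMeanVar

var_list = [(layer.q_mu, layer.q_sqrt, XiSqrtMeanVar()) for layer in model.layers]
for q_mu, q_sqrt, _ in var_list:
    q_mu.trainable = False
    q_sqrt.trainable = False

NatGradOptimizer(gamma=0.01).minimize(model, var_list=var_list, maxiter=1000)
```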