cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch

[Docs] Variational inference and sparse approach #967

Closed. cherepanovic closed this issue 4 years ago

cherepanovic commented 4 years ago

Which papers are behind the variational inference and the sparse approach of gpytorch?

The doc site gives the following references:

Gardner, Jacob R., Geoff Pleiss, David Bindel, Kilian Q. Weinberger, and Andrew Gordon Wilson. “GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration.” In NeurIPS (2018).
Pleiss, Geoff, Jacob R. Gardner, Kilian Q. Weinberger, and Andrew Gordon Wilson. “Constant-Time Predictive Distributions for Gaussian Processes.” In ICML (2018).
Gardner, Jacob R., Geoff Pleiss, Ruihan Wu, Kilian Q. Weinberger, and Andrew Gordon Wilson. “Product Kernel Interpolation for Scalable Gaussian Processes.” In AISTATS (2018).
Wilson, Andrew G., Zhiting Hu, Ruslan R. Salakhutdinov, and Eric P. Xing. “Stochastic variational deep kernel learning.” In NeurIPS (2016).
Wilson, Andrew, and Hannes Nickisch. “Kernel interpolation for scalable structured Gaussian processes (KISS-GP).” In ICML (2015).
Hensman, James, Alexander G. de G. Matthews, and Zoubin Ghahramani. “Scalable variational Gaussian process classification.” In AISTATS (2015).

Now that some changes are going to be made, could you please give some information about those changes? I am using GPyTorch and I need this for my documentation.

thanks a lot!

jacobrgardner commented 4 years ago

We implement lots of methods from lots of different papers for approximate GPs! The Hensman et al. paper together with "Gaussian Processes for Big Data" are the right references for SVGP-style variational inference; KISS-GP is the appropriate reference if you are using KISS-GP; and the LOVE paper applies if you are using the `fast_pred_var` setting, etc.

Beyond the ones you list, we also implement SGPR from Titsias (2009) and some newer work, like the recent "robust SVGP" paper and a paper we've recently written on fitting predictive distributions.
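For anyone finding this from the docs: below is a minimal, hypothetical sketch of what the SVGP-style setup (Hensman et al., 2015) looks like with these classes, with the LOVE `fast_pred_var` setting turned on at prediction time. The data, sizes, and training loop are made up for illustration; treat it as a sketch rather than the canonical example.

```python
import torch
import gpytorch

# Toy 1-D regression data (made up for illustration)
train_x = torch.linspace(0, 1, 100).unsqueeze(-1)
train_y = torch.sin(6.0 * train_x.squeeze(-1)) + 0.1 * torch.randn(100)


class SVGPModel(gpytorch.models.ApproximateGP):
    """Minimal SVGP-style model (Hensman et al., 2015)."""

    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )


model = SVGPModel(inducing_points=train_x[:10])  # 10 inducing inputs (arbitrary choice)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0))

optimizer = torch.optim.Adam(
    list(model.parameters()) + list(likelihood.parameters()), lr=0.01
)
model.train(); likelihood.train()
for _ in range(200):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()

# LOVE (Pleiss et al., 2018) is used under the fast_pred_var setting
model.eval(); likelihood.eval()
test_x = torch.linspace(0, 1, 50).unsqueeze(-1)
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    preds = likelihood(model(test_x))
```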

cherepanovic commented 4 years ago

Hello Jake (@jacobrgardner)

Which approach is behind the CholeskyVariationalDistribution and the (outdated) AdditiveGridInterpolationVariationalStrategy, in terms of the placement of the inducing points and how they are learned (DKL CIFAR classification example)?

cherepanovic commented 4 years ago

It seems to be KISS-GP, isn't it?

jacobrgardner commented 4 years ago

@cherepanovic AdditiveGridInterpolation refers to SKI with the specific assumption that the kernel decomposes fully additively, i.e., k(x, x') = \sum_{i} k([x]_{i}, [x']_{i}), where [x]_{i} denotes the ith feature in `x`.
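If it helps, I believe the same additive-SKI kernel structure can be written with kernel classes in the current API (AdditiveStructureKernel wrapping a per-dimension GridInterpolationKernel); the snippet below is only a sketch of that kernel construction with made-up grid settings, not a drop-in replacement for the removed strategy.

```python
import gpytorch

d = 4  # number of input features (made up)

# SKI applied to one dimension at a time, summed over dimensions:
# k(x, x') = \sum_i k([x]_i, [x']_i)
per_dim_ski = gpytorch.kernels.GridInterpolationKernel(
    gpytorch.kernels.RBFKernel(),
    grid_size=64,                # arbitrary grid resolution
    num_dims=1,
    grid_bounds=[(-1.0, 1.0)],   # assumes features are scaled to [-1, 1]
)
additive_ski_kernel = gpytorch.kernels.AdditiveStructureKernel(
    gpytorch.kernels.ScaleKernel(per_dim_ski), num_dims=d
)
```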

CholeskyVariationalDistribution just refers to a specific parameterization of the variational distribution, q(u) = N(m, S), where we parameterize S with a full covariance matrix S = LL^T (i.e., the entries of L are the learned parameters).
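In plain PyTorch terms (not GPyTorch's internals), that parameterization amounts to something like this sketch:

```python
import torch

num_inducing = 8  # hypothetical number of inducing points

# Learned parameters: a mean vector m and a lower-triangular factor L
m = torch.zeros(num_inducing, requires_grad=True)
L_raw = torch.eye(num_inducing).clone().requires_grad_(True)

L = torch.tril(L_raw)   # keep only the lower triangle
S = L @ L.T             # S = L L^T is positive semi-definite by construction
q_u = torch.distributions.MultivariateNormal(m, scale_tril=L)
```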

weiyadi commented 4 years ago

In VariationalStrategy, the way q(f) is computed seems wrong: the comments and the implementation appear to be mismatched. For `interp_term`, it should be `cholesky_solve` instead of `triangular_solve`.

gpleiss commented 4 years ago

For VariationalStrategy, triangular_solve is the correct choice. We reparameterize the system and deal with u' = K^{-1/2} u rather than u. If q(u) = N(m, S), then q(u') = N(m', S'), where m' = K^{-1/2} m and S' = K^{-1/2} L L^T K^{-1/2}. The VariationalDistribution directly learns m' and S'.

Given this, the rest of the math in VariationalStrategy should be correct. This is a standard optimization trick called "whitening." GPflow also uses this trick in their variational code.
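To make the whitening concrete, here is a small numerical check in plain PyTorch (not GPyTorch's actual code) showing that, with u' = K^{-1/2} u, the predictive mean only needs a triangular solve against the Cholesky factor of K_uu, and that it matches the unwhitened computation that would go through cholesky_solve:

```python
import torch

torch.manual_seed(0)
M, N = 6, 4  # number of inducing points / test points (made up)

# Toy covariance blocks standing in for K_uu and K_uf from a kernel
A = torch.randn(M, M, dtype=torch.float64)
K_uu = A @ A.T + 1e-2 * torch.eye(M, dtype=torch.float64)
K_uf = torch.randn(M, N, dtype=torch.float64)

L = torch.linalg.cholesky(K_uu)                   # K_uu = L L^T, so "K^{1/2}" = L
m_whitened = torch.randn(M, dtype=torch.float64)  # learned whitened mean m' = L^{-1} m

# Whitened route: E[f] = K_fu K_uu^{-1} (L m') = (L^{-1} K_uf)^T m'
# (older PyTorch versions expose this as torch.triangular_solve with swapped args)
interp_term = torch.linalg.solve_triangular(L, K_uf, upper=False)
mean_whitened = interp_term.T @ m_whitened

# Unwhitened route: recover m = L m', then compute K_fu K_uu^{-1} m via cholesky_solve
m = L @ m_whitened
mean_unwhitened = K_uf.T @ torch.cholesky_solve(m.unsqueeze(-1), L).squeeze(-1)

print(torch.allclose(mean_whitened, mean_unwhitened))  # True
```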

cherepanovic commented 4 years ago

Salut @gpleiss, @jacobrgardner,

How are the parameters initialized in SV-DKL? I have not found it in the paper. Thanks!

cherepanovic commented 4 years ago

I mean these parameters:

model.gp_layer.hyperparameters()
model.gp_layer.variational_parameters()
likelihood.parameters()

gpleiss commented 4 years ago

@cherepanovic

These are initialized to sensible defaults that are encoded directly into the GPyTorch kernel, variational, and likelihood modules.
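If you want to see (or change) those defaults, each module exposes its parameters plus an `initialize` helper. A quick illustrative sketch with a generic kernel and Gaussian likelihood (the SV-DKL CIFAR example uses a SoftmaxLikelihood, but the pattern is the same):

```python
import gpytorch

kernel = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
likelihood = gpytorch.likelihoods.GaussianLikelihood()

# Default initial values baked into the modules
print(kernel.base_kernel.lengthscale)   # RBF lengthscale default
print(kernel.outputscale)               # ScaleKernel outputscale default
print(likelihood.noise)                 # GaussianLikelihood noise default

# Everything that will be optimized, by name
for name, param in kernel.named_parameters():
    print(name, tuple(param.shape))

# Override a default if you want a different starting point
kernel.base_kernel.initialize(lengthscale=0.5)
likelihood.initialize(noise=0.1)
```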

cherepanovic commented 4 years ago

Thank you @gpleiss,

my last question is about the GP posterior. The SV-DKL paper points to the marginal likelihood for kernel learning. Regarding the prediction, only the A(f)y was referred to. I am missing the GP posterior in this paper and the sampling of f from the posterior.

Thanks a lot!

gpleiss commented 4 years ago

> Regarding the prediction, only the A(f)y was referred to.

Not sure what you are referring to here.

> I am missing the GP posterior in this paper and the sampling of f from the posterior.

The GP posterior is approximated by the variational distribution q(f), which is a multivariate normal with mean m and covariance S. These parameters are learned by optimizing the variational ELBO. Sampling from q(f) gives you (approximate) samples from the GP posterior. You cannot get exact posterior samples with SV-DKL (because we are using variational inference).
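Concretely, continuing the hypothetical SVGP sketch earlier in this thread (so `model`, `likelihood`, and `test_x` are assumed from there): calling the model returns q(f) as a MultivariateNormal, and `rsample` draws approximate posterior samples.

```python
# Assumes model, likelihood, test_x from the SVGP sketch above
model.eval(); likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    q_f = model(test_x)                          # approximate posterior q(f) at test_x
    f_samples = q_f.rsample(torch.Size([16]))    # 16 approximate posterior draws of f
    predictive = likelihood(q_f)                 # push q(f) through the likelihood for y
```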

cherepanovic commented 4 years ago

@gpleiss

> Not sure what you are referring to here.

I refer to Equation (1) from the SV-DKL paper:

[Equation (1) of the SV-DKL paper, the multi-class likelihood, was shown here as an image]

q(f) ~ p(f|u) = Mu (?)

I am missing the sampling from the posterior in the SV-DKL paper. I assume that the functions f are sampled from the posterior distribution in Equation (1), right?

gpleiss commented 4 years ago

(1) is the likelihood function for multi-class classification - not the posterior.

From the paper, the variational distribution for f is N(M\mu, M S M^T). You can draw samples from this distribution using the reparameterization trick (see the bottom of page 4): first draw \epsilon ~ N(0, I), and then M(\mu + L\epsilon) should be a sample from the variational distribution of f (here, L is the Cholesky factor of S).
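As a small numerical sketch of that reparameterization step (plain PyTorch with toy shapes; M here is the interpolation matrix from the paper, not a GPyTorch object):

```python
import torch

torch.manual_seed(0)
n, num_inducing = 10, 5  # toy sizes: n data points, num_inducing inducing points

M = torch.randn(n, num_inducing)   # interpolation matrix (toy stand-in)
mu = torch.randn(num_inducing)     # variational mean of q(u)
S = 0.5 * torch.eye(num_inducing)  # variational covariance of q(u) (toy)
L = torch.linalg.cholesky(S)       # S = L L^T

# Reparameterization trick: eps ~ N(0, I), then M (mu + L eps) ~ N(M mu, M S M^T)
eps = torch.randn(num_inducing)
f_sample = M @ (mu + L @ eps)
```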

cherepanovic commented 4 years ago

In the likelihood function (1), are the functions f not sampled from the posterior?

gpleiss commented 4 years ago

The likelihood is conditioned on f (it's p(y | f)). This equation is true regardless of where f comes from (e.g. as a sample from the prior, a sample from the variational distribution, etc.)

gpleiss commented 4 years ago

At this point I'm going to go ahead and close this issue. We're trying to make sure that issues in this repo relate to the actual gpytorch software. The conversation in this issue seems to now be about questions related to a paper/algorithm rather than its gpytorch implementation.