Closed: cherepanovic closed this issue 4 years ago.
We implement lots of methods from lots of different papers for approximate GPs! The Hensman paper and "Gaussian Processes for Big Data" are the right references for SVGP-style variational inference; KISS-GP is the appropriate reference if you are using KISS-GP; the LOVE paper gets used if you are using the `fast_pred_var` setting; etc.
Beyond the ones you list, we also implement SGPR from Titsias, 2009, and some newer stuff like the recent "robust SVGP" paper and a paper we've recently written on fitting predictive distributions.
Hello Jake (@jacobrgardner)
Which approach is behind `CholeskyVariationalDistribution` and the (outdated) `AdditiveGridInterpolationVariationalStrategy`, in terms of the placement of inducing points and how they are learned (DKL CIFAR classification example)?
It seems to be KISS-GP, isn't it?
@cherepanovic `AdditiveGridInterpolation` refers to SKI with the specific assumption that the kernel decomposes fully additively, e.g., k(x, x') = \sum_{i} k([x]_{i}, [x']_{i}), where [x]_{i} denotes the i-th feature in `x`.
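For illustration, a minimal sketch of this additive decomposition using GPyTorch's `AdditiveStructureKernel` wrapper (module path assumes a recent GPyTorch version; the kernel choice is arbitrary):

```python
import gpytorch

# Apply the same base kernel to each input dimension and sum the results,
# i.e. k(x, x') = sum_i k([x]_i, [x']_i).
base_kernel = gpytorch.kernels.RBFKernel()
additive_kernel = gpytorch.kernels.AdditiveStructureKernel(base_kernel, num_dims=3)
```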
`CholeskyVariationalDistribution` just refers to a specific parameterization of the variational distribution, q(u) = N(m, S), where we parameterize S with a full covariance matrix S = LL^T (e.g., L holds the learned parameters).
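As a concrete (hedged) illustration, here is a minimal SVGP-style model that wires a `CholeskyVariationalDistribution` into a `VariationalStrategy`. The model name and kernel choice are arbitrary, and module paths assume a recent GPyTorch version:

```python
import torch
import gpytorch

class ToySVGP(gpytorch.models.ApproximateGP):  # hypothetical toy model
    def __init__(self, inducing_points):
        # q(u) = N(m, S) with S = LL^T, as described above
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self,
            inducing_points,
            variational_distribution,
            learn_inducing_locations=True,  # inducing point placement is learned
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

model = ToySVGP(inducing_points=torch.randn(16, 2))  # 16 inducing points in 2-D
```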
In `VariationalStrategy`, the way q(f) is computed seems wrong: the comments and the implementation appear to be mismatched. For `interp_term`, it should be `cholesky_solve` instead of `triangular_solve`.
For `VariationalStrategy`, `triangular_solve` is the correct choice. We reparameterize the system and deal with u' = K^{-1/2} u rather than u. If q(u) = N(m, S), then q(u') = N(m', S'), where m' = K^{-1/2} m and S' = K^{-1/2} L L^T K^{-1/2}. The `VariationalDistribution` directly learns m' and S'.
Given this, the rest of the math in `VariationalStrategy` should be correct. This is a standard optimization trick called "whitening." GPflow also uses this trick in their variational code.
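A small numerical sketch of why a triangular solve is the right operation under whitening (toy matrices in plain PyTorch, not GPyTorch internals; K^{1/2} is taken to be the Cholesky factor L): with K = LL^T and whitened mean m', the predictive mean K_xu K^{-1} (L m') collapses to (L^{-1} K_ux)^T m'.

```python
import torch

torch.manual_seed(0)
n_u, n_x = 5, 3

# Toy SPD inducing covariance K = L L^T and cross-covariance K_xu
A = torch.randn(n_u, n_u)
K = A @ A.T + n_u * torch.eye(n_u)
K_xu = torch.randn(n_x, n_u)
L = torch.linalg.cholesky(K)

m_prime = torch.randn(n_u, 1)  # whitened variational mean m' = K^{-1/2} m

# Unwhitened route: recover m = L m', then compute K_xu K^{-1} m
mean_direct = K_xu @ torch.cholesky_solve(L @ m_prime, L)

# Whitened route: interp_term = L^{-1} K_ux via a *triangular* solve,
# then the predictive mean is interp_term^T m'
interp_term = torch.linalg.solve_triangular(L, K_xu.T, upper=False)
mean_whitened = interp_term.T @ m_prime

print(torch.allclose(mean_direct, mean_whitened, atol=1e-5))  # True
```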
Hi @gpleiss, @jacobrgardner,
how are the parameters initialized in SV-DKL? I have not found it in the paper. Thanks!
I mean these parameters:
`model.gp_layer.hyperparameters()`
`model.gp_layer.variational_parameters()`
`likelihood.parameters()`
@cherepanovic
the hyperparameters are the kernel params. The kernels all take sensible defaults (e.g. 1.0 lengthscale and 1.0 outputscale).
the variational parameters are initialized so that the variational distribution is equal to the prior distribution, plus a little bit of Gaussian noise.
the likelihood initializes the mixing weights with a standard normal random variable.
These are sensible defaults that are encoded directly into the GPyTorch kernel, variational, and likelihood modules.
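For completeness, a quick sketch of how one might inspect these defaults, assuming the `model`/`likelihood` objects from the question above and that the `named_*` variants of the listed methods are available in your GPyTorch version:

```python
# Print parameter names and shapes at their default initializations
# (the exact names depend on the actual model).
for name, param in model.gp_layer.named_hyperparameters():
    print(name, tuple(param.shape))
for name, param in model.gp_layer.named_variational_parameters():
    print(name, tuple(param.shape))
for name, param in likelihood.named_parameters():
    print(name, tuple(param.shape))
```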
Thank you @gpleiss,
my last question is about the GP posterior. The SV-DKL paper uses the marginal likelihood for kernel learning. Regarding prediction, only A(f)y is referred to. I am missing the GP posterior in this paper and sampling f from the posterior.
Thanks a lot!
> Regarding prediction, only A(f)y is referred to.
Not sure what you are referring to here.
> I am missing the GP posterior in this paper and sampling f from the posterior.
The GP posterior is approximated by the variational distribution q(f), which is a multivariate normal with mean m and covariance S. These parameters are learned by optimizing the variational ELBO. Sampling from q(f) gives you (approximate) samples from the GP posterior. You cannot get exact posterior samples with SV-DKL (because we are using variational inference).
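In GPyTorch terms, that sampling step might look like the following sketch (`model` and `test_x` are assumed from context):

```python
import torch

model.eval()
with torch.no_grad():
    q_f = model(test_x)                      # q(f): a MultivariateNormal
    samples = q_f.rsample(torch.Size([10]))  # 10 approximate posterior samples
```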
@gpleiss
> Not sure what you are referring to here.
I refer to Eq. (1) from the SV-DKL paper:
q(f) ~ p(f | u) = Mu (?)
I am missing the sampling from the posterior in the SV-DKL paper. I assume that the functions f in Equation (1) are samples from the posterior distribution, right?
(1) is the likelihood function for multi-class classification - not the posterior.
From the paper, the variational distribution for f is N(M \mu, M S M^T). You can draw samples from this distribution using the reparameterization trick (see the bottom of page 4). First draw samples \epsilon ~ N(0, I); then M(\mu + L \epsilon) is a sample from the variational distribution of f (here, L is the Cholesky factor of S).
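A toy numerical sketch of that reparameterization step (placeholder shapes and values, not the paper's actual matrices):

```python
import torch

torch.manual_seed(0)
n_f, n_u = 4, 6

M = torch.randn(n_f, n_u)           # placeholder interpolation matrix
mu = torch.randn(n_u, 1)            # variational mean
A = torch.randn(n_u, n_u)
S = A @ A.T + n_u * torch.eye(n_u)  # variational covariance (SPD)
L = torch.linalg.cholesky(S)        # S = L L^T

eps = torch.randn(n_u, 1)           # eps ~ N(0, I)
f_sample = M @ (mu + L @ eps)       # one sample from N(M mu, M S M^T)
```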
In the likelihood function (1), are the functions f not sampled from the posterior?
The likelihood is conditioned on f (it's p(y | f)). This equation is true regardless of where f comes from (e.g. a sample from the prior, a sample from the variational distribution, etc.).
At this point I'm going to go ahead and close this issue. We're trying to make sure that issues in this repo relate to the actual gpytorch software. The conversation in this issue seems to now be about questions related to a paper/algorithm rather than its gpytorch implementation.
Which papers are behind the variational inference and the sparse approach of gpytorch?
On the doc site, the following references are given:
Now some changes will be made; could you please give some information about the changes? I am using gpytorch and I need this for my documentation.
thanks a lot!