GAMES-UChile / mogptk

Multi-Output Gaussian Process Toolkit

Confidence interval with non-Gaussian likelihoods #44

Closed tdewolff closed 9 months ago

tdewolff commented 2 years ago

The variance of non-Gaussian likelihoods is not useful when predicting y values, since the likelihood may be asymmetric. Instead we should return a confidence interval or specific quantiles. Even better, return an approximation obtained by sampling with MCMC.

See: https://arxiv.org/pdf/1206.6391.pdf ?

See https://github.com/GPflow/GPflow/issues/1345

tdewolff commented 9 months ago

See https://proceedings.neurips.cc/paper_files/paper/2015/file/6b180037abbebea991d8b1232f8a8ca9-Paper.pdf

We use variational expectation to train the GP with non-Gaussian likelihoods (see Hensman). Thus our function values are approximated by a Gaussian q(f), but our likelihood p(y|f) is not Gaussian. For the predictive posterior of the y values (not the function values), we currently use Gauss-Hermite quadrature to estimate the mean and variance of the y values, but the variance doesn't make much sense for many likelihoods (such as Bernoulli or LogNormal).

Instead, we should work with confidence intervals rather than (normal) variances. This confidence interval can be ±1σ (i.e. about 68%) by default, which is what we already use for the Gaussian likelihood. For non-Gaussian likelihoods, we should estimate the boundaries of the confidence interval. For that we need to sample from p(y*|y) = ∫ p(y*|f*) p(f*|y) df*, with p(f*|y) ≈ q(f*) = N(mu, var). Here mu and var are already known from the trained GP model, and p(y*|f*) is our non-Gaussian likelihood.

We can calculate the mean of p(y*|y) using Gauss-Hermite quadrature to approximate the integral, but for the two confidence boundaries we need to sample, perhaps using MCMC. How?
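As an illustration (not mogptk's API; the function name `predictive_mean` and the probit example are just for demonstration), this is roughly how Gauss-Hermite quadrature gives the predictive mean under q(f*) = N(mu, var):

```python
import numpy as np
from scipy.stats import norm

def predictive_mean(mu, var, lik_mean, deg=20):
    """Approximate E[y*] = int E[y*|f*] N(f*; mu, var) df* with Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(deg)  # nodes/weights for weight exp(-t^2)
    f = mu + np.sqrt(2.0 * var) * t              # change of variables to N(mu, var)
    return np.sum(w * lik_mean(f)) / np.sqrt(np.pi)

# Example: Bernoulli likelihood with a probit link, where E[y*|f*] = Phi(f*).
print(predictive_mean(mu=0.3, var=0.5, lik_mean=norm.cdf))
```

The same quadrature gives E[(y*)^2] and hence a variance, but as noted above that variance is not a good summary of an asymmetric predictive distribution.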

tdewolff commented 9 months ago

Support for confidence intervals has been added with https://github.com/GAMES-UChile/mogptk/commit/665e3d385ef6de5d22e7b3e9e7a66449db667ef0

It uses a (naive) method: sample n times from the GP posterior p(f*|y), then use those values to sample from the likelihood p(y*|f*). Sorting the resulting values gives a crude empirical CDF, from which any quantile can be read off simply by indexing. This requires a large n, which makes it slow and memory-hungry. I plan to see if MCMC can give us a more accurate representation of the CDF with (much) fewer points.
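For reference, a minimal sketch of that naive approach (the function name `predictive_quantiles` and the LogNormal example are illustrative, not the actual mogptk code):

```python
import numpy as np

def predictive_quantiles(mu, var, lik_sample, quantiles=(0.1587, 0.8413), n=100_000):
    """Crude quantiles of p(y*|y): sample f* ~ q(f*), then y* ~ p(y*|f*), then sort."""
    rng = np.random.default_rng(0)
    f = rng.normal(mu, np.sqrt(var), size=n)  # samples from the GP posterior q(f*)
    y = np.sort(lik_sample(rng, f))           # samples from p(y*|f*); sorted = crude CDF
    return [y[int(q * (n - 1))] for q in quantiles]

# Example: LogNormal likelihood where f* is the log-scale location; the two
# default quantiles correspond to the ±1σ (~68%) interval.
lo, hi = predictive_quantiles(0.0, 0.25, lambda rng, f: rng.lognormal(f, 0.5))
```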

tdewolff commented 9 months ago

Upon further investigation, the currently implemented technique of Monte Carlo approximation of the integral is actually the best way to approximate the confidence intervals. Since we use variational expectation and approximate the GP's posterior with a normal distribution, and since our likelihood is known, we can easily sample from the posterior and then sample from the likelihood (which is conditional on the posterior sample). These samples are drawn from the joint distribution of the GP's posterior and the likelihood, and can thus be used directly to perform MC estimation of p(y*) = ∫ p(y*|f*) q(f*) df*.

To convince myself, I've written a Gibbs sampler (MCMC over the conditional distributions) using the Metropolis-Hastings algorithm, and it gives the same result as simple random sampling (MC) but is much, much slower. Both approach the target distribution. See https://github.com/GAMES-UChile/mogptk/blob/master/tests/mcmc.py
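The sketch below illustrates the idea (Metropolis-Hastings within Gibbs on the joint q(f*) p(y*|f*), with a LogNormal likelihood as an example). It is not the code from tests/mcmc.py, just a minimal illustration of why the two approaches agree:

```python
import numpy as np
from scipy.stats import norm, lognorm

def mh_gibbs(mu, var, loglik, lik_sample, n=20_000, step=0.5, seed=0):
    """Sample y* from int p(y*|f*) q(f*) df* by alternating an MH update of f*|y*
    (target proportional to N(f*; mu, var) p(y*|f*)) with an exact draw of y*|f*."""
    rng = np.random.default_rng(seed)
    f = mu
    y = lik_sample(rng, f)
    ys = np.empty(n)
    for i in range(n):
        # Metropolis-Hastings step for f* given y* (random-walk proposal)
        prop = f + step * rng.normal()
        log_acc = (norm.logpdf(prop, mu, np.sqrt(var)) + loglik(y, prop)
                   - norm.logpdf(f, mu, np.sqrt(var)) - loglik(y, f))
        if np.log(rng.uniform()) < log_acc:
            f = prop
        # exact Gibbs step for y* given f*
        y = lik_sample(rng, f)
        ys[i] = y
    return ys

# Example: LogNormal likelihood, f* is the location of the underlying normal (sigma = 0.5).
loglik = lambda y, f: lognorm.logpdf(y, 0.5, scale=np.exp(f))
lik_sample = lambda rng, f: rng.lognormal(f, 0.5)
samples = mh_gibbs(0.0, 0.25, loglik, lik_sample)
print(np.quantile(samples, [0.1587, 0.8413]))  # approx. the ±1σ confidence boundaries
```

The MH chain is autocorrelated, so it needs many more iterations to match the accuracy of n independent draws, which is consistent with the direct sampler being both simpler and faster here.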

Since the random sampler is already implemented, this issue is resolved! See the commit above.