STOR-i / GaussianProcesses.jl

A Julia package for Gaussian Processes
https://stor-i.github.io/GaussianProcesses.jl/latest/

trial sparse approximations with subset of regressors implementation #103

Closed · maximerischard closed this issue 5 years ago

maximerischard commented 5 years ago

I've just finished implementing a very basic version of the Subset of Regressors (SoR) sparse approximation to GPs. But before I go any further with this, I want to discuss whether this is the right way to implement sparse approximations.

The idea is that sparse approximations will be represented as a special type of positive definite matrix. This makes some sense, as operations involving the covariance matrix are really where the sparsity comes in. One would also hope that this would play nicely with the MCMC approach.
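To make the idea concrete, here is a rough sketch of what such a matrix type could look like, assuming the AbstractPDMat interface from PDMats.jl. The type name, field names, and the mll helper are made up for illustration; this is not the code on the branch.

using LinearAlgebra
using PDMats

# Hypothetical low-rank-plus-noise covariance Σ = Kuf' * inv(Kuu) * Kuf + σ²I,
# stored via the m×n cross-covariance and a Cholesky factorisation of Kuu.
struct LowRankPDMat{T<:Real} <: AbstractPDMat{T}
    Kuf::Matrix{T}              # m × n covariance between inducing and training inputs
    Kuu::Cholesky{T,Matrix{T}}  # m × m inducing-point covariance, factorised
    σ2::T                       # observation noise variance
    n::Int                      # number of training points
end

Base.size(a::LowRankPDMat) = (a.n, a.n)

# Σ \ y via the Woodbury identity: only m × m factorisations are needed.
function Base.:\(a::LowRankPDMat, y::AbstractVector)
    A = a.Kuu.U' \ a.Kuf                          # m × n, so that Q = A'A
    S = cholesky(Symmetric(A * A' + a.σ2 * I))    # m × m
    return (y .- A' * (S \ (A * y))) ./ a.σ2
end

# logdet(Σ) via the matrix determinant lemma.
function LinearAlgebra.logdet(a::LowRankPDMat)
    A = a.Kuu.U' \ a.Kuf
    S = cholesky(Symmetric(A * A' + a.σ2 * I))
    return logdet(S) + (a.n - size(A, 1)) * log(a.σ2)
end

# These two operations are all the marginal log-likelihood needs:
# log p(y) = -½ y'Σ⁻¹y − ½ logdet(Σ) − (n/2) log(2π)
mll(Σ, y) = -0.5 * dot(y, Σ \ y) - 0.5 * logdet(Σ) - 0.5 * length(y) * log(2π)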

Ideally, one would be able to just define the required methods for AbstractPDMat, such as left-division and whiten!, just as @jbrea did for his ElasticPDMat, and then everything would work. In fact, for SoR I found that implementing \ and logdet was enough to compute the marginal loglikelihood. But for predictions there are three problems:

I have not yet looked into computing derivatives of the marginal loglikelihood, which will also require some thought.

So the AbstractPDMat abstraction only took me so far. The other thing that bothers me is that the inducing points are stored inside the PDMat object, along with some precomputed matrices. To me this feels fragile, as it means the sparse covariance matrix cannot be rebuilt from scratch, and state is spread across several objects in a way that might confuse future developers.

In short, this does work, but I want to get some ideas and thoughts from you before I dive in any further.

An example:

using GaussianProcesses
using Distributions
using StatsBase        # provides `sample`, used below to pick inducing points
using LaTeXStrings
using LinearAlgebra

import PyPlot; plt=PyPlot

# simulate some data
""" The true function we will be simulating from. """
function fstar(x::Float64)
    return abs(x-5)*cos(2*x)
end

σy = 10.0
n=10000

Xdistr = Beta(7,7)
ϵdistr = Normal(0,σy)
x = rand(Xdistr, n)*10
Y = fstar.(x) .+ rand(ϵdistr,n)

# fit a regular GP
k = SEIso(log(0.3), log(5.0))
gp = GPE(x', Y, MeanConst(mean(Y)), k, log(σy))
@time gp = GPE(x', Y, MeanConst(mean(Y)), k, log(σy))
# 6.952195 seconds (37 allocations: 2.235 GiB, 1.51% gc time)
# predictions from the exact GP on a grid of test points
xx = range(0, stop=10, length=200)
μ_exact, Σ_exact = predict_f(gp, xx; full_cov=true)
@time predict_f(gp, xx; full_cov=true)
# 0.661820 seconds (918.97 k allocations: 76.883 MiB, 3.89% gc time)

# pick some inducing points at random
Xu = Matrix(sample(x, 10; replace=false)')

# fit a sparse GP
gp_SOR = GaussianProcesses.SoR(x', Xu, Y, MeanConst(mean(Y)), k, log(σy));
@time gp_SOR = GaussianProcesses.SoR(x', Xu, Y, MeanConst(mean(Y)), k, log(σy));
# 0.003469 seconds (53 allocations: 2.677 MiB)

gp_SOR.mll, gp.mll
# (-37216.77118274264, -37223.82480148437)

μSORgp, ΣSORgp = predict_f(gp_SOR, xx; full_cov=true)
@time predict_f(gp_SOR, xx; full_cov=true);
# 0.001098 seconds (55 allocations: 851.766 KiB)

#=======
# Plot predictions
========#
cbbPalette = ["#E69F00", "#56B4E9", "#009E73", 
                "#F0E442", "#0072B2", "#D55E00", 
                "#CC79A7"];
plt.plot(x,Y, ".", color="black", markeredgecolor="None", alpha=0.1, label="simulated data")
plt.plot(xx, fstar.(xx), color=cbbPalette[2], label=L"f^\star(X)", linewidth=2)
plt.plot(xx, μ_exact, color=cbbPalette[3], label=L"$f(x) \mid Y, \hat\theta$")
plt.fill_between(xx, μ_exact.-sqrt.(diag(Σ_exact)), μ_exact.+sqrt.(diag(Σ_exact)), 
                 color=cbbPalette[3], alpha=0.3)
plt.plot(xx, μSORgp, color=cbbPalette[6], label=L"$f(x) \mid Y, \hat\theta$ SoR")
y_err = sqrt.(diag(ΣSORgp))
plt.fill_between(xx, μSORgp.-y_err, μSORgp.+y_err, 
                 color=cbbPalette[6], alpha=0.5)
plt.xlabel(L"X")
plt.ylabel(L"Y")
plt.title("Subset of Regressors")
plt.legend(loc="upper center", fontsize="small")
plt.ylim(-10,10)
plt.plot(vec(Xu), fill(-5.0, length(Xu)), linestyle="None", 
    marker=6, markersize=12, color="black", label=L"inducing variables $X_\mathbf{u}$")

(Plot: simulated data with the exact GP and SoR predictive means and one-standard-deviation bands, and the inducing-point locations marked along the bottom.)

(Obviously in this case it's a pretty bad approximation, but that's beside the point!)

chris-nemeth commented 5 years ago

Thanks @maximerischard for making a start on this. @fairbrot and I are meeting for lunch tomorrow, so we'll both take a look and give you some feedback then.

maximerischard commented 5 years ago

Thank you for having a look Chris. I think yes, something like FITC would look similar. I'll implement it too once we've settled on the interface.

Regarding autodiff, so far what I've looked into is using autodiff for elements of the covariance matrix. My assumption has been that going from covariance to likelihood, we would continue to use analytical derivatives. Are you thinking autodiff could be useful there too?
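As a toy illustration of that split (not code from this branch): a single covariance entry can be written as a plain function of the hyperparameters and differentiated with ForwardDiff, while the derivative of the likelihood with respect to the covariance entries would stay analytical.

using ForwardDiff

# Squared-exponential covariance between two scalar inputs, written as a
# plain function of the log length-scale and log signal standard deviation.
se(x1, x2, logl, logσ) = exp(2logσ) * exp(-(x1 - x2)^2 / (2 * exp(2logl)))

# ∂k(x1, x2)/∂logl for one pair of inputs via forward-mode autodiff; the
# chain rule from the covariance matrix to the likelihood stays analytical.
dk_dlogl = ForwardDiff.derivative(logl -> se(1.0, 2.0, logl, log(5.0)), log(0.3))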

I agree this is the right time to think about function names. I also agree that a proliferation of GPxxx methods would end up leading to confusion. Also keep in mind that I'm hoping these approximations will be useful for GPMC as well, so this isn't just a replacement for GPE.

I've thought a bit more about the awkwardness of having important parameters (the inducing points) stored in the AbstractPDMat object. An alternative would be to define an AbstractCovarianceStrategy type, with subtypes like FullCovariance and SubsetOfRegressors. These strategies would be the first argument of methods like update_cK and predict, and would also be stored in the GP objects. FullCovariance would be an empty shell, or perhaps hold a few settings that dictate how KernelData is obtained. SubsetOfRegressors would contain the inducing points. This would also allow different strategies to share the same subtype of AbstractPDMat. What do you think?
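Roughly, a sketch of what that could look like; the names mirror the ones proposed above, and the method signatures are only indicative of how dispatch would work, not the code in this PR.

# Sketch of the proposed covariance-strategy hierarchy.
abstract type AbstractCovarianceStrategy end

# Dense case: essentially an empty shell, possibly holding settings that
# dictate how KernelData is obtained.
struct FullCovariance <: AbstractCovarianceStrategy end

# Sparse case: the strategy owns the inducing-point locations.
struct SubsetOfRegressors{M<:AbstractMatrix} <: AbstractCovarianceStrategy
    Xu::M
end

# Methods like update_cK and predict would take the strategy as their
# first argument and dispatch on it:
# update_cK(::FullCovariance, gp)      -> dense PDMat, current behaviour
# update_cK(s::SubsetOfRegressors, gp) -> low-rank PDMat built from s.Xu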

chris-nemeth commented 5 years ago

I like the idea of different subtypes for AbstractPDMat, @fairbrot - what do you think? Regarding the naming convention and the interface, I wonder if the best option might be to create separate functions for FITC, etc., in the same way as you have SubsetOfRegressors, i.e. not worry about the naming convention at the moment. I recall that you already have most of the code for these other approximations. Once the code is implemented and working, it might be easier for us to see i) ways of simplifying the code by removing repetition among the sparse GP functions, and ii) what similarities there are with GPE. Also, we should probably look at other packages to see if we can get some inspiration for how to handle these functions and what a sensible naming convention might be.

maximerischard commented 5 years ago

This is getting closer to being ready to merge. I have now implemented the subset of regressors (SoR), deterministic training conditional (DTC), and fully independent training conditional (FITC) approximations. I've implemented gradients of the marginal likelihood for all of them, though not with respect to the inducing points. The changes I've made to GPMC are also quite hacky, and the approximations are sadly not yet compatible with MCMC.
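As a reminder of how the three approximations differ (following Quiñonero-Candela & Rasmussen, 2005, with $Q_{ab} = K_{au} K_{uu}^{-1} K_{ub}$), the effective training covariance that replaces $K_{ff} + \sigma^2 I$ is:

\Sigma_{\mathrm{SoR}}  = Q_{ff} + \sigma^2 I \quad \text{(prior also replaced by } Q \text{ at test points)}
\Sigma_{\mathrm{DTC}}  = Q_{ff} + \sigma^2 I \quad \text{(exact prior variance } K_{**} \text{ retained at test points)}
\Sigma_{\mathrm{FITC}} = Q_{ff} + \mathrm{diag}(K_{ff} - Q_{ff}) + \sigma^2 I

So SoR and DTC share the same marginal likelihood and gradients and differ only in the predictive variance, while FITC additionally corrects the diagonal of the training covariance.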

chris-nemeth commented 5 years ago

Excellent! Looking forward to seeing these additions included in the package.