JuliaGaussianProcesses / AugmentedGPLikelihoods.jl

Provide all functions needed to work with augmented likelihoods (conditionally conjugate with Gaussians)

How to do hyperparameter tuning #76

Open simsurace opened 2 years ago

simsurace commented 2 years ago

I tried to AD aug_elbo in the NegBinomialLikelihood example (with the unnecessary bits removed), purposefully avoiding ParameterHandling.jl and using only ForwardDiff.gradient:

# # Negative Binomial

# We load all the necessary packages
using AbstractGPs
using ApproximateGPs
using AugmentedGPLikelihoods
using Distributions
using ForwardDiff # <-- try this first
using LinearAlgebra

# We create some random data (sorted for plotting reasons)
N = 100
x = range(-10, 10; length=N)
kernel = with_lengthscale(SqExponentialKernel(), 2.0)
gp = GP(kernel)
lik = NegBinomialLikelihood(15)
lf = LatentGP(gp, lik, 1e-6)
f, y = rand(lf(x));

# ## ELBO
# How can one compute the augmented ELBO?
# Again, AugmentedGPLikelihoods provides helper functions
# so that we do not have to compute everything ourselves.
function aug_elbo(lik, u_post, x, y)
    qf = marginals(u_post(x))
    qΩ = aux_posterior(lik, y, qf)
    return expected_logtilt(lik, qΩ, y, qf) - aux_kldivergence(lik, qΩ, y) -
           kldivergence(u_post.approx.q, u_post.approx.fz)     # approx.fz is the prior and approx.q is the posterior 
end

function u_posterior(fz, m, S)
    return posterior(SparseVariationalApproximation(Centered(), fz, MvNormal(m, S)))
end

# ## Try to differentiate loss function

function makeloss(x, y)
    N = length(x)
    function loss(θ)
        k = ScaledKernel(
            RBFKernel() ∘ ScaleTransform(inv(θ[1])), # θ[1]: kernel lengthscale
            θ[2],                                    # θ[2]: kernel variance
        )
        gp = GP(k)
        lik = NegBinomialLikelihood(θ[3]) # θ[3]: NegBinomial parameter r
        fz = gp(x, 1e-8);
        u_post = u_posterior(fz, zeros(N), Matrix{Float64}(I(N)))
        return aug_elbo(lik, u_post, x, y)
    end
end

θ = [1., 1., 15.]

loss = makeloss(x, y)
loss(θ) # works!
ForwardDiff.gradient(loss, θ) # MethodError

There is an easy fix (happy to open a PR): change the definition of aux_posterior to

function aux_posterior(lik::NegBinomialLikelihood, y, f)
    c = sqrt.(second_moment.(f))
    return For(TupleVector(; y=y, c=c)) do φ
        NTDist(PolyaGamma(φ.y + lik.r, φ.c)) # Distributions uses a different parametrization
    end
end
With this change, the gradient goes through:

julia> ForwardDiff.gradient(loss, θ)
3-element Vector{Float64}:
  5.790557942012172e7
 -1.9761748845444782e9
 16.184871970106013

BTW: is it expected that the values of the augmented ELBO are so much larger in magnitude than the normal ELBO?

theogf commented 2 years ago

Hi @simsurace,

Sorry I did not answer earlier (and save you some time), but I got hit by Covid. Actually, aux_posterior! should not be differentiated through at all! It should be ignored during the AD pass. When using Zygote, I wrap that block in a Zygote.@ignore; I don't know whether the same is possible with ForwardDiff, though. The reason is that the aux_posterior step is already an implicit optimization.
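
For reference, a minimal, untested sketch of what that could look like with Zygote as the AD backend, reusing the aug_elbo pieces from the snippet above; the only change is that the aux_posterior call is wrapped so it is skipped by the AD pass:

using Zygote

function aug_elbo(lik, u_post, x, y)
    qf = marginals(u_post(x))
    # qΩ is the solution of an implicit (coordinate-ascent) optimization,
    # so it is treated as a constant during the gradient computation.
    qΩ = Zygote.@ignore aux_posterior(lik, y, qf)
    return expected_logtilt(lik, qΩ, y, qf) - aux_kldivergence(lik, qΩ, y) -
           kldivergence(u_post.approx.q, u_post.approx.fz)
end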

simsurace commented 2 years ago

Thanks, I did not realize that this is an implicit optimization. So is it independent of the hyperparameters? If so, this and PR #77 would be unnecessary. I will give it a try and see if the results change! Sorry to hear about your illness, hope you get better soon.

simsurace commented 2 years ago

If it works out with the ignore statements, I could convert the PR into a documentation thing where this is explained.

theogf commented 2 years ago

> Thanks, I did not realize that this is an implicit optimization. So is it independent of the hyperparameters? If so, this and PR #77 would be unnecessary. I will give it a try and see if the results change! Sorry to hear about your illness, hope you get better soon.

That's an interesting question; it actually depends on the parametrization. Right now I am parametrizing with m and S, the mean and covariance. But one could parametrize the covariance as (K^{-1} + D)^{-1}, and similarly for the mean; there the hyperparameters could be optimized as well, but that's a more complicated matter.

In summary, for full GPs the kernel parameters only matter through the KL(q(f)||p(f)) term, and for sparse GPs they also enter the expected log-likelihood, but that's it.
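
To spell out the structured parametrization mentioned above (D and h are names introduced here just for illustration: D is the diagonal matrix of expected augmented precisions and h the expected potential), it would read roughly

S = (K^{-1} + D)^{-1},    m = S h    (for a zero-mean prior),

so K, and with it the kernel hyperparameters, would enter the variational parameters directly.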

simsurace commented 2 years ago

Just to clarify my understanding: the qΩ = aux_posterior(lik, y, qf) step should be ignored by the AD system, even though lik and qf depend on the parameters one wants to optimize over, such as likelihood parameters, inducing-point locations, and variational parameters?

theogf commented 2 years ago

Oh yeah, sorry, somehow I got confused with the updates on q(f). But it's the same thing: q(Ω) is optimized via aux_posterior, and once it is obtained we can compute the ELBO and optimize the remaining hyperparameters.

simsurace commented 2 years ago

EDIT: Ah, I think I understand now: one should not expose the variational parameters to the optimizer, but instead run an internal CAVI loop for them.

I'm still struggling to make it work, though. Do you have a working example of hyperparameter optimization with the augmented ELBO?

No hurry though; it would be nice to make this work and compare it to the normal SVGP optimization loop in terms of speed.
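
For readers landing here, a rough, untested sketch of the structure discussed above: an inner CAVI loop updates the variational parameters and is ignored by AD, and the augmented ELBO is then differentiated with respect to the hyperparameters only. It reuses u_posterior and the Zygote-wrapped aug_elbo from earlier in the thread; the update formulas and the helpers expected_auglik_precision / expected_auglik_potential are assumptions based on the package's examples and should be checked against the current API.

using LinearAlgebra, Zygote

# Inner loop: coordinate-ascent updates of (m, S) with the hyperparameters held fixed.
# NOTE: expected_auglik_precision / expected_auglik_potential are assumed here to
# follow the package's example API (returning one vector per latent process).
function cavi!(lik, fz, x, y, m, S; niter=10)
    K = cov(fz)
    for _ in 1:niter
        qf = marginals(u_posterior(fz, m, S)(x))
        qΩ = aux_posterior(lik, y, qf)
        S .= inv(Symmetric(inv(K) + Diagonal(only(expected_auglik_precision(lik, qΩ, y)))))
        m .= S * only(expected_auglik_potential(lik, qΩ, y)) # zero prior mean assumed
    end
    return m, S
end

# Outer loss: only the hyperparameters θ are differentiated; the CAVI step is ignored.
function make_nloss(x, y, m, S)
    function nloss(θ)
        k = ScaledKernel(RBFKernel() ∘ ScaleTransform(inv(θ[1])), θ[2])
        lik = NegBinomialLikelihood(θ[3])
        fz = GP(k)(x, 1e-8)
        Zygote.@ignore cavi!(lik, fz, x, y, m, S)
        return -aug_elbo(lik, u_posterior(fz, m, S), x, y)
    end
end

# Something like
#   Zygote.gradient(make_nloss(x, y, zeros(N), Matrix(Diagonal(ones(N)))), θ)
# would then give the hyperparameter gradient with the variational parameters
# treated as constants.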