JuliaGaussianProcesses / AbstractGPs.jl

Abstract types and methods for Gaussian Processes.
https://juliagaussianprocesses.github.io/AbstractGPs.jl/dev

An abstraction for the realization of a GP #364

Open FelixBenning opened 1 year ago

FelixBenning commented 1 year ago

So in the documentation there is the sentence:

An AbstractGP, f, should be thought of as a distribution over functions. This means that the output of rand(f) would be a real-valued function. It's not usually possible to implement this though, so we don't.

When I first encountered AbstractGPs.jl, I thought that the abstractions you define were simply born out of a different opinion. But reading this sentence, I think this might actually be a fruitful discussion, because I disagree: it is absolutely possible to implement rf = rand(f).

The way you do it is lazily. I mean, you could also not implement f(x) = 2x if that meant you had to calculate f for every possible x right from the start. The way you are in fact able to implement f anyway is by implementing the process to get from x to f(x), and only applying this process once you actually want to know f(x) for a particular x.

Now in my particular case (gradient descent on a random function), I needed to incrementally evaluate this abstract random function rf = rand(f). If I wanted to do this with AbstractGPs.jl, it would look something like this (I am going to try and ignore the fact that AbstractGPs is also not intended for the simulation of gradients and pretend that rf represents the gradient vector field, for simplicity):

# Gradient descent, re-conditioning the GP on all previously sampled values
# (pretending, as discussed above, that the sampled values are gradients)
x0 = zeros(n)
x = [x0]
f = GP(kernel)
grads = rand(f(x))
while true
    push!(x, x[end] + step_size * grads[end])
    append!(grads, rand(posterior(f(x[1:end-1]), grads)(x[end:end])))
    if norm(grads[end]) < eps
        break
    end
end

This is anything but natural. Now in comparison consider this:

x = zeros(n)
rf = rand(GP(kernel))
while true
    grad = rf(x)
    x += step_size * grad
    if norm(grad) < eps
        break
    end
end

The above is essentially what I implemented in https://github.com/FelixBenning/IncrementalGRF.jl, except that I skip the rand(GP(kernel)) step and essentially treat GP(kernel) as the realization of a random function immediately. Well, in my case it is called GaussianRandomFunction(kernel), but that is beside the point. It essentially works like this:

Whenever rf(x) is called, the value rf(x) is simulated based on a stored vector of previous evaluations, to which x is then appended. To make this computationally efficient, I also store the running Cholesky decomposition (which can easily be extended by another row). Since the Cholesky decomposition does the same incremental extension anyway, calculating it incrementally turns out to be no more expensive in terms of complexity. The only implementation issue is allowing for an extendable matrix. Since we are interested in a triangular matrix, we can do that with an extendable vector by simply appending rows, resulting in a packed storage format. Unfortunately this format is only supported by BLAS2 routines and not BLAS3, so at the moment I am losing a little bit of performance (but halving the memory requirements). I want to fix this issue with packed block matrices in the future.
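
To make that concrete, here is a minimal, dense-storage sketch of this kind of lazy evaluation for 1-d inputs and a fixed unit-lengthscale RBF kernel. The type and field names are purely illustrative (they are neither AbstractGPs nor IncrementalGRF API), and unlike the packed layout described above, this sketch reallocates the whole factor on every call:

using LinearAlgebra

k(x, y) = exp(-abs2(x - y) / 2)   # unit-lengthscale RBF kernel

mutable struct LazyPath
    xs::Vector{Float64}   # inputs evaluated so far
    ys::Vector{Float64}   # sampled values at those inputs
    L::Matrix{Float64}    # lower-triangular Cholesky factor of k(xs, xs)
end
LazyPath() = LazyPath(Float64[], Float64[], zeros(0, 0))

function (rf::LazyPath)(x)
    n = length(rf.xs)
    c = [k(x, xi) for xi in rf.xs]
    l = n == 0 ? Float64[] : LowerTriangular(rf.L) \ c        # new row of the factor
    d = sqrt(max(k(x, x) - dot(l, l), 1e-12))                 # conditional std. dev.
    μ = n == 0 ? 0.0 : dot(l, LowerTriangular(rf.L) \ rf.ys)  # conditional mean
    y = μ + d * randn()                                       # sample rf(x) given past evaluations
    rf.L = n == 0 ? fill(d, 1, 1) : [rf.L zeros(n); l' d]     # grow the factor by one row
    push!(rf.xs, x); push!(rf.ys, y)
    return y
end

rf = LazyPath()
rf(0.0), rf(0.1)   # two consistent evaluations of one realization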

But anyway: What do you think about this way to implement random functions? If you wanted to get the conditional expectation instead of new evaluations, you could easily freeze such a random function e.g.

ce = conditionalExpectation(rf)
ce(x) # = E[ rf(x) | rf.(previous_points)]

Oh, and you also get the broadcast operator for free: rf.(evaluation_points) totally makes sense if you want to evaluate rf at multiple points. What do you think about this interface? Are there use cases where this interface is more clunky than the current one of AbstractGPs?

willtebbutt commented 1 year ago

This is a good point -- I agree that the statement that you highlight is not correct.

Before proceeding, I think it's worth noting that you can incrementally update a posterior (I believe we've implemented this in a reasonably performant manner, but if you think there's room for improvement, your assistance would be greatly appreciated!), so your example that uses the current interface can be simplified somewhat:

x = zeros(n)
f = GP(kernel)
while true
    grad = rand(f(x))
    f = posterior(f(x), grad)
    x += step_size * grad
    if norm(grad) < eps
        break
    end
end

My overall thoughts on this kind of thing are that it's very cool, but I'm not convinced that the use case you point out provides an incredibly strong example of where it's useful, because there's a good clean solution with the current interface (albeit I agree that the solution that you propose is a little bit cleaner). @rossviljoen do you have thoughts on this? In terms of the interface, it feels like it addresses identical concerns to what the pathwise sampling stuff does, which I know you were looking at for a bit.

@theogf @st-- do you have thoughts on this kind of thing?

I'm broadly keen to avoid extending the interface further, because it's already quite substantial, but I could definitely be convinced. I think potentially rand(f) makes sense to have, but I'm not convinced by the conditionalExpectation function proposed -- I would want a strong example use case to be convinced of that.

rossviljoen commented 1 year ago

I think rand(::AbstractGP) might be too specific to certain methods to include in the general interface. That is, it works for GP posteriors computed by Cholesky and it works for Bayesian linear regression, but I'm not sure it makes sense for a lot of other implementations (most approximations, although I could be wrong about this).

So, perhaps it's best to just implement rand for specific GPs where it's possible (for example) rather than have it in the API?

I do think this particular implementation would be nice to have for GP though.

st-- commented 1 year ago

There might be scope for encapsulating something like what Will posted within an interface similar to rand(f). But for approximate inference in non-conjugate GP models, there are different ways one might want to handle the approximation to sampling an infinite-dimensional object, and I would be in favour of something that makes that more explicit whilst leaving scope for future extensions.

st-- commented 1 year ago

(The "call" API appears like it ought to be stateful, and one downside of the Cholesky approach is that this is not the case - each subsequent call becomes more expensive - and it feels like this should be an explicit ! function then.)

FelixBenning commented 1 year ago

The conditionalExpectation function was less of an actual implementation suggestion, since you have predict_f already, and more of me trying to explain why I did not see a reason for the intermediate step f = GP() before continuing on to rf = rand(f). I tried to demonstrate that you could still get features like the conditional expectation without this intermediate step. But I guess this is a distraction, since removing this intermediate step is likely not on the table given the effort it would cause.

The fact that the Cholesky approach makes every subsequent call more expensive is unfortunate. But I personally don't think this is a good argument against the callable approach: the problem exists no matter what, and there is no reason to burden yourself with an interface just to constantly remind yourself of it.

different ways one might want to handle the approximation to sampling an infinite-dimensional object, and I would be in favour of something that makes that more explicit whilst leaving scope for future extensions

I only have a passing familiarity with methods other than the Cholesky decomposition for the simulation of random functions, as they all appeared to be unsuitable for the high-dimensional functions I am interested in. But the gist I got from them is that they generally try to simulate the entire rf in one go using a functional basis (I am thinking of the turning bands method or spectral representations, for example). So I would think the interface would often become even more natural. But maybe I misunderstand and "approximations" refers to something else entirely.
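
For concreteness, here is a minimal sketch of that "simulate the whole function in one go" style, using random Fourier features for an RBF kernel with unit lengthscale. The function name is made up for the example and this is not part of AbstractGPs:

using Random

# Draw one approximate sample path via M random Fourier features;
# the returned closure can be evaluated anywhere without growing state.
function sample_rbf_path(rng::AbstractRNG, dim::Int; M::Int=1024)
    Ω = randn(rng, dim, M)      # spectral frequencies ~ N(0, I)
    b = 2π .* rand(rng, M)      # random phases
    w = randn(rng, M)           # feature weights ~ N(0, 1)
    return x -> sqrt(2 / M) * sum(w .* cos.(Ω' * x .+ b))
end

rf = sample_rbf_path(Xoshiro(0), 3)
rf(zeros(3))   # evaluate the fixed realization at any point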

But even if there are different ways to implement this, having many options to handle something can be overwhelming, and a good package should provide reasonable defaults in my opinion. Similar to the way Plots provides you with a default backend but lets you change it, it might be possible to provide backends for this implementation. In this case it might still make sense to implement this as a different package (which then becomes a backend later on). This would also allow it to be more experimental.

For example:

rf = RandomFunction(kernel, backend=Exact.Cholesky())
rf = RandomFunction(mean, kernel, backend=Approximations.SpectralDecomposition())

RandomFunction could use GP under the hood as a mixin. This would allow you to fall back to it whenever you need to, e.g. the GP could be stored in rf.gp and would include the kernel and mean. The state (previous evaluations, Cholesky decomposition, ...) would be stored in separate fields.
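
A rough sketch of what I have in mind; all names and fields are hypothetical, and to keep it short this version rebuilds the posterior from the stored evaluations instead of carrying the incremental Cholesky state:

using AbstractGPs

struct RandomFunction{Tgp<:AbstractGPs.AbstractGP}
    gp::Tgp               # the underlying prior, available as a fallback
    xs::Vector{Float64}   # previously evaluated inputs (1-d for simplicity)
    ys::Vector{Float64}   # sampled values at those inputs
end

RandomFunction(kernel) = RandomFunction(GP(kernel), Float64[], Float64[])

function (rf::RandomFunction)(x::Real)
    post = isempty(rf.xs) ? rf.gp : posterior(rf.gp(rf.xs), rf.ys)
    y = only(rand(post([x])))
    push!(rf.xs, x); push!(rf.ys, y)
    return y
end

rf = RandomFunction(SqExponentialKernel())
rf(0.0), rf(0.5)   # incrementally evaluated realization, prior still available as rf.gp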

FelixBenning commented 1 year ago

you can incrementally update a posterior (I believe we've implemented this in a reasonably performant manner, but if you think there's room for improvement, your assistance would be greatly appreciated!)

So I had a look at update_chol

function update_chol(chol::Cholesky, C12::AbstractMatrix, C22::AbstractMatrix)
    U12 = chol.U' \ C12
    U22 = cholesky(Symmetric(C22 - U12'U12)).U
    return Cholesky([chol.U U12; zero(U12') U22], 'U', 0)
end
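
For reference, a small usage sketch of this helper (it is internal, so I call it with the module prefix; the kernel and inputs are arbitrary examples):

using LinearAlgebra, AbstractGPs

k = SqExponentialKernel()
X1 = randn(5)                                        # existing inputs
x2 = [0.3]                                           # one new input

chol = cholesky(Symmetric(kernelmatrix(k, X1) + 1e-9I))
C12 = kernelmatrix(k, X1, x2)                        # 5×1 cross-covariance block
C22 = kernelmatrix(k, x2) + 1e-9I                    # 1×1 new diagonal block
chol_new = AbstractGPs.update_chol(chol, C12, C22)   # Cholesky of the full 6×6 covariance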

This function avoids the recalculation of U11, so its algorithmic complexity is perfect, but it moves a ton of memory around in

[chol.U U12; zero(U12') U22]

Let us consider U11 to be a 2x2 matrix, for example; with zero-based indices its entries are

u00  u10
u01  u11

i.e. it is stored column-wise, i.e. its memory is u=[u00, u01, u10, u11] with u01=0, as we have an upper triangular matrix. Now let us add one row and column, i.e. U12 is a 2x1 matrix and U22 a 1x1 matrix. Let us denote the entries of U12 as [a00; a01]. Then the final matrix has to be

u00   u10   a00
0     u11   a01
0     0     U22

which in memory is stored as [u00, 0, 0, u10, u11, 0, a00, a01, U22]. That means that essentially all the memory except for the first column (u00, u01) had to be moved. If you evaluate the GP point by point, this matters: moving the entire matrix so far (except for the first column) costs $O(n(n-1)) = O(n^2)$, which is the same order as the calculation of the next column, since the triangular solve for U12 is of complexity $O(n^2)$. But these memory operations are expensive, so I would expect them to carry even more weight in practice.

You can solve this issue by using a packed storage format, i.e. you store the triangular matrix

u00   u10   u20
0     u11   u21
0     0     u22

as [u00, u10, u11, u20, u21, u22]. Adding another column is then simply appending in memory, which should be significantly cheaper, assuming the underlying 1-dimensional array is an extendable vector.
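
A tiny illustration of that layout and of the append operation (the helper names are made up for the example):

# Column j (1-based) of an upper-triangular matrix occupies
# indices j*(j-1)÷2 + 1 : j*(j+1)÷2 of the packed vector.
packed = Float64[]

append_column!(v, col) = append!(v, col)             # growing the factor is a pure append
getentry(v, i, j) = i <= j ? v[j*(j-1)÷2 + i] : 0.0

append_column!(packed, [2.0])                        # 1x1 factor [2.0]
append_column!(packed, [0.5, 1.8])                   # 2x2 factor [2.0 0.5; 0 1.8]
getentry(packed, 1, 2)                               # == 0.5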

An advantage you get for free is that the packed storage format takes half the memory to store a triangular matrix. The disadvantage is that BLAS3 routines do not work, because they achieve their performance by operating on regularly strided (block) columns, which packed storage does not provide.

This problem can be fixed if the entries of our packed storage format are themselves (block) matrices. While I have implemented the former, I haven't finished an implementation of packed block matrices yet.

willtebbutt commented 1 year ago

Thanks for looking into this. I believe that the current implementation is an artifact of needing the code to work nicely in conjunction with Zygote, but I don't think I've looked at it since it was first implemented so I might be misremembering.

FelixBenning commented 1 year ago

@willtebbutt I mean this is a reasonable implementation if you do not expect to incrementally evaluate a random function step by step. And to make this step by step evaluation performant you need a non-standard memory layout. So it isn't bad code. It just doesn't work nicely with incremental evaluation.

I also haven't done benchmarks so far; this is only a theoretical consideration. I was interested in Bayesian optimization on high-dimensional random functions, and this resulted in a form of gradient descent, but with calculable step sizes. To test those out I needed to be able to simulate high-dimensional, differentiable random functions, and it felt easier to write a package myself than to try and fit this non-standard requirement into existing packages. Now I am trying to circle back and see what I actually need and how to merge this into existing packages.

The non-standard memory layout should maybe be its own package; it is essentially a different implementation of an abstract array which works nicely with incremental Cholesky decompositions.

Crown421 commented 1 year ago

I have been looking into this "lazy" evaluation of a GP before; I think the first time I saw it was in this paper.

For me, this falls under "sampling" a GP, and could perhaps be seen as an alternative to decoupled pathwise sampling or to naively sampling on a grid and interpolating, all of which have pros and cons in different scenarios.

I think it would make sense to have some common interface for all these methods to get realizations from an AbstractGP, but perhaps the best place for them would be ApproximateGPs.jl?

FelixBenning commented 1 year ago

I don't think that this belongs in ApproximateGPs.jl, as the lazy evaluation is exact if based on an incremental Cholesky decomposition. Approximations are also possible, to reduce the increasing cost of successive function calls, but they are not necessary.

Crown421 commented 1 year ago

I could see the proposed method based on incremental Cholesky living here, and then approximate methods for sampling the GP (incremental Cholesky with a finite cyclic buffer, decoupled sampling, grid sampling, ...).

I think an interface like rand(::GP, ::SamplingMethod) makes sense to me. The default could be the proposed lazy sampling, such that e.g. rand(gp, LazySampling) returns a function fs, which can be evaluated at locations x. This can then be extended to rand(GP, DecoupledSampling), rand(GP, GridSampling(::Range, ...)), and rand(GP, FinBufferSampling(n)) in ApproximateGPs.
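
To sketch what that dispatch could look like (all type names here are hypothetical, and the lazy variant below simply regrows an AbstractGPs posterior on each call rather than using an incremental Cholesky):

using AbstractGPs, Random

abstract type SamplingMethod end
struct LazySampling <: SamplingMethod end

# rand(rng, gp, method) returns a callable realization of the GP.
function Base.rand(rng::AbstractRNG, gp::AbstractGPs.AbstractGP, ::LazySampling)
    state = Ref{Any}(gp)              # current conditioned process
    return function (x)
        fx = state[]([x])             # FiniteGP at the single input x
        y = only(rand(rng, fx))
        state[] = posterior(fx, [y])  # condition on the new evaluation
        return y
    end
end

fs = rand(Xoshiro(1), GP(SqExponentialKernel()), LazySampling())
fs(0.0), fs(0.1)   # consistent draws from one sample path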