JuliaGaussianProcesses / AbstractGPs.jl

Abstract types and methods for Gaussian Processes.
https://juliagaussianprocesses.github.io/AbstractGPs.jl/dev

Reducing Allocations #156

Open · kaandocal opened this issue 3 years ago

kaandocal commented 3 years ago

Unless I'm missing something in the documentation, the default implementations (at least for PosteriorGP) seem quite allocation-heavy, e.g. when computing the mean/covariance of a GP at a single location as well as at a vector of locations. It would be quite useful to have non-allocating equivalents, especially when querying the GP posterior at a single point, where on the surface no vectors are involved.
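For concreteness, a small reproduction of what I mean (a sketch; exact allocation counts will vary with package and Julia versions):

```julia
using AbstractGPs

f = GP(SqExponentialKernel())
post = posterior(f(rand(20), 0.1), rand(20))
xq = rand(5)

mean(post, xq)               # warm up / compile first
@allocated mean(post, xq)    # nonzero: cross-covariance matrices etc. are materialised
```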

willtebbutt commented 3 years ago

Agreed. My personal preference would be to get people to utilise / further develop things like AutoPreallocation, rather than having to maintain two implementations of basically the same functionality.

willtebbutt commented 3 years ago

Thinking about it, AutoPreallocation might work moderately nicely in our case as-is, and get rid of a lot of memory use.
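Going by its README, usage would look roughly like this (untested with AbstractGPs, so take it as a sketch):

```julia
using AbstractGPs, AutoPreallocation

f = GP(Matern52Kernel())
post = posterior(f(rand(10), 0.1), rand(10))
xq = rand(5)

val, record = record_allocations(mean, post, xq)   # run once, recording every allocation
avoid_allocations(record, mean, post, xq)          # rerun, replaying buffers from the record
```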

kaandocal commented 3 years ago

Sounds like an interesting idea; I hadn't heard of this one before! In the meantime I'll have a look and see if we can reduce allocations manually.

kaandocal commented 3 years ago

At first glance it does seem like doing this manually will require a lot of auxiliary functions, e.g.

```julia
using LinearAlgebra: mul!
using Statistics

function Statistics.mean!(out::AbstractVector, f::PosteriorGP, x::AbstractVector)
    mean!(out, f.prior, x)                                      # in-place prior mean (would also need defining)
    mul!(out, cov(f.prior, x, f.data.x), f.data.α, true, true)  # out .+= K(x, X) * α; still allocates for cov
    return out
end
```

or manual matrix operations (good idea?). Given that a lot of these functions are multiply nested, would AutoPreallocation.jl handle them gracefully?

kaandocal commented 3 years ago

A quick fix might be to store a dummy array of length N_obs (or two) in the GP struct to avoid having to frequently reallocate arrays of the same size.
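Roughly what I have in mind (a hypothetical wrapper, not existing API):

```julia
# Carry a scratch Vector of length N_obs next to the posterior and reuse it
# for K(x, X) * α-style intermediates instead of reallocating on every query.
struct BufferedPosterior{Tp,Tb<:AbstractVector}
    post::Tp    # the wrapped PosteriorGP
    buf::Tb     # scratch buffer, length == number of observations
end

BufferedPosterior(post) = BufferedPosterior(post, similar(post.data.α))
```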

devmotion commented 3 years ago

My experience with this in SciML is that currently the only proper way to solve this explicitly (i.e., without AutoPreallocation) is to have dedicated types and caches for in-place and out-of-place operations. It also becomes very tricky very quickly to ensure that the matrices, vectors, etc. are initialized with the correct types; a lot of fine-tuning and special-casing is needed. ArrayInterface was designed to help with the special-casing and introduces some traits for arrays (e.g., to query mutability).
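For example, the sort of trait query I mean (function names as in ArrayInterface at the time of writing; they have moved between versions, so treat this as a sketch):

```julia
using ArrayInterface, StaticArrays

ArrayInterface.can_setindex(Vector{Float64})        # true: the in-place path is safe
ArrayInterface.can_setindex(SVector{3,Float64})     # false: fall back to the out-of-place path
```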

kaandocal commented 3 years ago

The main computation one has to do is of the form K_xx * K_xy for a single input location y; surely one could add a simple vector or two to the struct? As these are only intermediate results, a Vector might even work (I think)...

devmotion commented 3 years ago

The main problem is to guarantee that 1) you can actually mutate stuff and 2) the element types are correct. In particular, 2) is quite tricky: e.g., what happens if some of the inputs or new data are dual numbers, or if you use number types with units? This caused many bugs in e.g. OrdinaryDiffEq since it is quite difficult to get right. And I doubt that this can be solved generally if e.g. posterior(::PosteriorGP, data) does not restrict the type of data.

Additionally, if you use static arrays you usually don't want to allocate any Vectors, and you want to avoid in-place operations altogether.
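A minimal illustration of pitfall 2):

```julia
using ForwardDiff

# A Float64 cache cannot hold dual numbers, so cached in-place code breaks
# as soon as ForwardDiff is applied.
cache = zeros(3)
g!(cache, x) = (cache .= 2 .* x; sum(cache))

g!(cache, [1.0, 2.0, 3.0])                                # fine
ForwardDiff.gradient(x -> g!(cache, x), [1.0, 2.0, 3.0])  # errors: a Dual cannot convert to Float64
```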

kaandocal commented 3 years ago

I see your point. I'll try to experiment with AutoPreallocation in this case.

kaandocal commented 3 years ago

Another approach would be to avoid vectorisation and write explicit loops using k(x_i, y) instead. Any thoughts?
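Roughly something like this (a hypothetical helper; assumes f.prior is a plain GP so that f.prior.kernel exists, and that kernels are callable as in KernelFunctions):

```julia
# Posterior mean at a single point y, without building any kernel matrices.
function mean_at_point(f::PosteriorGP, y)
    m = only(mean(f.prior, [y]))              # prior mean at y
    for (xi, αi) in zip(f.data.x, f.data.α)
        m += f.prior.kernel(y, xi) * αi       # accumulate k(y, xᵢ) * αᵢ
    end
    return m
end
```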

devmotion commented 3 years ago

Where exactly? It's difficult to say without a concrete example, but in general I don't think it's a good idea to rewrite large parts with explicit loops, since you might then miss optimizations such as BLAS calls or ones for special structures of vectors, matrices, and/or kernels.

kaandocal commented 3 years ago

In places like the code sample above. I'm not quite sure how well Julia can optimise such code; I guess this might be another argument for creating some benchmarks. I will try to upload some BayOpt-related ones soonish.