JuliaGaussianProcesses / Stheno.jl

Probabilistic Programming with Gaussian processes in Julia

Flux integration + Neural Kernel Network #78

willtebbutt opened 4 years ago

willtebbutt commented 4 years ago

Relates to this issue.

A MWE can be found in the examples directory on this branch. It contains a basic demo of

The examples are all packaged nicely in a project, so it should be straightforward to get everything up and running.

The questions now are

@HamletWantToCode what do you think?

HamletWantToCode commented 4 years ago

Hi @willtebbutt, the examples work well, thanks ;)

have I covered all of the functionality of interest to GPFlux?

Not all of it -- I'm still extending it. The next main feature is to make a GP in GPFlux work as a normal neural network layer (I'm currently working on it), which I think may support e.g. Deep Gaussian Processes (Neil Lawrence) and Variational Gaussian Processes (David Blei). (I hope I understand the ideas of these papers correctly.)

does this miss any important interface requirements?

For these two functionalities, I think you have provided enough interfaces; it looks good.

For Flux integration, I personally like the second implementation you provided:

dσ², dg = Zygote.gradient(
    function(σ², g)

        # Manually transform data
        gx = ColVecs(g(x.X))

        # Construct GP and compute marginal likelihood using transformed data
        f = σ² * GP(eq(), GPC())
        fx = f(gx, 0.1)
        return logpdf(fx, y)
    end,
    σ², g,
)

I think it's clearer and more straightforward; this is also the approach used in GPFlux.

HamletWantToCode commented 4 years ago

I found that there are related issues here; this can be realized with the NKN. Also, the sparse GPs mentioned here are planned to be added to GPFlux in the future.

willtebbutt commented 4 years ago

The next main feature is to make a GP in GPFlux work as a normal neural network layer

Could you elaborate a little on what this will look like?

I found that there are related issues here; this can be realized with the NKN. Also, the sparse GPs mentioned here are planned to be added to GPFlux in the future.

I actually don't think that the linear combination of kernels thing is the right way to go about implementing the NKN. As I showed on the branch linked above, I think the right way is probably to have a custom NKN kernel that accepts a collection of primitive kernels, and a Chain (or some other Flux construct) that combines them.
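
Roughly, I have in mind something of this shape (just a sketch, not the code on the branch; the type name, the example kernels, and the entrywise use of the chain are placeholders):

using Stheno, Flux

# Sketch: an NKN kernel that owns a collection of primitive kernels and a Flux Chain
# that combines their values. For a valid kernel the chain's weights would need to be
# constrained (e.g. non-negative), which is glossed over here.
struct NKNKernel{Tp<:Tuple, Tc} <: Stheno.Kernel
    primitives::Tp   # e.g. (EQ(), stretch(EQ(), 0.1))
    chain::Tc        # Flux Chain mapping a vector of kernel values to a single value
end

function Stheno.pw(k::NKNKernel, x::AbstractVector, x′::AbstractVector)
    Ks = map(p -> Stheno.pw(p, x, x′), k.primitives)               # one matrix per primitive
    return map((vals...) -> first(k.chain(collect(vals))), Ks...)  # combine each entry via the chain
end
# ew (and the single-argument pw/ew methods) would be implemented analogously.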

I think it's clearer and more straightforward; this is also the approach used in GPFlux.

Good to know -- this approach just works out of the box, so there's literally no need to provide explicit integration with Flux to make this work -- I just need to implement worked examples :)

HamletWantToCode commented 4 years ago

I think the right way is probably to have a custom NKN kernel that accepts a collection of primitive kernels, and a Chain (or some other Flux construct) that combines them.

I agree with that. In fact, GPFlux implements a NeuralKernelNetwork type to support the NKN, which slightly modifies Flux's Chain:

struct NeuralKernelNetwork{T<:Tuple} <: AbstractKernel
    layers::T
    NeuralKernelNetwork{T}(ls...) where {T} = new{T}(ls)
end

I actually don't think that the linear combination of kernels thing is the right way to go about implementing the NKN

Maybe I didn't make myself clear: here I mean that most composite kernels can be viewed as special cases of the NKN (a linear combination of kernels can also be viewed as an NKN that has only a linear layer, whose weights are the coefficients in front of the kernels). In GPFlux, I use the NKN as the backend for the addition kernel and the product kernel:

const ProductCompositeKernel = NeuralKernelNetwork{Tuple{Primitive, typeof(allProduct)}}
const AddCompositeKernel = NeuralKernelNetwork{Tuple{Primitive, typeof(allSum)}}
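
To make that concrete, here's a tiny hand-rolled check (not GPFlux code; the kernels and weights are arbitrary):

# A linear combination of kernels is exactly an NKN with a single linear layer
# whose weights are the mixing coefficients.
k1(x, y) = exp(-abs2(x - y) / 2)   # EQ-style kernel
k2(x, y) = x * y                   # linear kernel
w = [0.7, 0.3]                     # positive mixing weights

x, x′ = 0.4, 1.9
sum_of_kernels   = w[1] * k1(x, x′) + w[2] * k2(x, x′)
one_linear_layer = sum(w .* [k1(x, x′), k2(x, x′)])
sum_of_kernels ≈ one_linear_layer   # true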

Could you elaborate a little on what this will look like?

This idea comes from the fact that a GP is equivalent to a neural network layer with infinitely many neurons (some constraints are needed on the weights of this layer); this was shown by Neal in 1994, and a proof can be found here in section 2. Give me some time and I will try to provide you with an example this week :)
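
For reference, the argument in a nutshell (my paraphrase of Neal's result, not a quote from the paper): for a single-hidden-layer network

f(x) = b + \sum_{j=1}^{H} v_j \, h(x; u_j), \qquad
b \sim \mathcal{N}(0, \sigma_b^2), \quad
v_j \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_v^2 / H),

the central limit theorem makes any finite collection of outputs jointly Gaussian as H \to \infty, i.e. f is a GP with covariance k(x, x') = \sigma_b^2 + \sigma_v^2 \, \mathbb{E}_u[h(x; u) \, h(x'; u)].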

willtebbutt commented 4 years ago

Maybe I didn't make myself clear: here I mean that most composite kernels can be viewed as special cases of the NKN (a linear combination of kernels can also be viewed as an NKN that has only a linear layer, whose weights are the coefficients in front of the kernels). In GPFlux, I use the NKN as the backend for the addition kernel and the product kernel.

Good points. I wonder whether there are any performance implications associated with this though... hmmm.

This idea comes from the fact that a GP is equivalent to a neural network layer with infinitely many neurons (some constraints are needed on the weights of this layer); this was shown by Neal in 1994, and a proof can be found here in section 2. Give me some time and I will try to provide you with an example this week :)

Ah, I see -- this line of work is slightly different from the Deep GP or variational GP stuff, so I'll be interested to see what you come up with :) It's not clear to me how this will play with the usual Flux way of doing things, in particular how it handles distributions over functions rather than the deterministic objects that Flux works with.

HamletWantToCode commented 4 years ago

Hi @willtebbutt, I have just finished some initial work on the Gaussian process layer we discussed last week; the implementation can be found in this notebook on this branch. I strongly suggest you run the notebook as follows:

  1. git clone git@github.com:HamletWantToCode/GPFlux.jl.git
  2. git checkout develop
  3. In the Julia REPL, add the local GPFlux folder to your environment (see the snippet below)
  4. Then run the notebook
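
For step 3, something along these lines should work (a sketch only; the path is a placeholder):

using Pkg
Pkg.add(path="path/to/GPFlux.jl")        # or, in Pkg mode: add path/to/GPFlux.jl
# Pkg.develop(path="path/to/GPFlux.jl")  # use develop instead if you want to edit the code
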
willtebbutt commented 4 years ago

Ah I see. Looks like a nice API to aim for -- definitely needs some way to perform inference though as you're currently just sampling from the prior each time the function is evaluated.

As regards testing with finite differencing -- you just have to be really careful to use exactly the same seed each time you evaluate the function. In particular, you should deterministically set the seed inside the function.
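
Something like this, purely for illustration (the objective and the seed are placeholders):

using Random

# Fixing the seed inside the function makes every evaluation deterministic, so
# finite-difference and AD gradients can be compared meaningfully.
function stochastic_objective(θ)
    Random.seed!(1234)              # same seed on every call
    return θ^2 + 0.1 * randn()      # the "noise" is now reproducible
end

ϵ = 1e-6
fd_grad = (stochastic_objective(0.5 + ϵ) - stochastic_objective(0.5 - ϵ)) / (2ϵ)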

I've added a Stheno.jl version of this proposal here for reference - the point being that Stheno.jl can do all of this stuff with minimal modification.

HamletWantToCode commented 4 years ago

Excellent, it's great to know that Stheno has built-in support for this :)

Based on our previous discussion, I think we both agree that this Flux integration should include:

  1. Using a neural network for feature extraction, then classifying/regressing on the extracted features with a GP, and training the neural network and the GP jointly (also known as deep kernel learning) -- see the sketch after this list.
  2. A new kernel type, the Neural Kernel Network (NKN); composite kernels can be built on top of the NKN.
  3. Using a GP as a non-parametric layer inside a neural network; this is parameter-efficient, robust to overfitting, and able to propagate uncertainty.
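
For point 1, the pattern looks roughly like this (a sketch assembled from pieces already shown in this thread; the data, network architecture and hyperparameters are placeholders):

using Stheno, Flux, Zygote

# Toy data: 4-dimensional inputs, 50 points.
X = rand(4, 50)
y = sin.(vec(sum(X; dims=1)))

g = Chain(Dense(4, 16, relu), Dense(16, 2))   # neural feature extractor
logσ² = [0.0]                                 # log process variance

# Negative log marginal likelihood of a GP over the extracted features.
function nlml(g, logσ²)
    gx = ColVecs(g(X))                        # transform the data with the network
    f = exp(first(logσ²)) * GP(EQ(), GPC())   # GP on the transformed inputs
    return -logpdf(f(gx, 0.1), y)
end

# Gradients w.r.t. both the network and the GP hyperparameters -- train them jointly.
dg, dlogσ² = Zygote.gradient(nlml, g, logσ²)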

AFAIK, these functionalities aren't yet available in the Julia ecosystem. In Python, GPyTorch supports the first one, and TensorFlow recently added support for the last one in its extension (Pyro may support both). I'd like to have these APIs integrated into Stheno.

willtebbutt commented 4 years ago

Excellent. Let's tackle point number 1 first then, as I think it's likely the most straightforward in the sense that there's no real integration to be done; we just need example code and good documentation. Do you agree / what kinds of resources do you think would be helpful to address this?

HamletWantToCode commented 4 years ago

Do you agree / what kinds of resources do you think would be helpful to address this?

I think reproducing some of the experiments from the deep kernel learning paper is a good starting point; GPyTorch also uses it as a demo. I will be working on this over the next few days.

willtebbutt commented 4 years ago

Sounds good. Probably best to start with a small toy dataset where exact inference is tractable. You could also do something with pseudo-points quite straightforwardly -- see Stheno's elbo function :)
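
Something like this, if I remember the elbo signature correctly (a sketch only -- please check the docstring and adjust):

using Stheno

x = rand(100)
y = sin.(x) .+ 0.1 .* randn(100)
z = collect(range(0.0, 1.0; length=10))   # pseudo-point locations

f = GP(EQ(), GPC())
fx = f(x, 0.1)                            # marginal at x with observation noise 0.1
approx_lml = elbo(fx, y, f(z))            # lower bound on logpdf(fx, y) via pseudo-points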

HamletWantToCode commented 4 years ago

Hi @willtebbutt, I have implemented a simple step-function fitting example here; it uses a feedforward neural network plus a GP with an ARD kernel. I have also written a binary classification example that uses Stheno and Turing. The examples are written in a Jupyter notebook and packaged in a project, so it should be straightforward to get everything up and running.

I noticed that Stheno has a model zoo, but it seems outdated now, so where should these examples go?

HamletWantToCode commented 4 years ago

I have run into a problem when trying to make Stheno's output data type Float32 (in Flux, the default data type for neural network parameters is Float32, which can greatly reduce the computational cost when we have large datasets). Below is an example that reproduces my problem:

using Stheno

X = rand(Float32, 4, 10);
y = rand(Float32, 10);
l = rand(Float32, 4);
σ² = rand(Float32);
kernel = σ²*stretch(EQ(), 1.0f0 ./ l);
K_matrix = pw(kernel, ColVecs(X))  |> eltype    # Float32

gp = GP(0.0f0, kernel, GPC());
noisy_prior = gp(ColVecs(X), 0.01f0);

rand(noisy_prior) |> eltype   # Float64 !!!
logpdf(noisy_prior, y) |> eltype   # Float64 !!!

Though I converted all the input parameters to Float32, the last two lines still give me Float64.

willtebbutt commented 4 years ago

The example looks really good @HamletWantToCode. I'll take a proper look once we're satisfied that the type-stability issues have been resolved.

I'm creating a new examples folder at the minute. I've got a branch with them all on, which I'll aim to get onto master in the next day or so. It'll be best if you just add a new sub-directory in there once it's available.

willtebbutt commented 4 years ago

Right, I would say that the first item has been more-or-less completed.

Are you up for getting on with adding the Neural Kernel Network kernel, @HamletWantToCode?

HamletWantToCode commented 4 years ago

Yeah, here are my ideas about the implementation of the Neural Kernel Network kernel:

  1. The NKN should be a subtype of Stheno's Kernel type, and should have ew & pw interfaces.
  2. Its construction is similar to Flux's Chain; three types of layer can be added to it: a primitive layer (containing basic kernels), a linear layer (superposition of kernels) and a product layer (kernel products). We also have to include activation functions (currently just the exp function).
  3. Its parameters (including kernel hyperparameters & the linear layers' weights and biases) can be efficiently extracted (here I mean we may need a function like Flux's params) and redistributed (this is because the Optim.jl package only lets us write all parameters into a 1D array).

This is an example implementation of the parameter redistribution function; I hope I have made it clear:

function dispatch!(model, xs::AbstractVector)
    loc = 1
    for p in params(model)
        lp = length(p)
        x = reshape(xs[loc:loc+lp-1], size(p))
        copyto!(p, x)
        loc += lp
    end
    model
end
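
A hypothetical usage sketch (the model and the scaling are illustrative only):

using Flux

model = Chain(Dense(4, 8, relu), Dense(8, 1))
θ = vcat([vec(p) for p in Flux.params(model)]...)   # model -> flat 1D vector for Optim.jl
dispatch!(model, 0.9 .* θ)                          # modified flat vector -> back into the model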

What do you think, @willtebbutt?

willtebbutt commented 4 years ago

I think this plan sounds very reasonable.

  1. Completely agree
  2. I wonder whether we could just directly use Flux's Chain type here?
  3. This also sounds completely reasonable. It's hard to know exactly how we want this to look until there's a PR though.

Maybe open a PR with your plan from above and we can discuss further on that?

willtebbutt commented 4 years ago

Again, didn't mean to close...

HamletWantToCode commented 4 years ago

I wonder whether we could just directly use Flux's Chain type here?

I have also considered using it, but there may be some performance-related problems. I'll write something first and then we can discuss it concretely.

Maybe open a PR with your plan from above and we can discuss further on that?

I'll open a PR once I finish it :)

willtebbutt commented 4 years ago

Great job with the Neural Kernel Network implementation @HamletWantToCode. I noticed that none of the activation function layers are currently implemented. Would you be up for adding the basics quickly before we move on to the third item on the list?

HamletWantToCode commented 4 years ago

The reasons I didn't include activation functions in the PR are:

  1. Only a limited class of functions is allowed (currently just polynomials with positive coefficients & the exp function), and these can easily be implemented by users.
  2. (Personal opinion) In practice (for time-series data), I have found that activation functions aren't as useful as they are in a normal neural network (since the kernels already provide the nonlinearity).

Personally, I'd like to include activation functions once more of them are found and proven to be useful.

willtebbutt commented 4 years ago

Fair enough. I'm happy not to worry about them for the time being -- as you say, we can always add them later if the need arises.