JuliaGaussianProcesses / KernelFunctions.jl

Julia package for kernel functions for machine learning
https://juliagaussianprocesses.github.io/KernelFunctions.jl/stable/
MIT License

matrix-valued kernels for structured multi-output #142

Open gszep opened 4 years ago

gszep commented 4 years ago

We would like matrix-valued kernels so that we can learn vector fields! Good starting points would be the divergence-free and curl-free kernels. There exist Python and MATLAB implementations of these. @theogf

devmotion commented 4 years ago

I defined custom matrix-valued kernels in https://github.com/devmotion/CalibrationErrors.jl/ up to version 0.3, but in more recent versions I replaced them with KernelFunctions and tensor product kernels on an extended input space (which also encodes the desired output dimension of the vector-valued functions in the original RKHS of vector-valued functions), since that made it much easier to exploit all existing scalar kernels in KernelFunctions. I'd guess that, depending on your use case, you might be able to use tensor product kernels instead of matrix-valued kernels as well; the disadvantage of this approach is that the domain you're working in might become less "nice".

IMO this "trick" is similar to what @willtebbutt and others have been using for multi-output GPs.
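
To make this concrete, here is a rough sketch in plain Julia (not the package API; the tuple encoding and the helper below are just for illustration): each input is paired with its output index, and an ordinary scalar kernel is evaluated on these extended inputs.

```julia
using KernelFunctions

# Scalar kernels for the original inputs and for the output index.
kx = SqExponentialKernel()   # acts on the original inputs x, y
ki = SqExponentialKernel()   # acts on the output indices s, t (WhiteKernel() would give independent outputs)

# Scalar kernel on the extended domain D x {1, ..., n}:
# k((x, s), (y, t)) = kx(x, y) * ki(s, t)
k_ext((x, s), (y, t)) = kx(x, y) * ki(s, t)

# Encode N inputs with D outputs as N*D extended inputs (x, output_index).
extended_inputs(X, D) = [(x, d) for x in X for d in 1:D]

X = [randn(2) for _ in 1:5]       # 5 inputs in R^2
Z = extended_inputs(X, 3)         # 15 extended inputs for 3 outputs

# The (N*D) x (N*D) kernel matrix of the scalar kernel on the extended domain.
K = [k_ext(zi, zj) for zi in Z, zj in Z]
```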

willtebbutt commented 4 years ago

Yeah -- you can find the KernelFunctions view on this here: https://willtebbutt.github.io/resources/mogp_duality.pdf

It's GP-centric, but I think the point stands. The idea is that matrix-valued kernels / vector-valued functions are really just scalar-valued kernels / functions on an extended domain. This lets us recycle our existing abstractions :)

Crown421 commented 4 years ago

I am very interested in this, as I am currently working on/with matrix kernels, and I would be very happy to help work on this. Matrix kernels are a very rich class though, so I think some thought should go into the design before building them. While all kernels I work with are derived from scalar kernels, I think the fundamental differences come down to how much correlation is assumed between the output variables.

Naively, a matrix kernel K(a, b) is just a function R^L x R^L -> R^{D x D}, for L input dimensions and D output dimensions. That means that, in general, for some data array X with N data points, you get the kernel matrix K(X, X), which has dimension ND x ND.

But more specifically, one can assume all outputs are independent, so the matrix kernel is K(a, b) = diag(k_1(a, b), ..., k_D(a, b)), in which case one should not build (and invert) an ND x ND matrix, but instead work with D separate N x N matrices. This is very common from what I can tell, and I think it is what @devmotion is talking about? More generally, one could assume correlation between output 1 and output 2 while both are independent of output 3, so one could work with a 2N x 2N and an N x N matrix.
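
Roughly, in plain Julia (just a sketch of the bookkeeping, not a proposed API; the kernel choices are arbitrary):

```julia
using KernelFunctions

# One (possibly different) scalar kernel per output dimension.
kernels = [SqExponentialKernel(), Matern32Kernel(), SqExponentialKernel()]

X = ColVecs(randn(2, 10))   # N = 10 inputs in R^2, stored as columns

# D separate N x N kernel matrices: all that is needed if the outputs are
# assumed independent (no ND x ND matrix has to be built or inverted).
Ks = [kernelmatrix(k, X) for k in kernels]

# Only for comparison: the full ND x ND block-diagonal matrix.
K_big = cat(Ks...; dims=(1, 2))
```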

For the divergence-free and curl-free kernels mentioned above, and other interesting matrix kernels, the kernel matrix is fully dense, but the matrix kernel itself is built from a dyadic product and a diagonal kernel, which I think could also be exploited.

Long story short, matrix kernels are super interesting, and the implementation should from the start be designed for a zoo of specialized implementations, and probably also make it easy for users to add new ones.

I am still getting to know this package, so I am not sure how much of the above would be in scope. It might also make sense to use the sub-module structure mentioned by Kristoffer Carlsson in his JuliaCon talk, given that most people only use scalar kernels, I think.

devmotion commented 4 years ago

> This is very common from what I can tell, and I think it is what @devmotion is talking about?

No, it seems that's different from what I was using in my work and suggesting above. The RKHS corresponding to a matrix-valued kernel is a space of vector-valued functions f : D -> R^n. To every such function there exists a corresponding function g : D x {1, ..., n} -> R defined by g(x, i) = f_i(x). Clearly the space of these functions is still an RKHS, but now of scalar-valued functions on the extended domain D x {1, ..., n}. Hence there exists a scalar-valued reproducing kernel k : (D x {1, ..., n}) x (D x {1, ..., n}) -> R with inputs on the extended domain D x {1, ..., n}. This kernel is equivalent to the original matrix-valued kernel, the main difference being the different domain of the inputs. BTW I just noticed this is also mentioned on Wikipedia.
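
For concreteness, the correspondence in code (just a sketch; `scalar_kernel` and `K_mat` are made-up names, not anything in KernelFunctions): the scalar-valued kernel on the extended domain simply picks out the (s, t) entry of the matrix-valued kernel.

```julia
using LinearAlgebra

# k((x, s), (y, t)) = K(x, y)[s, t]: the scalar-valued kernel on the extended domain.
scalar_kernel(K_mat) = (a, b) -> K_mat(a[1], b[1])[a[2], b[2]]

# Example: a diagonal (independent-output) matrix-valued kernel with 3 outputs,
# built from an arbitrary scalar kernel.
k0(x, y) = exp(-sum(abs2, x - y))
K_mat(x, y) = k0(x, y) * Matrix(I, 3, 3)

k = scalar_kernel(K_mat)
k(([1.0, 2.0], 1), ([0.5, 1.5], 1))   # = k0(x, y)
k(([1.0, 2.0], 1), ([0.5, 1.5], 2))   # = 0, i.e. independent outputs
```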

Hence, in contrast to matrix-valued kernels with special structure such as the ones mentioned above, this concept (or "trick") applies to any matrix-valued kernel (and is very similar to viewing multi-output GPs as single-output GPs on an extended domain). Mathematically, nothing is lost by only considering scalar-valued kernels on arbitrary domains. Thus I agree with @willtebbutt that any implementation should probably use the existing abstractions and focus on scalar-valued kernels only. IMO this would help to keep the implementation consistent and the code complexity low, and it would additionally always provide users with a large set of predefined kernels, regardless of whether they work with matrix-valued or scalar-valued kernels. In my work I noticed that in particular the last point is a huge advantage for users (and for me as a developer, since I don't have to reimplement everything from scratch).

A common choice of kernels on these extended domains are tensor product kernels, i.e., kernels of the form k((x1, x2), (y1, y2)) = k_1(x1, y1) x k_2(x2, y2). I added a generic implementation of these kernels to KernelFunctions since I used them. Maybe that's helpful for your work as well. In general, IMO it would be great if we could add support for other kernel structures on product spaces and make it easier for users to construct such kernels.

gszep commented 4 years ago

So let's take the example of the curl-free Gaussian basis kernel.

How would that look? Something like

curlFreeKernel = SqExponentialKernel()*TensorProduct( ? )
devmotion commented 4 years ago

In this case the corresponding scalar-valued kernel is k((x, s), (y, t)) = exp(-|x - y|^2) (delta(s, t) - (x_s - y_s) (x_t - y_t)). It seems it can't be written as a tensor product kernel (only the first part can be written as a product of two functions that depend exclusively on (x, y) and on (s, t), respectively). So it seems that would require a custom implementation.
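
As a plain function this is just the formula above written out (not a proper kernel type; a real implementation would presumably subtype `KernelFunctions.Kernel`):

```julia
# k((x, s), (y, t)) = exp(-|x - y|^2) * (δ(s, t) - (x_s - y_s) * (x_t - y_t))
function curlfree_scalar((x, s), (y, t))
    r = x - y
    return exp(-sum(abs2, r)) * ((s == t) - r[s] * r[t])
end

curlfree_scalar(([1.0, 2.0], 1), ([0.5, 1.5], 2))
```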

In my work this reformulation actually helped to simplify the constructions and led to some (IMO) more intuitive formulations. However, that's not true in general (e.g., Wikipedia also mentions that "this isometry procedure can make both the scalar-valued kernel and the input space too difficult to work with in practice"), so maybe it doesn't help to simplify other theoretical parts of your work.

However, for the implementation in KernelFunctions I think it still makes sense to focus on scalar-valued kernels AND additionally have some user-facing API that makes it easy to work with these extended domains (maybe even providing an interface that hides the scalar-valued implementations and allows users to specify the matrix-valued kernel function directly). Basically similar to the multi-output GPs.

Crown421 commented 4 years ago

My apologies, I know of the feature-space approach to matrix kernels, and I even read that exact Wikipedia page, but somehow missed that section. Thank you for this; I will have to look into the properties of such extended kernels, as I am not clear on what they mean for correlation between components. I agree that for kernels of the type you are suggesting it is very simple to have a single function that constructs the matrix kernel from any given scalar kernel. From my understanding, for a kernel of the form k((x1, x2), (y1, y2)) = k_1(x1, y1) x k_2(x2, y2), you would create the matrix kernel as K(x, y) = (k_1(x, y) x k_2(i, j))_ij? In this case, the user would need to provide two(?) scalar kernels and get out a matrix?

The kernels that I was talking about are the ones mentioned in this paper https://arxiv.org/abs/1807.09111. I think it should be possible to write a nice API to generate arbitrary matrix kernels of this form, as well as arbitrary combinations.

Even for the curl-free kernel, it is basically of the form k(x1, x2) (I - (x1 - x2) (x1 - x2)^T), where they chose k(x1, x2) = SqExponentialKernel(), but I think one could use any scalar kernel instead, giving for example a curl-free Matérn matrix kernel.
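
For illustration, that construction with a pluggable base kernel could look roughly like this (the name `curlfree_matrix` is made up, and whether an arbitrary base kernel actually yields a valid curl-free kernel is exactly the kind of question that would need checking):

```julia
using KernelFunctions
using LinearAlgebra

# K(x1, x2) = k(x1, x2) * (I - (x1 - x2) * (x1 - x2)')  with a pluggable scalar kernel k.
function curlfree_matrix(k, x1, x2)
    r = x1 - x2
    return k(x1, x2) * (I - r * r')
end

x1, x2 = randn(3), randn(3)
curlfree_matrix(SqExponentialKernel(), x1, x2)   # the curl-free kernel discussed above
curlfree_matrix(Matern52Kernel(), x1, x2)        # speculative "curl-free Matérn"
```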