Closed by gpleiss 6 years ago
So one use-case is to have different length-scales for different tasks in a multi-task GP. Do you think it would be easy to implement this under the new proposal?
cc @rajkumarkarthik, @darbour
@Balandat Thought about this a bit. One issue with different lengthscales for different tasks is that it breaks the Kronecker product structure. Right now, exact multi-task GPs take O(n^2 + t^2) space and O(n^2 t) time. To achieve different lengthscales, I don't see an obvious way around constructing the full O(n^2 t^2) covariance matrix. Furthermore, without the Kronecker product, I'm not sure that scalable GP methods can be applied at all (except maybe SKIP). Depending on your problem sizes, it might be better to rely on a small shared neural network to extract appropriately scaled features.
That said, if we do resort to explicitly constructing the `n x d x t` dataset, the lengthscale module makes this easy enough, because you would just construct at most an `n x d x t` lengthscale parameter.
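To make the computational point concrete, here is a numpy sketch (with a toy RBF kernel and a hypothetical task covariance, not GPyTorch code) of why per-task lengthscales break the Kronecker factorization:

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0):
    # Toy 1-D RBF kernel for illustration (not GPyTorch's implementation).
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

n, t = 5, 3
x = np.linspace(0, 1, n)

# Shared lengthscale: store K_data (n x n) and K_task (t x t) separately,
# O(n^2 + t^2) space; the full nt x nt matrix is their Kronecker product.
K_data = rbf(x, x)
K_task = 0.5 * np.eye(t) + 0.5          # hypothetical task covariance
K_full = np.kron(K_data, K_task)        # only formed here for checking

# With a separate lengthscale per task, each task block uses a different
# data kernel, so the nt x nt matrix no longer factors as one Kronecker
# product and must be built (and stored) explicitly: O(n^2 t^2) space.
lengthscales = np.array([0.5, 1.0, 2.0])
K_no_kron = np.zeros((n * t, n * t))
for i in range(t):
    for j in range(t):
        # hypothetical cross-task lengthscale: geometric mean
        ls = np.sqrt(lengthscales[i] * lengthscales[j])
        K_no_kron[i::t, j::t] = K_task[i, j] * rbf(x, x, ls)

print(K_full.shape, K_no_kron.shape)  # (15, 15) (15, 15)
```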
@Balandat - @jacobrgardner and I have a better idea. We'll keep lengthscales inside kernels, but the first step in `__call__` will scale the data by the lengthscale. This should fix all ARD and batch issues. (More detailed explanation and PR coming soon.)
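The idea can be sketched roughly as follows (a minimal numpy sketch with illustrative class names, not GPyTorch's actual API): the base class owns the lengthscale and rescales the inputs once in `__call__`, so subclasses never need lengthscale logic in `forward`:

```python
import numpy as np

class Kernel:
    """Sketch of the proposed convention: the base class owns the
    lengthscale and divides the inputs by it in __call__."""

    def __init__(self, ard_num_dims=1):
        # one lengthscale per input dimension gives ARD for free
        self.lengthscale = np.ones(ard_num_dims)

    def __call__(self, x1, x2):
        # scaling happens once, before the data reaches the kernel body
        return self.forward(x1 / self.lengthscale, x2 / self.lengthscale)

    def forward(self, x1, x2):
        raise NotImplementedError

class RBFKernel(Kernel):
    def forward(self, x1, x2):
        # no lengthscale handling needed here any more
        d2 = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2)

k = RBFKernel(ard_num_dims=2)
k.lengthscale = np.array([0.5, 2.0])   # per-dimension (ARD) lengthscales
K = k(np.random.randn(4, 2), np.random.randn(3, 2))
print(K.shape)  # (4, 3)
```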
Sounds good. Will this work with kernels like the PeriodicKernel?
I think as a very first pass, it would be good to have an implementation of an LCM kernel on GPyTorch. It combines being able to somewhat tailor length scales to tasks while not being computationally burdensome. Thoughts?
Why would an LCM kernel be less computationally burdensome than having task-wise lengthscales? In terms of kernel evaluation, the task-specific lengthscales should not make any difference; only the hyperparameter space would be higher-dimensional in the fitting.
Regardless, having an LCM kernel would be useful, if anything as a baseline to compare against. #261
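For reference, an LCM covariance can be sketched as a sum of Q Kronecker terms, K = sum_q k_q(X, X) kron (a_q a_q^T), so each component q keeps its own lengthscale while every summand retains Kronecker structure. A toy numpy sketch (hyperparameters and the rank-1 task factors are illustrative):

```python
import numpy as np

def rbf(x1, x2, lengthscale):
    # Toy 1-D RBF kernel for illustration.
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

n, t, Q = 4, 3, 2
x = np.linspace(0, 1, n)
lengthscales = [0.5, 2.0]        # one per LCM component, not per task
A = np.random.randn(Q, t)        # a_q: rank-1 task factors (illustrative)

# LCM: sum of Kronecker products, one per component q.
K = sum(np.kron(rbf(x, x, lengthscales[q]), np.outer(A[q], A[q]))
        for q in range(Q))
print(K.shape)  # (12, 12)
```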
Proposal: kernels are not responsible for lengthscales. We can introduce a separate scaling module that divides the data by the lengthscales before feeding the data into the kernel.

Reasoning: we do some batch-dimension hacking to get fast kernel diagonals, as well as fast batch kernels. For kernel diagonals, we transform the `n x d` data into `n x 1 x d` data, which then computes only the kernel diagonals. For additive/multiplicative kernels, we transform the `n x d` data into `d x n x 1` data. There is a problem when we are using an ARD option for kernels, or when we have separate lengthscales for the different batches. If the lengthscale scaling happens before the data enters the kernel, this problem is mitigated.
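The diagonal trick above can be sketched with broadcasting (numpy here rather than torch; the RBF kernel and shapes are illustrative):

```python
import numpy as np

n, d = 5, 3
x = np.random.randn(n, d)
ls = np.array([0.5, 1.0, 2.0])   # ARD lengthscales, one per dimension

# If scaling happens up front, it is a plain elementwise divide and is
# compatible with any later reshaping (full, diag, or d x n x 1 layouts):
x_scaled = x / ls

# Reshape n x d -> n x 1 x d: each "batch" holds a single point, so a
# batched kernel evaluation produces only the k(x_i, x_i) entries.
xb = x_scaled[:, None, :]                      # n x 1 x d
d2_diag = ((xb - xb) ** 2).sum(-1)             # n x 1 squared self-distances
K_diag = np.exp(-0.5 * d2_diag).squeeze(-1)    # diagonal of the RBF kernel

# Compare against the diagonal of the full n x n kernel matrix:
d2_full = ((x_scaled[:, None, :] - x_scaled[None, :, :]) ** 2).sum(-1)
K_full = np.exp(-0.5 * d2_full)
print(np.allclose(K_diag, np.diag(K_full)))  # True
```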
In general - this would introduce a convention that kernels should not define their own parameters (which is already the case with output scales).
Rolling in the change: if we're all on board with this, we will deprecate kernel lengthscales. We will encourage users to use the `lengthscale` module and initialize kernels with `lengthscale=False`. When we're ready for a major release (and remove lengthscales completely from kernels), the `lengthscale=False` kwarg won't be necessary any more.

cc/ @Balandat @darbour @jacobrgardner