Closed by gpleiss 6 years ago
So one use-case is to have different length-scales for different tasks in a multi-task GP. Do you think it would be easy to implement this under the new proposal?
cc @rajkumarkarthik, @darbour
@Balandat Thought about this a bit. One issue with different lengthscales for different tasks is that it breaks the Kronecker product structure. Right now, exact multi-task GPs take O(n^2 + t^2) space and O(n^2 t) time. To achieve different lengthscales, I don't see an obvious way around constructing the full O(n^2 t^2) covariance matrix. Furthermore, without the Kronecker product, I'm not sure that scalable GP methods can be applied at all (except maybe SKIP). Depending on your problem sizes, it might be better to rely on a small shared neural network to extract appropriately scaled features.
That said, if we do resort to explicitly constructing the `n x d x t` dataset, the lengthscale module makes this easy enough, because you would just construct at most an `n x d x t` lengthscale parameter.
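To make the computational point concrete, here is a numpy sketch (with a toy RBF kernel and a hypothetical task covariance, not GPyTorch code) of why per-task lengthscales break the Kronecker factorization:

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0):
    # Toy 1-D RBF kernel for illustration (not GPyTorch's implementation).
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

n, t = 5, 3
x = np.linspace(0, 1, n)

# Shared lengthscale: store K_data (n x n) and K_task (t x t) separately,
# O(n^2 + t^2) space; the full nt x nt matrix is their Kronecker product.
K_data = rbf(x, x)
K_task = 0.5 * np.eye(t) + 0.5          # hypothetical task covariance
K_full = np.kron(K_data, K_task)        # only formed here for checking

# With a separate lengthscale per task, each task block uses a different
# data kernel, so the nt x nt matrix no longer factors as one Kronecker
# product and must be built (and stored) explicitly: O(n^2 t^2) space.
lengthscales = np.array([0.5, 1.0, 2.0])
K_no_kron = np.zeros((n * t, n * t))
for i in range(t):
    for j in range(t):
        # hypothetical cross-task lengthscale: geometric mean
        ls = np.sqrt(lengthscales[i] * lengthscales[j])
        K_no_kron[i::t, j::t] = K_task[i, j] * rbf(x, x, ls)

print(K_full.shape, K_no_kron.shape)  # (15, 15) (15, 15)
```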
@Balandat - @jacobrgardner and I have a better idea. We'll keep lengthscales inside kernels, but the first step in `__call__` will scale the data by the lengthscale. This should fix all ARD and batch issues. (More detailed explanation and PR coming soon.)
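The idea can be sketched roughly as follows (a minimal numpy sketch with illustrative class names, not GPyTorch's actual API): the base class owns the lengthscale and rescales the inputs once in `__call__`, so subclasses never need lengthscale logic in `forward`:

```python
import numpy as np

class Kernel:
    """Sketch of the proposed convention: the base class owns the
    lengthscale and divides the inputs by it in __call__."""

    def __init__(self, ard_num_dims=1):
        # one lengthscale per input dimension gives ARD for free
        self.lengthscale = np.ones(ard_num_dims)

    def __call__(self, x1, x2):
        # scaling happens once, before the data reaches the kernel body
        return self.forward(x1 / self.lengthscale, x2 / self.lengthscale)

    def forward(self, x1, x2):
        raise NotImplementedError

class RBFKernel(Kernel):
    def forward(self, x1, x2):
        # no lengthscale handling needed here any more
        d2 = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2)

k = RBFKernel(ard_num_dims=2)
k.lengthscale = np.array([0.5, 2.0])   # per-dimension (ARD) lengthscales
K = k(np.random.randn(4, 2), np.random.randn(3, 2))
print(K.shape)  # (4, 3)
```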
Sounds good. Will this work with kernels like the PeriodicKernel?
I think as a very first pass, it would be good to have an implementation of an LCM kernel on GPyTorch. It combines being able to somewhat tailor length scales to tasks while not being computationally burdensome. Thoughts?
Why would an LCM kernel be less computationally burdensome than having task-wise lengthscales? In terms of kernel evaluation, the task-specific lengthscales should not make any difference; only the hyperparameter space would be higher-dimensional in the fitting.
Regardless, having an LCM kernel would be useful, if anything as a baseline to compare against. #261
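For reference, an LCM covariance can be sketched as a sum of Q Kronecker terms, K = sum_q k_q(X, X) kron (a_q a_q^T), so each component q keeps its own lengthscale while every summand retains Kronecker structure. A toy numpy sketch (hyperparameters and the rank-1 task factors are illustrative):

```python
import numpy as np

def rbf(x1, x2, lengthscale):
    # Toy 1-D RBF kernel for illustration.
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

n, t, Q = 4, 3, 2
x = np.linspace(0, 1, n)
lengthscales = [0.5, 2.0]        # one per LCM component, not per task
A = np.random.randn(Q, t)        # a_q: rank-1 task factors (illustrative)

# LCM: sum of Kronecker products, one per component q.
K = sum(np.kron(rbf(x, x, lengthscales[q]), np.outer(A[q], A[q]))
        for q in range(Q))
print(K.shape)  # (12, 12)
```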
Proposal: kernels are not responsible for lengthscales. We can introduce a separate scaling module that divides the data by the lengthscales before feeding the data into the kernel.

Reasoning: we do some batch-dimension hacking to get fast kernel diagonals, as well as fast batch kernels. For kernel diagonals, we transform the `n x d` data into `n x 1 x d` data, which then computes only the kernel diagonals. For additive/multiplicative kernels, we transform the `n x d` data into `d x n x 1` data. There is a problem when we are using an ARD option for kernels, or when we have separate lengthscales for the different batches. If the lengthscale scaling happens before the data enters the kernel, this problem is mitigated.
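The diagonal trick above can be sketched with broadcasting (numpy here rather than torch; the RBF kernel and shapes are illustrative):

```python
import numpy as np

n, d = 5, 3
x = np.random.randn(n, d)
ls = np.array([0.5, 1.0, 2.0])   # ARD lengthscales, one per dimension

# If scaling happens up front, it is a plain elementwise divide and is
# compatible with any later reshaping (full, diag, or d x n x 1 layouts):
x_scaled = x / ls

# Reshape n x d -> n x 1 x d: each "batch" holds a single point, so a
# batched kernel evaluation produces only the k(x_i, x_i) entries.
xb = x_scaled[:, None, :]                      # n x 1 x d
d2_diag = ((xb - xb) ** 2).sum(-1)             # n x 1 squared self-distances
K_diag = np.exp(-0.5 * d2_diag).squeeze(-1)    # diagonal of the RBF kernel

# Compare against the diagonal of the full n x n kernel matrix:
d2_full = ((x_scaled[:, None, :] - x_scaled[None, :, :]) ** 2).sum(-1)
K_full = np.exp(-0.5 * d2_full)
print(np.allclose(K_diag, np.diag(K_full)))  # True
```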
In general - this would introduce a convention that kernels should not define their own parameters (which is already the case with output scales).
Rolling in the change: if we're all on board with this, we will deprecate kernel lengthscales. We will encourage users to use the `lengthscale` module and initialize kernels with `lengthscale=False`. When we're ready for a major release (and remove lengthscales completely from kernels), the `lengthscale=False` kwarg won't be necessary any more.

cc/ @Balandat @darbour @jacobrgardner