cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch
MIT License

[Docs] Questions about multitask models #877

Open sirBayesian opened 4 years ago

sirBayesian commented 4 years ago

📚 Documentation/Examples

Dear development team,

I am trying to use GPyTorch for two distinct research projects, both dealing with multitask GP modeling.

1) Based on your example on Hadamard multitask GPs (https://github.com/cornellius-gp/gpytorch/blob/master/examples/03_Multitask_GP_Regression/Hadamard_Multitask_GP_Regression.ipynb), I have succeeded in implementing a multitask GP with the same likelihood noise variance for all outputs; a minimal sketch of this setup is included right after point 2) below. My question now is: how do I define a distinct likelihood noise variance for each output? In my case, I am considering experiments conducted under different conditions at different labs, so these may have different noise levels.

2) My basis for the second project is your example on Kronecker multitask GPs (https://github.com/cornellius-gp/gpytorch/blob/e50b9878a090b41eaf38f371cab2b938d5b2ebbc/examples/03_Multitask_GP_Regression/Multitask_GP_Regression.ipynb). The example shows how to get a diagonal likelihood noise covariance matrix using the class “MultitaskGaussianLikelihood”. The likelihood covariance in this case is parameterized by a scalar “likelihood.raw_noise” and a vector “likelihood.noise_covar.raw_noise”. My question here is how the covariance is parameterized: I presume that “likelihood.noise_covar.raw_noise” is the diagonal of the matrix, but how does “likelihood.raw_noise” relate to it? In my case, I have 4 outputs and 14 inputs, and the corresponding learnt parameters (after transformation to the output scale) are:

- “likelihood.raw_noise”: tensor([0.0270], grad_fn=)
- “likelihood.noise_covar.raw_noise”: tensor([0.0275, 0.0387, 0.0177, 0.1554], grad_fn=)
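For reference, here is a minimal sketch of the setup I mean in 1), closely following the linked Hadamard tutorial (the toy data, dimensions, and `num_tasks=2` are placeholders only):

```python
import torch
import gpytorch


class HadamardMultitaskGPModel(gpytorch.models.ExactGP):
    """Hadamard multitask GP: each observation carries an input x and a task index i."""

    def __init__(self, train_x, train_i, train_y, likelihood, num_tasks):
        super().__init__((train_x, train_i), train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.RBFKernel()
        # IndexKernel models the task-task covariance.
        self.task_covar_module = gpytorch.kernels.IndexKernel(num_tasks=num_tasks, rank=1)

    def forward(self, x, i):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)        # input covariance
        covar_i = self.task_covar_module(i)   # task covariance from the task indices
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x.mul(covar_i))


# Placeholder toy data: 20 points, 2 tasks, task indices as a separate long tensor.
train_x = torch.rand(20, 1)
train_i = torch.randint(0, 2, (20, 1))
train_y = torch.sin(6 * train_x.squeeze()) + 0.1 * torch.randn(20)

# A single GaussianLikelihood => one scalar noise variance shared by all tasks.
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = HadamardMultitaskGPModel(train_x, train_i, train_y, likelihood, num_tasks=2)
```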

I really like your software and would like to continue using it, also for NN modeling.

Thank you very much.

Best regards, Sebastian

jacobrgardner commented 4 years ago

@sirBayesian --

For 1, note that IndexKernel learns both raw_var and covar_factor parameters, so that the final task covariance matrix is covar_factor * covar_factor.t() + raw_var * I. The raw_var term here in particular effectively ends up being a noise variance for each output (e.g., the Kronecker product with the diagonal matrix or the Hadamard product with the subindexed matrix will result in this).

Effectively, the likelihood learns noise common to all tasks, and the raw_var term learns any "additional" noise.
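For illustration, a minimal sketch (assuming 4 tasks and rank 1) of how this task covariance matrix is assembled from the learned parameters:

```python
import torch
from gpytorch.kernels import IndexKernel

# Assumed setup: 4 tasks, rank-1 cross-task factor.
task_kernel = IndexKernel(num_tasks=4, rank=1)

B = task_kernel.covar_factor   # shape (num_tasks, rank), the low-rank factor
v = task_kernel.var            # shape (num_tasks,), positive transform of raw_var

# Task covariance = B B^T + diag(v): diag(v) acts as a per-task "extra" variance,
# while the GaussianLikelihood's noise is a single term shared by every task.
task_covar = B @ B.transpose(-1, -2) + torch.diag_embed(v)
print(task_covar.shape)  # torch.Size([4, 4])
```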

For your second point, I think the likelihood.raw_noise might be a mistaken holdover from before the noise covars were introduced -- i.e., it should not exist, and this is a bug.

sirBayesian commented 4 years ago

@jacobrgardner

Thank you for your immediate response!

I just have one follow-up question. In the implementations of both case 1 and case 2 above, an IndexKernel is used to account for the dependency between tasks, and both cases can be cast in the Hadamard form (case 1) by using a copy of the input for each output. How come, then, the likelihood of case 1 uses only a scalar noise variance to account for the noise common to all tasks, whereas case 2 uses a vector of diagonal terms (or potentially a full covariance matrix)?

Thanks again.

Best regards, Sebastian

gpleiss commented 4 years ago

@sirBayesian - the likelihood in the Hadamard case is unaware of the task that each point belongs to. Therefore, it must apply the same noise to all samples.

TobyBoyne commented 4 months ago

Hi, sorry to open up an old issue, but I just wanted to carry on this discussion.

> Effectively, the likelihood learns noise common to all tasks, and the raw_var term learns any "additional" noise.

Is this true? The covariance of our prediction is

$$K_{xx} + \sigma_y^2 I, \quad \text{where } K_{xx} = K_\text{inputs} \times K_\text{tasks}$$

We want to learn a task-dependent noise. However, the raw_var term in your expression isn't a noise term, as it isn't independent of $K_\text{inputs}$. For example, if we had a point very far away from any point in the training data, such that $k(x, x_i) = 0$, then $K_{xx}$ would depend only on $\sigma_y^2$, not on the terms in raw_var. We could have raw_var == [10_000.0, ...], and we still wouldn't have any of that noise propagate to the covariance.
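To spell that out, writing $B$ for covar_factor and $v$ for the (transformed) raw_var, the joint covariance of two observations in the Hadamard model is

$$\operatorname{Cov}[y_i, y_j] = k_\text{inputs}(x_i, x_j)\,\big(BB^\top + \operatorname{diag}(v)\big)_{t_i t_j} + \sigma_y^2\,\delta_{ij},$$

so the $\operatorname{diag}(v)$ term is always scaled by $k_\text{inputs}$, whereas a true per-task noise would enter additively as $\sigma_{t_i}^2\,\delta_{ij}$, independently of the inputs.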

I don't think that there currently is support for task-dependent noise. Please correct me if I'm wrong!

gpleiss commented 4 months ago

There is. See https://docs.gpytorch.ai/en/latest/likelihoods.html#multi-dimensional-likelihoods (the task_noise and related parameters).
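For example, something along these lines (a sketch assuming 4 tasks; argument names as in recent releases, so check them against your installed version):

```python
from gpytorch.likelihoods import MultitaskGaussianLikelihood

# Per-task (diagonal, rank=0) noise in addition to an optional global noise term.
likelihood = MultitaskGaussianLikelihood(
    num_tasks=4,
    rank=0,                 # rank-0 task noise => independent noise per task
    has_global_noise=True,  # keep a scalar noise shared across all tasks
    has_task_noise=True,    # learn an additional noise term for each task
)
print(likelihood.task_noises)  # per-task noise variances (positive-transformed)
```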

TobyBoyne commented 4 months ago

Thanks for the reply!

Does that work for Hadamard multitask GPs? I was under the impression that, since the likelihood doesn't know which task each point belongs to, the multi-dimensional likelihoods wouldn't work. And since the IndexKernel can't learn noise (as discussed in my last comment), there is no way to learn task-dependent noise in the Hadamard case.

gpleiss commented 4 months ago

Ah my bad! Right now there isn't a way. We would be open to a PR!

TobyBoyne commented 4 months ago

I have opened a PR. Would be curious to hear your thoughts on the implementation/interface!