ishank-juneja opened this issue 3 years ago
This is actually not quite a bug, but rather seems to be caused by your usage of fast_pred_var in the inference step, which isn't strictly necessary here. If I remove that flag (as well as removing the effect of the likelihood), then I get sensible outputs:
ux1 = train_x[1:2, :]
ux2 = train_x[1:2, :] + shift
# Performing inference on the training data points themselves
with torch.no_grad():
    # Get distributions of type multivariate normal
    # prediction_dist1 = likelihood(*model(ux1, ux1))
    # prediction_dist2 = likelihood(*model(ux2, ux2))
    prediction_dist1 = model(ux1, ux1)
    prediction_dist2 = model(ux2, ux2)
    # Get distribution of type multitask multivariate normal
    # prediction_dist3 = MTlikelihood(MTmodel(ux1))
    # prediction_dist4 = MTlikelihood(MTmodel(ux2))
    prediction_dist3 = MTmodel(ux1)
    prediction_dist4 = MTmodel(ux2)
MT-Model Mean and Variance on a Train Point
mean: [[-0.6976632 0.4489181]]
vars:
[[1.15394592e-04 1.92224979e-06]
[1.93342566e-06 1.18136406e-04]]
------
MT-Model Mean and Variance Nearby a Train Point
mean: [[-0.65350515 0.5149511 ]]
vars:
[[1.10507011e-04 1.90734863e-06]
[1.90734863e-06 1.13129616e-04]]
------
Actual Data Point (True Label)
tensor([[-0.6583, 0.4381]])
If you're wondering why exactly this fixes it, here's what I was able to trace down:
Numerically, what was happening is that the second term in the posterior covariance was dropping to zero due to something happening in the Lanczos decomposition, i.e.

$(K_{x_\ast X} \otimes K_{TT})\,(K_{XX} \otimes K_{TT} + \Sigma)^{-1}\,(K_{X x_\ast} \otimes K_{TT}) \approx 0.$

Then the predictive variance was determined entirely by the first term involving the inter-task covariance matrix, which is exactly what the reported predictive variances were matching.
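For reference, the posterior covariance being referred to here is the standard multitask-GP expression (written with $x_\ast$ a test point and $K_{TT}$ the inter-task covariance; this is the textbook form, not quoted from the GPyTorch source):

$$\operatorname{Cov}[f_\ast \mid X, y] \;=\; K_{x_\ast x_\ast} \otimes K_{TT} \;-\; (K_{x_\ast X} \otimes K_{TT})\,(K_{XX} \otimes K_{TT} + \Sigma)^{-1}\,(K_{X x_\ast} \otimes K_{TT}).$$

If the second, data-dependent term is approximated as zero, only the prior term $K_{x_\ast x_\ast} \otimes K_{TT}$ remains, i.e. a scaled copy of the inter-task covariance matrix.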
I'll double-check fast_pred_var to see why these predictive variances were so far off and whether there's a bug there, but in the meantime it's probably best not to use fast_pred_var if you don't have that many test points.
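For concreteness, here is a minimal sketch of the two evaluation modes (it assumes model, likelihood, and a test_x tensor as in the notebook; fast_pred_var is opt-in, so simply not entering the context manager gives exact predictive variances):

```python
import torch
import gpytorch

model.eval()
likelihood.eval()

# Exact predictive variances: don't enter the fast_pred_var context at all.
with torch.no_grad():
    exact_pred = likelihood(model(test_x))

# LOVE-based fast predictive variances: opt in explicitly.
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    fast_pred = likelihood(model(test_x))

# The two should agree closely; large discrepancies point at the fast approximation.
print(exact_pred.variance)
print(fast_pred.variance)
```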
I agree that fast_pred_var need not be used here, but what would be the justification for not using the likelihood? Don't we need the likelihood to get the right posterior distribution over the GP's prediction?
That was mostly a tool for debugging what's going on; you ought to be able to use either here.
In general, it depends on whether you want the posterior over the latent function given the data, or the posterior over the function values given the data. They'll only differ by the size of the variance (for Gaussian observations).
I am not sure if #864 is related, but I thought it was worth mentioning here since it too has to do with incorrect variances from the use of gpytorch.settings.fast_pred_var().
> That was mostly a tool for debugging what's going on; you ought to be able to use either here. In general, it depends on whether you want the posterior over the latent function given the data, or the posterior over the function values given the data. They'll only differ by the size of the variance (for Gaussian observations).
I understand this now, assuming that by "function values" you mean observations y = f(x) + eps.
Yes, that's the correct understanding.
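To make that difference concrete, here is a small sketch (assuming a single-task ExactGP named model with a GaussianLikelihood named likelihood and a test_x tensor; for a Gaussian likelihood, the observed-value posterior simply adds the learned noise to the latent posterior variance):

```python
import torch

model.eval()
likelihood.eval()

with torch.no_grad():
    latent_post = model(test_x)              # p(f(x_*) | data): posterior over the latent function
    observed_post = likelihood(latent_post)  # p(y_* | data): latent posterior plus observation noise

# Means agree; variances differ by the learned noise term.
print(torch.allclose(latent_post.mean, observed_post.mean))
print(observed_post.variance - latent_post.variance)  # roughly likelihood.noise at every test point
```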
@wjmaddox thank you for your kind help so far. I wanted to follow up about the MultiTask kernel issues I have been experiencing. I need to model correlation between the two (or, later, more) outputs, and the MultiTask kernel seems like the way to go about doing that.
Here is an updated version of the colab notebook I had shared earlier. I have now removed the fast predictive variance flag, and, as you pointed out in your initial reply on this thread, the variances of the distribution of the latent function given the data (i.e. without the likelihood) are quite similar for both the Multi-Task (MT) kernel and the Independent-Model-List (IML) models.
However, I am still not sure whether I am doing/understanding everything correctly about the usage of the MultiTask kernel, because the following facts about the trained models confuse me (here model is the IML and MTmodel uses the MT kernel):
print("- - - - - - - - - \nModel 1a (IML)\n- - - - - - - - - ")
print("Learned Noise Covariance")
print(model.models[0].likelihood.noise_covar.noise)
- - - - - - - - -
Model 1a
- - - - - - - - -
Learned Noise Covariance
tensor([0.0055], grad_fn=<AddBackward0>)
print("- - - - - - - - - \nModel 1b (IML)\n- - - - - - - - - ")
print("Learned Noise Covariance")
print(model.models[1].likelihood.noise_covar.noise)
- - - - - - - - -
Model 1b
- - - - - - - - -
Learned Noise Covariance
tensor([0.0055], grad_fn=<AddBackward0>)
print("- - - - - - - - - \nModel 2 (MultiTask=MT)\n- - - - - - - - - ")
print("Learned Noise Covariance")
- - - - - - - - -
Model 2 (MultiTask=MT)
- - - - - - - - -
Learned Noise Covariance
tensor([0.0054], grad_fn=<AddBackward0>)
Why are the learned noise variances (sigma_n^2) for the IML and MT-kernel model likelihoods so similar ([0.0055, 0.0055] and 0.0054 respectively), while the effect of passing the model predictions (i.e. the latent function) through the two likelihoods (a Gaussian likelihood list in the case of the IML, and a multitask Gaussian likelihood in the case of the MT-kernel model) is so different?
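One way to probe this (a diagnostic sketch, not an answer; it assumes the model and MTmodel objects from the notebook) is to dump every noise-related parameter of both likelihoods rather than only .noise, since a multitask Gaussian likelihood can carry per-task or inter-task noise parameters in addition to its global noise term:

```python
# Dump all likelihood parameters of the multitask model.
for name, param in MTmodel.likelihood.named_parameters():
    print(f"MTmodel.likelihood.{name}:", param.detach())

# Dump all likelihood parameters of each independent sub-model.
for i, sub_model in enumerate(model.models):
    for name, param in sub_model.likelihood.named_parameters():
        print(f"model.models[{i}].likelihood.{name}:", param.detach())
```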
🐛 Bug
On training a MultiTask-kernel-based model and a collection of independent models tied together in an IndependentModelList object on the same dataset, I see variance magnitudes that are orders of magnitude different. It is unclear why this is the case, since the model parameters common to the two learnt models (the MultiTask model MTmodel and the independent model list model) seem to be quite similar.

To reproduce
Code snippet to reproduce
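(The actual snippet is not reproduced above; the following is a rough reconstruction of the kind of setup being compared. The class names, the two tasks, and the RBF base kernel are assumptions of mine, not necessarily what the notebook uses.)

```python
import torch
import gpytorch

# Multitask model: one GP with a MultitaskKernel coupling the two outputs.
class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, num_tasks=2):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.MultitaskMean(gpytorch.means.ConstantMean(), num_tasks=num_tasks)
        self.covar_module = gpytorch.kernels.MultitaskKernel(
            gpytorch.kernels.RBFKernel(), num_tasks=num_tasks, rank=1
        )

    def forward(self, x):
        return gpytorch.distributions.MultitaskMultivariateNormal(self.mean_module(x), self.covar_module(x))


# Independent model list: one single-output GP per task, no inter-task coupling.
class SingleTaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(self.mean_module(x), self.covar_module(x))


# train_x and train_y (shape n x 2 for two tasks) are assumed to exist as in the notebook.
MTlikelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2)
MTmodel = MultitaskGPModel(train_x, train_y, MTlikelihood)

likelihoods = [gpytorch.likelihoods.GaussianLikelihood() for _ in range(2)]
models = [SingleTaskGPModel(train_x, train_y[:, i], lik) for i, lik in enumerate(likelihoods)]
model = gpytorch.models.IndependentModelList(*models)
likelihood = gpytorch.likelihoods.LikelihoodList(*likelihoods)
```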
Stack trace/error message
Expected Behavior
The covariance matrix of the posterior obtained from the MultiTask kernel model is strangely frozen at
[[0.56116694, 0.04254041],
 [0.04254041, 0.6265246 ]]
for both the train data point and a shifted version of it.
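This "frozen" matrix is consistent with the diagnosis above: if the data-dependent second term of the posterior covariance is driven to (approximately) zero, then at any single test point $x_\ast$ the predictive covariance reduces to the prior term

$$\operatorname{Cov}[f_\ast] \;\approx\; k(x_\ast, x_\ast)\, K_{TT},$$

which, for a stationary data kernel, is the same scaled copy of the inter-task covariance matrix $K_{TT}$ regardless of where $x_\ast$ lies.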
I find two problems with the covariance matrix obtained from the MultiTask version.
System information
Additional context
Colab notebook version: https://colab.research.google.com/drive/1OalLncVeGtNHh-DqjnkScfy46uTtNud_?usp=sharing