cmlakhan opened this issue 5 years ago
Kind of. The scale kernel's "signal variance" relative to the likelihood's "noise variance" can be viewed as a signal-to-noise ratio: the higher the signal variance is relative to the noise, the more the GP is actually trying to "fit" variations in the data rather than explain them away as noise. E.g., if noise >> signal, you'll typically end up with a very flat mean function and constant variances, but if signal >> noise, the GP is actually trying to fit the variations.

When you have multiple kernels involved, things get a little messier, but you can still think of it generally in terms of "total signal variance" compared to noise, and kernel components with higher signal variances contribute more to the model fit.

I believe there are some theoretical discussions of these properties in the literature if you want something more formal. Does that make sense?
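For concreteness, a minimal sketch of reading this ratio off a trained model (the toy data, model, and training loop below are illustrative stand-ins, not code from this issue):

```python
import torch
import gpytorch

# Toy data: a noisy sine wave (purely illustrative).
train_x = torch.linspace(0, 1, 100)
train_y = torch.sin(train_x * 6.28) + 0.1 * torch.randn(train_x.size(0))

class ToyGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # ScaleKernel's outputscale is the "signal variance" discussed above.
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ToyGPModel(train_x, train_y, likelihood)
model.train()
likelihood.train()

mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(100):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()

# Signal variance relative to noise variance: the ratio described above.
snr = model.covar_module.outputscale.item() / likelihood.noise.item()
print(f"signal variance / noise variance = {snr:.3f}")
```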
Your explanation makes intuitive sense. In the statistical genetics literature there is a notion of heritability, which is the ratio of the variance explained by a genetic 'kernel' matrix to the sum of the 'noise variance' and the variance explained by the genetic kernel. This sounds similar to the measure you are describing. I am hoping to extend this to include multiple kernel matrices based on different sets of features. There have been folks in the statistical genetics community who have used Gaussian process regression to quantify this measure in the way you seem to be describing, but I will need to read more to see if these match up precisely. Papers for reference below:
https://www.biorxiv.org/content/10.1101/040576v1
https://www.pnas.org/content/113/27/7377/
If there are any papers from the ML perspective that you would suggest on this topic, that would be most appreciated. Thanks so much for your help!
Sorry, also, on a slightly related note: when I try to sum linear kernels over different subsets of data dimensions, as follows:
```python
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, n_devices):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # One linear kernel per block of features.
        self.covar_module_1 = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.LinearKernel(num_dimensions=3, active_dims=[0, 1, 2])
        )
        self.covar_module_2 = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.LinearKernel(num_dimensions=3, active_dims=[3, 4, 5])
        )
        base_covar_module = self.covar_module_1 + self.covar_module_2
        # output_device is assumed to be defined globally, as in the multi-GPU example.
        self.covar_module = gpytorch.kernels.MultiDeviceKernel(
            base_covar_module, device_ids=range(n_devices),
            output_device=output_device
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
```
Using the protein dataset I get the same scale parameters for each kernel. Is there a reason for this? I have seen some discussion about this but am a little unclear as to why this happens. Perhaps I am missing something conceptually about the way these kernels are constructed.
```python
print('noise: %.3f \n '
      'linear kernel 1 scale: %.3f \n '
      'linear kernel 1 variance: %.3f \n '
      'linear kernel 2 scale: %.3f \n '
      'linear kernel 2 variance: %.3f' %
      (model.likelihood.noise.item(),
       model.covar_module_1.outputscale.item(),
       model.covar_module_1.base_kernel.variance.item(),
       model.covar_module_2.outputscale.item(),
       model.covar_module_2.base_kernel.variance.item()))
```
```
noise: 0.719
linear kernel 1 scale: 0.693
linear kernel 1 variance: 0.693
linear kernel 2 scale: 0.693
linear kernel 2 variance: 0.693
```
You're running into a few simple algebraic problems. First, since the linear kernel is defined to be

$$v XX^{\top},$$

the scale kernel doesn't add value, because it gives you $s \cdot v \, XX^{\top}$, which is an equivalent kernel (e.g., you can always set $v$ to achieve the same result as $s \cdot v$, since both $s$ and $v$ are free parameters).

Second, the sum of linear kernels over the same inputs,

$$v_{1} XX^{\top} + v_{2} XX^{\top} = (v_{1} + v_{2}) XX^{\top},$$

is itself a linear kernel with variance $v_{1} + v_{2}$. This gives you the same problem: for any setting of $v_{1}$ and $v_{2}$, a single linear kernel with variance $v_{1} + v_{2}$ achieves exactly the same kernel.
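A quick numerical illustration of this degeneracy (a sketch; the variance values here are arbitrary):

```python
import torch
import gpytorch

x = torch.randn(5, 3)

# Two linear kernels over the same inputs, with variances v1 and v2.
k1 = gpytorch.kernels.LinearKernel()
k2 = gpytorch.kernels.LinearKernel()
k1.variance = 0.7
k2.variance = 0.3

# A single linear kernel with variance v1 + v2.
k_sum = gpytorch.kernels.LinearKernel()
k_sum.variance = 1.0

# The summed kernel matrices match the single-kernel matrix, so the two
# parameterizations are indistinguishable to the model.
K_two = k1(x).evaluate() + k2(x).evaluate()
K_one = k_sum(x).evaluate()
print(torch.allclose(K_two, K_one))  # True
```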
Thanks for pointing out the scale issue (I will correct that), but in my case I also want a kernel of the form

$$v_{1} X_1 X_1^{\top} + v_{2} X_2 X_2^{\top},$$

where $X_1$ corresponds to columns 0-2 and $X_2$ corresponds to columns 3-5 of the full $X$ matrix.
I would have thought the terms

`gpytorch.kernels.LinearKernel(num_dimensions=3, active_dims=[0, 1, 2])`

and

`gpytorch.kernels.LinearKernel(num_dimensions=3, active_dims=[3, 4, 5])`

would correspond to the $X_1$ and $X_2$ terms; is that not correct? Or does the LinearKernel only work with the entire $X$ matrix?
@jacobrgardner for the linear kernel, is it possible to choose a subset of columns? `gpytorch.kernels.LinearKernel(num_dimensions=3, active_dims=[3, 4, 5])` did not seem to work.
@cmlakhan active_dims should work for the linear kernel, and this is fairly well unit tested.
For what code exactly is this failing?
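One way to sanity-check this in isolation (a sketch, independent of the multi-GPU wrapper; the tensor shapes are illustrative) is to compare a kernel using active_dims against the same kernel applied to manually sliced columns:

```python
import torch
import gpytorch

x = torch.randn(10, 6)

# Kernel restricted to columns 3-5 via active_dims.
k_active = gpytorch.kernels.LinearKernel(active_dims=[3, 4, 5])

# The same kernel applied to manually sliced columns.
k_manual = gpytorch.kernels.LinearKernel()

K_active = k_active(x).evaluate()
K_manual = k_manual(x[:, 3:6]).evaluate()
print(torch.allclose(K_active, K_manual))  # True if active_dims is working
```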
@jacobrgardner So that was based on this code:
```python
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, n_devices):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module_1 = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.LinearKernel(num_dimensions=3, active_dims=[0, 1, 2])
        )
        self.covar_module_2 = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.LinearKernel(num_dimensions=3, active_dims=[3, 4, 5])
        )
        base_covar_module = self.covar_module_1 + self.covar_module_2
        # output_device is assumed to be defined globally, as in the multi-GPU example.
        self.covar_module = gpytorch.kernels.MultiDeviceKernel(
            base_covar_module, device_ids=range(n_devices),
            output_device=output_device
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
```
but I will remove the scale part as you suggest and re-run. In the above case I am getting the same variance parameter for each kernel, which is why I am confused.
```python
print('noise: %.3f \n '
      'linear kernel 1 scale: %.3f \n '
      'linear kernel 1 variance: %.3f \n '
      'linear kernel 2 scale: %.3f \n '
      'linear kernel 2 variance: %.3f' %
      (model.likelihood.noise.item(),
       model.covar_module_1.outputscale.item(),
       model.covar_module_1.base_kernel.variance.item(),
       model.covar_module_2.outputscale.item(),
       model.covar_module_2.base_kernel.variance.item()))
```
```
noise: 0.719
linear kernel 1 scale: 0.693
linear kernel 1 variance: 0.693
linear kernel 2 scale: 0.693
linear kernel 2 variance: 0.693
```
@cmlakhan Okay, I understand now. Sorry, I missed the part where your original snippet had different active_dims for the kernels!

Let me take a look. At a glance, all of your hyperparameters appear to be set to just the initialization (e.g., 0.693 is the default initialization for most parameters because it is $\mathrm{softplus}(0) = \ln 2 \approx 0.693$) -- have you trained the parameters on any data at all?
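For reference, a quick check of that default (this just evaluates the standard softplus transform; nothing model-specific is assumed):

```python
import torch
import torch.nn.functional as F

# Raw hyperparameters default to 0; the softplus constraint maps them to
# softplus(0) = ln(2) ≈ 0.693, which is why every value above reads 0.693.
print(F.softplus(torch.tensor(0.0)))  # tensor(0.6931)
```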
I thought I did! I am using the following training code, based on your examples.
```python
def train(train_x, train_y, n_devices, output_device,
          checkpoint_size, preconditioner_size, n_training_iter):
    likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
    model = ExactGPModel(train_x, train_y, likelihood, n_devices).to(output_device)
    model.double()
    model.train()
    likelihood.train()

    # FullBatchLBFGS comes from the PyTorch-LBFGS package used in the
    # GPyTorch multi-GPU example.
    optimizer = FullBatchLBFGS(model.parameters(), lr=0.1)

    # "Loss" for GPs - the marginal log likelihood
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

    with gpytorch.beta_features.checkpoint_kernel(checkpoint_size), gpytorch.settings.max_preconditioner_size(preconditioner_size):

        def closure():
            optimizer.zero_grad()
            output = model(train_x)
            loss = -mll(output, train_y)
            return loss

        loss = closure()
        loss.backward()

        for i in range(n_training_iter):
            options = {'closure': closure, 'current_loss': loss, 'max_ls': 20}
            loss, _, _, _, _, _, _, fail = optimizer.step(options)

            print('Iter %d/%d - Loss: %.10f' % (
                i + 1, n_training_iter, loss.item()
            ))

            if fail:
                print('Convergence reached!')
                break

    print(f"Finished training on {train_x.size(0)} data points using {n_devices} GPUs.")
    return model, likelihood
```
Then I run the following:
```python
model, likelihood = train(train_x, train_y,
                          n_devices=n_devices,
                          output_device=output_device,
                          checkpoint_size=checkpoint_size,
                          preconditioner_size=preconditioner_size,
                          n_training_iter=50)
```
Why don't I try it again just to make sure; I don't want you to waste time unnecessarily. I will have to work on it next week, but will let you know once I try it out.
I apologize if this is a dumb question, but I wanted to understand whether the scale parameter can be interpreted as a decomposition of variance components. I modified your multi-GPU model, which uses the protein dataset, so that the model is a sum of an RBF and a linear kernel, as follows:

When I estimate the normalized scale parameters I get the following:

Since y has been standardized, am I to interpret noise^2, RBF kernel scale^2, and linear kernel variance^2 as the decomposition of the variance of y into components based on the kernel components?

Squaring and summing the terms gives 0.077^2 + 0.818^2 + 0.693^3 = 1.00786, which is close to 1 but not exactly 1.
The reason I ask is because variance decomposition models are used a great deal in statistical genetics, and I am trying to understand whether these GPyTorch estimates are giving me the same thing. This page may be useful for background on variance decomposition:
https://limix.readthedocs.io/en/latest/vardec.html
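For what it's worth, here is a hedged sketch of the heritability-style normalization I have in mind. The attribute names are hypothetical stand-ins for the modified RBF-plus-linear model (whose code is omitted above), and the normalization is my assumption, not something GPyTorch computes; for the linear kernel, the actual contribution also depends on the scale of the features, so this is only a rough analogue unless X is standardized:

```python
# Hypothetical variance decomposition, analogous to heritability: each
# component's share = its signal variance / (total signal variance + noise).
# `rbf_covar_module` (a ScaleKernel) and `linear_covar_module` (a bare
# LinearKernel) are illustrative names, not the actual attributes.
noise = likelihood.noise.item()
v_rbf = model.rbf_covar_module.outputscale.item()  # RBF signal variance
v_lin = model.linear_covar_module.variance.item()  # linear kernel variance
total = v_rbf + v_lin + noise

print(f"rbf share:    {v_rbf / total:.3f}")
print(f"linear share: {v_lin / total:.3f}")
print(f"noise share:  {noise / total:.3f}")
```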