cmlakhan opened this issue 5 years ago
Kind of. The scale kernel's "signal variance" relative to the likelihood's "noise variance" can be viewed as a signal-to-noise ratio: the higher the signal variance is relative to the noise, the more the GP is actually trying to "fit" variations in the data rather than explain them away as noise. E.g., if noise >> signal, you'll typically end up with a very flat mean function and constant variances, but if signal >> noise, the GP is actually trying to fit the variations.

When you have multiple kernels involved, things get a little messier, but you can still think of it generally in terms of "total signal variance" compared to noise, and kernel components with higher signal variances contribute more to the model fit.

I believe there are some theoretical discussions of these properties in the literature if you want something more formal. Does that make sense?
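For concreteness, a minimal sketch of reading this ratio off a trained model (the toy data, model, and training loop below are illustrative stand-ins, not code from this issue):

```python
import torch
import gpytorch

# Toy data: a noisy sine wave (purely illustrative).
train_x = torch.linspace(0, 1, 100)
train_y = torch.sin(train_x * 6.28) + 0.1 * torch.randn(train_x.size(0))

class ToyGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # ScaleKernel's outputscale is the "signal variance" discussed above.
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ToyGPModel(train_x, train_y, likelihood)
model.train()
likelihood.train()

mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(100):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()

# Signal variance relative to noise variance: the ratio described above.
snr = model.covar_module.outputscale.item() / likelihood.noise.item()
print(f"signal variance / noise variance = {snr:.3f}")
```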
Your explanation makes intuitive sense. In the statistical genetics literature there is a notion of heritability, which is the ratio of the variance explained by a genetic 'kernel' matrix to the sum of the 'noise variance' and the variance explained by the genetic kernel. This sounds similar to the measure you are describing. I am hoping to extend this to include multiple kernel matrices based on different sets of features. There have been folks in the statistical genetics community who have used Gaussian process regression to quantify this measure in the way you seem to be describing, but I will need to read more to see if these match up precisely. Papers for reference below:
https://www.biorxiv.org/content/10.1101/040576v1
https://www.pnas.org/content/113/27/7377/
If there are any papers from the ML perspective that you would suggest on this topic, that would be most appreciated. Thanks so much for your help!
Sorry, also, on a slightly related note: when I try to sum linear kernels over different subsets of data dimensions, as follows:
```python
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, n_devices):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # One linear kernel per block of features.
        self.covar_module_1 = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.LinearKernel(num_dimensions=3, active_dims=[0, 1, 2])
        )
        self.covar_module_2 = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.LinearKernel(num_dimensions=3, active_dims=[3, 4, 5])
        )
        base_covar_module = self.covar_module_1 + self.covar_module_2
        # output_device is assumed to be defined globally, as in the multi-GPU example.
        self.covar_module = gpytorch.kernels.MultiDeviceKernel(
            base_covar_module, device_ids=range(n_devices),
            output_device=output_device
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
```
Using the protein dataset I get the same scale parameters for each kernel. Is there a reason for this? I have seen some discussion about this but am a little unclear as to why this happens. Perhaps I am missing something conceptually about the way these kernels are constructed.
```python
print('noise: %.3f \n '
      'linear kernel 1 scale: %.3f \n '
      'linear kernel 1 variance: %.3f \n '
      'linear kernel 2 scale: %.3f \n '
      'linear kernel 2 variance: %.3f' %
      (model.likelihood.noise.item(),
       model.covar_module_1.outputscale.item(),
       model.covar_module_1.base_kernel.variance.item(),
       model.covar_module_2.outputscale.item(),
       model.covar_module_2.base_kernel.variance.item()))
```
```
noise: 0.719
linear kernel 1 scale: 0.693
linear kernel 1 variance: 0.693
linear kernel 2 scale: 0.693
linear kernel 2 variance: 0.693
```
You're running into a few simple algebraic problems. First, since the linear kernel is defined to be

$$v XX^{\top},$$

the scale kernel doesn't add value, because it gives you $s \cdot v \, XX^{\top}$, which is an equivalent kernel (e.g., you can always set $v$ to achieve the same result as $s \cdot v$, since both $s$ and $v$ are free parameters).

Second, the sum of linear kernels over the same inputs,

$$v_{1} XX^{\top} + v_{2} XX^{\top} = (v_{1} + v_{2}) XX^{\top},$$

is itself a linear kernel with variance $v_{1} + v_{2}$. This gives you the same problem: for any setting of $v_{1}$ and $v_{2}$, a single linear kernel with variance $v_{1} + v_{2}$ achieves exactly the same kernel.
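A quick numerical illustration of this degeneracy (a sketch; the variance values here are arbitrary):

```python
import torch
import gpytorch

x = torch.randn(5, 3)

# Two linear kernels over the same inputs, with variances v1 and v2.
k1 = gpytorch.kernels.LinearKernel()
k2 = gpytorch.kernels.LinearKernel()
k1.variance = 0.7
k2.variance = 0.3

# A single linear kernel with variance v1 + v2.
k_sum = gpytorch.kernels.LinearKernel()
k_sum.variance = 1.0

# The summed kernel matrices match the single-kernel matrix, so the two
# parameterizations are indistinguishable to the model.
K_two = k1(x).evaluate() + k2(x).evaluate()
K_one = k_sum(x).evaluate()
print(torch.allclose(K_two, K_one))  # True
```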
Thanks for pointing out the scale issue (I will correct that), but in my case I also want a kernel of the form

$$v_{1} X_1 X_1^{\top} + v_{2} X_2 X_2^{\top},$$

where $X_1$ corresponds to columns 0-2 and $X_2$ corresponds to columns 3-5 of the full $X$ matrix.
I would have thought the terms

`gpytorch.kernels.LinearKernel(num_dimensions=3, active_dims=[0, 1, 2])`

and

`gpytorch.kernels.LinearKernel(num_dimensions=3, active_dims=[3, 4, 5])`

would correspond to the $X_1$ and $X_2$ terms; is that not correct? Or does the LinearKernel only work with the entire $X$ matrix?
@jacobrgardner for the linear kernel, is it possible to choose a subset of columns? `gpytorch.kernels.LinearKernel(num_dimensions=3, active_dims=[3, 4, 5])` did not seem to work.
@cmlakhan active_dims should work for the linear kernel, and this is fairly well unit tested.
For what code exactly is this failing?
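One way to sanity-check this in isolation (a sketch, independent of the multi-GPU wrapper; the tensor shapes are illustrative) is to compare a kernel using active_dims against the same kernel applied to manually sliced columns:

```python
import torch
import gpytorch

x = torch.randn(10, 6)

# Kernel restricted to columns 3-5 via active_dims.
k_active = gpytorch.kernels.LinearKernel(active_dims=[3, 4, 5])

# The same kernel applied to manually sliced columns.
k_manual = gpytorch.kernels.LinearKernel()

K_active = k_active(x).evaluate()
K_manual = k_manual(x[:, 3:6]).evaluate()
print(torch.allclose(K_active, K_manual))  # True if active_dims is working
```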
@jacobrgardner So that was based on this code:
```python
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, n_devices):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module_1 = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.LinearKernel(num_dimensions=3, active_dims=[0, 1, 2])
        )
        self.covar_module_2 = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.LinearKernel(num_dimensions=3, active_dims=[3, 4, 5])
        )
        base_covar_module = self.covar_module_1 + self.covar_module_2
        # output_device is assumed to be defined globally, as in the multi-GPU example.
        self.covar_module = gpytorch.kernels.MultiDeviceKernel(
            base_covar_module, device_ids=range(n_devices),
            output_device=output_device
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
```
but I will remove the scale part as you suggest and re-run. In the above case I am getting the same variance parameter for each kernel, which is why I am confused.
```python
print('noise: %.3f \n '
      'linear kernel 1 scale: %.3f \n '
      'linear kernel 1 variance: %.3f \n '
      'linear kernel 2 scale: %.3f \n '
      'linear kernel 2 variance: %.3f' %
      (model.likelihood.noise.item(),
       model.covar_module_1.outputscale.item(),
       model.covar_module_1.base_kernel.variance.item(),
       model.covar_module_2.outputscale.item(),
       model.covar_module_2.base_kernel.variance.item()))
```
```
noise: 0.719
linear kernel 1 scale: 0.693
linear kernel 1 variance: 0.693
linear kernel 2 scale: 0.693
linear kernel 2 variance: 0.693
```
@cmlakhan Okay, I understand now. Sorry, I missed the part where your original snippet had different active_dims for the kernels!

Let me take a look. At a glance, all of your hyperparameters appear to be set to just the initialization (e.g., 0.693 is the default initialization for most parameters because it is $\mathrm{softplus}(0) = \ln 2 \approx 0.693$) -- have you trained the parameters on any data at all?
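For reference, a quick check of that default (this just evaluates the standard softplus transform; nothing model-specific is assumed):

```python
import torch
import torch.nn.functional as F

# Raw hyperparameters default to 0; the softplus constraint maps them to
# softplus(0) = ln(2) ≈ 0.693, which is why every value above reads 0.693.
print(F.softplus(torch.tensor(0.0)))  # tensor(0.6931)
```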
I thought I did! I am using the following training code, based on your examples.
```python
def train(train_x, train_y, n_devices, output_device,
          checkpoint_size, preconditioner_size, n_training_iter):
    likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
    model = ExactGPModel(train_x, train_y, likelihood, n_devices).to(output_device)
    model.double()
    model.train()
    likelihood.train()

    # FullBatchLBFGS comes from the PyTorch-LBFGS package used in the
    # GPyTorch multi-GPU example.
    optimizer = FullBatchLBFGS(model.parameters(), lr=0.1)

    # "Loss" for GPs - the marginal log likelihood
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

    with gpytorch.beta_features.checkpoint_kernel(checkpoint_size), gpytorch.settings.max_preconditioner_size(preconditioner_size):

        def closure():
            optimizer.zero_grad()
            output = model(train_x)
            loss = -mll(output, train_y)
            return loss

        loss = closure()
        loss.backward()

        for i in range(n_training_iter):
            options = {'closure': closure, 'current_loss': loss, 'max_ls': 20}
            loss, _, _, _, _, _, _, fail = optimizer.step(options)

            print('Iter %d/%d - Loss: %.10f' % (
                i + 1, n_training_iter, loss.item()
            ))

            if fail:
                print('Convergence reached!')
                break

    print(f"Finished training on {train_x.size(0)} data points using {n_devices} GPUs.")
    return model, likelihood
```
Then I run the following:
```python
model, likelihood = train(train_x, train_y,
                          n_devices=n_devices,
                          output_device=output_device,
                          checkpoint_size=checkpoint_size,
                          preconditioner_size=preconditioner_size,
                          n_training_iter=50)
```
Why don't I try it again just to make sure; I don't want you to waste time unnecessarily. I will have to work on it next week, but will let you know once I try it out.
I apologize if this is a dumb question, but I wanted to understand whether the scale parameter can be interpreted as a decomposition of variance components. I modified your multi-GPU model, which uses the protein dataset, so that the model is a sum of an RBF and a linear kernel, as follows:

When I estimate the normalized scale parameters I get the following:

Since y has been standardized, am I to interpret noise^2, RBF kernel scale^2, and linear kernel variance^2 as the decomposition of the variance of y into components based on the kernel components?

Squaring and summing the terms gives 0.077^2 + 0.818^2 + 0.693^3 = 1.00786, which is close to 1 but not exactly 1.
The reason I ask is because variance decomposition models are used a great deal in statistical genetics, and I am trying to understand whether these GPyTorch estimates are giving me the same thing. This page may be useful for background on variance decomposition:
https://limix.readthedocs.io/en/latest/vardec.html
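For what it's worth, here is a hedged sketch of the heritability-style normalization I have in mind. The attribute names are hypothetical stand-ins for the modified RBF-plus-linear model (whose code is omitted above), and the normalization is my assumption, not something GPyTorch computes; for the linear kernel, the actual contribution also depends on the scale of the features, so this is only a rough analogue unless X is standardized:

```python
# Hypothetical variance decomposition, analogous to heritability: each
# component's share = its signal variance / (total signal variance + noise).
# `rbf_covar_module` (a ScaleKernel) and `linear_covar_module` (a bare
# LinearKernel) are illustrative names, not the actual attributes.
noise = likelihood.noise.item()
v_rbf = model.rbf_covar_module.outputscale.item()  # RBF signal variance
v_lin = model.linear_covar_module.variance.item()  # linear kernel variance
total = v_rbf + v_lin + noise

print(f"rbf share:    {v_rbf / total:.3f}")
print(f"linear share: {v_lin / total:.3f}")
print(f"noise share:  {noise / total:.3f}")
```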