cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch
MIT License

GP params / DKL #994

Closed cherepanovic closed 4 years ago

cherepanovic commented 4 years ago

In the following setup, Dense(784, 784) -> GP (see the snippets below), I have the following parameters for optimization:

torch.Size([784, 784])
torch.Size([784])
torch.Size([])
torch.Size([1, 1])
torch.Size([1])
torch.Size([784, 64])
torch.Size([784, 64, 64])
torch.Size([10, 784])

The first two lines are the parameters of the dense layer:


torch.Size([784, 784])
torch.Size([784])

The rest is supposed to come from the GP:

torch.Size([])
torch.Size([1, 1])
torch.Size([1])
torch.Size([784, 64])
torch.Size([784, 64, 64])
torch.Size([10, 784])

Could you give a short explanation of what these parameters are, and whether it makes sense to visualize some of them during training?

code snippets

GP layer

import math

import torch
import gpytorch


class GaussianProcessLayer(gpytorch.models.AbstractVariationalGP):
    def __init__(self, num_dim, grid_bounds=(-10., 10.), grid_size=64):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            num_inducing_points=grid_size, batch_size=num_dim
        )
        variational_strategy = gpytorch.variational.AdditiveGridInterpolationVariationalStrategy(
            self, grid_size=grid_size, grid_bounds=[grid_bounds], num_dim=num_dim,
            variational_distribution=variational_distribution, mixing_params=False, sum_output=False
        )
        super().__init__(variational_strategy)

        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(
                lengthscale_prior=gpytorch.priors.SmoothedBoxPrior(
                    math.exp(-1), math.exp(1), sigma=0.1, transform=torch.exp
                )
            )
        )
        self.mean_module = gpytorch.means.ConstantMean()
        self.grid_bounds = grid_bounds

    def forward(self, x):
        mean = self.mean_module(x)
        covar = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, covar)

DKL snippet

self.lin = nn.Linear(784,784)
self.gp_layer = GaussianProcessLayer(num_dim=num_dim, grid_bounds=grid_bounds)
self.grid_bounds = grid_bounds
self.num_dim = num_dim

optimizer

    optimizer = SGD([
        {'params': model.lin.parameters(), 'weight_decay': 1e-3},
        # {'params': model.lin2.parameters(), 'weight_decay': 1e-4},
        # {'params': model.seqsum.parameters(), 'weight_decay': 1e-4},
        {'params': model.gp_layer.hyperparameters(), 'lr': lr * 0.01},
        {'params': model.gp_layer.variational_parameters()},
        {'params': likelihood.parameters()},
    ], lr=lr, momentum=0.9, nesterov=True, weight_decay=0)
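
For reference, a quick way to map each of the shapes listed above to a parameter name is to iterate over named_parameters() (just an inspection sketch, assuming model and likelihood are built from the snippets above):

# Inspection sketch: print every trainable parameter name next to its shape,
# so each torch.Size listed above can be matched to a module.
for name, param in model.named_parameters():
    print(name, tuple(param.shape))
for name, param in likelihood.named_parameters():
    print(name, tuple(param.shape))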
jacobrgardner commented 4 years ago

@cherepanovic Sorry for the long delay on responding here -- both Geoff and I were at NeurIPS all last week.

The parameters involved in variational inference include:

- a num_features x num_inducing variational mean (the torch.Size([784, 64]) tensor above), and
- a lower-triangular Cholesky factor of the variational covariance, one num_inducing x num_inducing block per feature (the torch.Size([784, 64, 64]) tensor).
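
To see the split directly, you can print the shapes from the same iterators used in the optimizer groups above (an inspection sketch, assuming the model from the snippets):

# Inspection sketch: kernel/mean hyperparameters vs. variational parameters,
# using the same iterators as the optimizer parameter groups above.
print([tuple(p.shape) for p in model.gp_layer.hyperparameters()])
print([tuple(p.shape) for p in model.gp_layer.variational_parameters()])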

cherepanovic commented 4 years ago

@jacobrgardner glad to see you and thanks a lot for your response!

what you are seeing is a num_features x num_inducing mean

Is it graphically interpretable?

Is there generally a way to graphically interpret the learned density/boundaries in lower dimensions?

gpleiss commented 4 years ago

@cherepanovic - "graphically interpretable" is a bit subjective. However, each of the features output by the GP are independent. What I would do is i would make num_features plots, each plotting inducing_mean as a function variational_strategy.inducing_points for each one of the num_features.

Is there generally a way to graphically interpret the learned density/boundaries in lower dimensions?

This might be difficult because the output is fairly high-dimensional. Something like t-SNE might be your best bet.

(To answer your question - as far as I know there is no established practice, but these are ideas for where I would start.)
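
A minimal t-SNE sketch of that idea (gp_outputs and labels are hypothetical names, not from this thread: a [num_samples, num_features] array of GP-layer outputs collected over a batch, and the matching class labels):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Embed the high-dimensional GP-layer outputs in 2-D and color by class.
embedding = TSNE(n_components=2).fit_transform(gp_outputs)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=5, cmap="tab10")
plt.colorbar()
plt.show()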

Jianf-Wang commented 3 years ago

@jacobrgardner Hello, I just looked at the value of the variational covariance, but it is a lower triangular matrix. So, is it a covariance matrix or just a Cholesky factor (L)? If I would like to get the covariance matrix, should I use LL^T?
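
If it is indeed the Cholesky factor, as its lower-triangular shape suggests, the covariance would be recovered as L L^T. A small sketch under that assumption (the parameter is looked up by the name suffix chol_variational_covar, which may differ across gpytorch versions):

import torch

# Assumption: the stored lower-triangular tensor is the (raw) Cholesky factor.
chol = next(p for n, p in model.gp_layer.named_parameters()
            if n.endswith("chol_variational_covar")).detach()
L = torch.tril(chol)                     # enforce the lower triangle
covar = L @ L.transpose(-1, -2)          # [num_features, num_inducing, num_inducing]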