cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch
MIT License

[Question] How to implement the AdditiveStructureKernel to make the training and inference faster #1127

Open HuangYaowei opened 4 years ago

HuangYaowei commented 4 years ago

I would like to train a GP model on a very high-dimensional X. I first decompose X into 27 subspaces and then use the sum of 27 MaternKernels as the covar_module. However, this is even slower than a single ScaleKernel on the original X without any decomposition. What should I do?

Code snippet to reproduce


import torch
import gpytorch
from gpytorch.constraints import Interval
from gpytorch.kernels import MaternKernel, ScaleKernel, AdditiveStructureKernel
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.mlls import ExactMarginalLogLikelihood


def split(a, n):
    # split array a into n approximately equal splits
    k, m = divmod(len(a), n)
    return [tuple(a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]) for i in range(n)]
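# e.g. split(list(range(10)), 3) -> [(0, 1, 2, 3), (4, 5, 6), (7, 8, 9)]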
# for simplicity, random values are used here; the original data is normalized x and y.
train_x = torch.randn(30, 6912).to('cuda:0')
train_y = torch.randn(30).to('cuda:0')

subspace_dim_list = split(list(range(train_x.shape[1])), 27)

lengthscale_constraint = Interval(0.005, 2.0)
outputscale_constraint = Interval(0.05, 20.0)
for i in range(27):# range(self.n_sub):  # 27
    # get active dimensions for each subspace_id
    subspace_dim = subspace_dim_list[i]

    kern_i = MaternKernel(lengthscale_constraint=lengthscale_constraint, ard_num_dims=len(subspace_dim), nu=2.5, active_dims=subspace_dim)

    if i == 0:
        kern = kern_i
    else:
        kern += kern_i
covar_module = AdditiveStructureKernel(base_kernel=kern, num_dims=train_x.shape[1])

noise_constraint = Interval(1e-6, 2e-6)
likelihood = GaussianLikelihood(noise_constraint=noise_constraint).to(device=train_x.device, dtype=train_y.dtype)
# ard_dims = train_x.shape[1] if use_ard else None
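# NOTE: GP is assumed to be the user's own ExactGP subclass that accepts a custom covar_module (not shown here).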
sub_model = GP(
    train_x=train_x,
    train_y=train_y,
    likelihood=likelihood,
    covar_module=covar_module
    ).to(device=train_x.device, dtype=train_x.dtype)

# Find optimal model hyperparameters
sub_model.train()
likelihood.train()

# "Loss" for GPs - the marginal log likelihood
mll = ExactMarginalLogLikelihood(likelihood, sub_model)

optimizer = torch.optim.Adam([{"params": sub_model.parameters()}], lr=0.1)
# try:
# print(train_x.shape)
# print(train_y.shape)
with torch.enable_grad(), gpytorch.settings.max_cholesky_size(2000):
    for _ in range(1000):
        optimizer.zero_grad()
        output = sub_model(train_x)
        loss = -mll(output, train_y)
        loss.backward()
        optimizer.step()
model_log_likelihood = loss

However, when I run this code on a single GPU, it takes 384s. If I instead use

kern = MaternKernel(lengthscale_constraint=lengthscale_constraint, ard_num_dims=train_x.shape[1], nu=2.5)
covar_module = ScaleKernel(kern, outputscale_constraint=outputscale_constraint) 

it only needs 58s. My original goal was to make the additive version faster than 58s, e.g., around 10s.

Have I implemented the AdditiveStructureKernel correctly, or is there something else I should do? Thank you very much.

By the way, I followed what GPy does; I will also paste that code here:

    def _create_model_sub(self, X, Y, input_dim_permutate_list):
        """
        Creates the model for a subspace of dimensions

        :param X: observed input data
        :param Y: observed output data
        :param input_dim_permutate_list: shuffled input dimension list
        """
        # split the input dimensions into nsubspaces
        subspace_dim_list = split(input_dim_permutate_list, self.n_sub)
        # define the additive kernel
        for i in range(self.n_sub): # 27
            # get active dimensions for each subspace_id
            subspace_dim = subspace_dim_list[i]
            kern_i = GPy.kern.Matern52(len(subspace_dim), variance=1., ARD=self.ARD,
                                       active_dims=subspace_dim, name=f'k{i}')
            # kern_i.variance.fix()
            if i == 0:
                kern = kern_i
            else:
                kern += kern_i
        print(kern)
        # define GP model
        noise_var = Y.var() * 0.01 if self.noise_var is None else self.noise_var
        # if not self.sparse_surrogate:
        sub_model = GPy.models.GPRegression(X, Y, kernel=kern, noise_var=noise_var)
        # else:
        #     sub_model = GPy.models.SparseGPRegression(X, Y, kernel=kern, num_inducing=self.num_inducing)

        # if self.exact_feval:
            # restrict noise variance if exact evaluations of the objective
        sub_model.Gaussian_noise.constrain_fixed(1e-6, warning=False)
        # else:
        #     # bound the noise variance if not
        #     sub_model.Gaussian_noise.constrain_bounded(1e-9, 1e6, warning=False)

        # optimise the GP hyperparameters
        try:
            sub_model.optimize(optimizer=self.optimizer, max_iters=self.max_iters, messages=False,
                               ipython_notebook=False)
            model_log_likelihood = sub_model.log_likelihood()
        except:
            model_log_likelihood = -100.00

        return sub_model, model_log_likelihood
gpleiss commented 4 years ago

Sorry for the slow reply @HuangYaowei - in general, it is very likely that additive decompositions are slower than non-additive kernels (up to maybe, let's say, 20,000 data points). This is because GPyTorch can take advantage of parallelism when there isn't a decomposition.

However, it's also worth noting that your implementation can be fixed :) AdditiveStructureKernel is designed to be a parallel object. The way you have currently constructed your kernel:

for i in range(27):# range(self.n_sub):  # 27
    # get active dimensions for each subspace_id
    subspace_dim = subspace_dim_list[i]

    kern_i = MaternKernel(lengthscale_constraint=lengthscale_constraint, ard_num_dims=len(subspace_dim), nu=2.5, active_dims=subspace_dim)

    if i == 0:
        kern = kern_i
    else:
        kern += kern_i
covar_module = AdditiveStructureKernel(base_kernel=kern, num_dims=train_x.shape[1])

^^ This makes all 27 kernels operate in series. The proper way to use AdditiveStructureKernel is:

covar_module = AdditiveStructureKernel(base_kernel=MaternKernel(), num_dims=train_x.shape[1])

The AdditiveStructureKernel wrapper converts the MaternKernel into a batch of MaternKernels (one for each dimension) and then sums them up. By using batch operations we can get GPU parallelism. With the code you wrote (a separate kernel object for each dimension), there's no way to use GPU parallelism.
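To make that concrete, here is a minimal sketch of a full model built this way, following the standard ExactGP pattern from the GPyTorch docs; the ExactGPModel class name and the random training data are illustrative, not taken from this thread.

```python
import torch
import gpytorch
from gpytorch.kernels import MaternKernel, AdditiveStructureKernel

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # One MaternKernel applied to every dimension in a single batched call;
        # AdditiveStructureKernel sums the per-dimension covariances.
        self.covar_module = AdditiveStructureKernel(
            base_kernel=MaternKernel(nu=2.5),
            num_dims=train_x.shape[1],
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

train_x = torch.randn(30, 6912)
train_y = torch.randn(30)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)
```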

HuangYaowei commented 4 years ago

Thanks for your reply @gpleiss. Yes, I have also found that additive decompositions are slower than non-additive ones. However, I have an extra question. We usually assume a GPU runs faster than a CPU, but when I compare the speed of GPyTorch and GPy, I find that in high-dimensional spaces GPyTorch does not run faster than GPy. Have you compared the speed against other repositories? How can I improve my speed in GPyTorch? The comparison was run with 1 GPU and 28 CPUs; I have already evaluated 30 points and want to infer the next point with a Matern52 kernel. [image: speed comparison]

gpleiss commented 4 years ago

What exactly are you comparing here? Can you provide code examples?

eytan commented 4 years ago

Hi @gpleiss, we are having some difficulties with additive GPs (even following your AdditiveStructureKernel example): for PairwiseGPs in BoTorch, the additive kernel takes something like 10x longer to fit (with 9 dims) than a full RBF, and is slightly slower to predict with. I also have a minimal repro using vanilla regression (ScaleKernel(RBFKernel())), which gives a fitting time that is ~2.4x slower and a prediction time that is about 74% slower. Is this expected?
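For context, a sketch of the kind of timing comparison being described might look like the following. This is not the actual repro (which is not shown in the thread); the model class, data size (n = 500), and iteration count are made-up placeholders, and only the 9-dimensional input matches the description above.

```python
import time
import torch
import gpytorch
from gpytorch.kernels import RBFKernel, ScaleKernel, AdditiveStructureKernel

class SimpleGP(gpytorch.models.ExactGP):
    # Hypothetical exact-GP regression model that takes the covariance module as an argument.
    def __init__(self, train_x, train_y, likelihood, covar_module):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = covar_module

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

def fit_seconds(covar_module, train_x, train_y, iters=100):
    # Time a fixed number of Adam steps on the exact marginal log likelihood.
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model = SimpleGP(train_x, train_y, likelihood, covar_module)
    model.train()
    likelihood.train()
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
    start = time.time()
    for _ in range(iters):
        optimizer.zero_grad()
        loss = -mll(model(train_x), train_y)
        loss.backward()
        optimizer.step()
    return time.time() - start

train_x, train_y = torch.randn(500, 9), torch.randn(500)  # n = 500 is arbitrary
t_full = fit_seconds(ScaleKernel(RBFKernel(ard_num_dims=9)), train_x, train_y)
t_add = fit_seconds(AdditiveStructureKernel(RBFKernel(), num_dims=9), train_x, train_y)
print(f"full RBF: {t_full:.1f}s   additive: {t_add:.1f}s")
```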

gpleiss commented 4 years ago

How much data? Can you share the repro?