cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch

[Question] Scalability of multitask GP with >1000 tasks #2283

Open · Nacho888 opened this issue 1 year ago

Nacho888 commented 1 year ago

I am trying to fit a model in which each sample has three float features (X) and a corresponding set of 13 vectors of 100 points (y).

Dataset shape (X, y): torch.Size([1500, 3]), torch.Size([1500, 13, 100])

I looked into issues such as #1763 (since at some point I would also like to use DataLoaders), but I do not know whether my dataset shape is valid or how I should tackle the problem. I believe the approach should be similar to that of:

with, in my case, 13 tasks. However, I cannot figure out how to handle the extra dimension of 100 points. I might be confusing some of the underlying terminology, but I believe it should be possible to model this problem with GPs.

Any help would be more than welcome :)

Turakar commented 1 year ago

I would reshape the y-data to (1500, 1300) and fit a Multitask-Model on that.
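
A minimal sketch of that reshape, assuming y carries the torch.Size([1500, 13, 100]) shape from the question:

import torch

# placeholder with the shapes from the question: 1500 samples, 13 vectors of 100 points each
y = torch.randn(1500, 13, 100)

# flatten to (1500, 1300): every (correlator, point) pair becomes its own task
y_flat = y.reshape(y.shape[0], -1)
print(y_flat.shape)  # torch.Size([1500, 1300])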

Balandat commented 1 year ago

If you only have 100 data points per task (and the locations are shared across all tasks), you can also use https://github.com/cornellius-gp/gpytorch/blob/master/examples/03_Multitask_Exact_GPs/Multitask_GP_Regression.ipynb, which uses the Kronecker structure of the covariances to make this scalable. Are your observations noiseless? If so, then you actually won't get any benefit from multi-task modeling here (a property called autokrigeability). If your observations are noisy, then this can be helpful!

Nacho888 commented 1 year ago

Thanks to both of you for answering! I would like to give a bit more context on my dataset. The 13 vectors of 100 points that I need to predict (reshaped, as suggested, to torch.Size([1500, 1300])) correspond to a single observation of 3 floating-point numbers, which are my features (torch.Size([1500, 3])). My observations are a set of integral correlators that I need to model (in order to speed up computations). They were generated with a Fast Fourier Transform-based algorithm from some KNOWN functions that take a significant amount of time to evaluate, so I assume they should be more or less noiseless (although I am not really sure about this statement).

So if I do something like this:

import gpytorch

class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, X_train, train_y, likelihood):
        super(MultitaskGPModel, self).__init__(X_train, train_y, likelihood)
        self.mean_module = gpytorch.means.MultitaskMean(
            gpytorch.means.ConstantMean(), num_tasks=13
        )
        self.covar_module = gpytorch.kernels.MultitaskKernel(
            gpytorch.kernels.RBFKernel(), num_tasks=13, rank=1
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=13)
model = MultitaskGPModel(X_train, y_train, likelihood)

It always crashes when trying to evaluate the loss function/criterion:

for i in range(training_iterations):
    optimizer.zero_grad()
    output = model(X_train)
    print(output)
    print(y_train.shape)
    loss = -mll(output, y_train)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))
    optimizer.step()

In this case, the two prints show MultitaskMultivariateNormal(loc: torch.Size([19500])) and torch.Size([1500, 1300]) respectively, i.e. tensors of 19,500 (= 1500 x 13) and 1,950,000 (= 1500 x 1300) elements. This shape mismatch is what makes the loss computation fail.
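
For reference, the multitask likelihood and MultitaskMultivariateNormal expect targets of shape (N, num_tasks), which is exactly where the 1500 x 13 vs. 1500 x 1300 mismatch comes from. Below is a minimal sketch that derives num_tasks from the target width so the two stay consistent (the class name and placeholder tensors are mine, and whether 1300 tasks is tractable is a separate question):

import torch
import gpytorch

class ConsistentMultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, num_tasks):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.MultitaskMean(
            gpytorch.means.ConstantMean(), num_tasks=num_tasks
        )
        self.covar_module = gpytorch.kernels.MultitaskKernel(
            gpytorch.kernels.RBFKernel(), num_tasks=num_tasks, rank=1
        )

    def forward(self, x):
        return gpytorch.distributions.MultitaskMultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

X_train = torch.randn(1500, 3)      # placeholder features
y_train = torch.randn(1500, 1300)   # placeholder targets, already flattened
num_tasks = y_train.shape[-1]       # 1300, so the likelihood matches the target width
likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=num_tasks)
model = ConsistentMultitaskGPModel(X_train, y_train, likelihood, num_tasks)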

Turakar commented 1 year ago

I have two points regarding your model:

Nacho888 commented 1 year ago

I managed to train the model with the following architecture and it seems to be optimizing!

import gpytorch
from gpytorch.mlls import SumMarginalLogLikelihood

N_INTEGRALS = 13

class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, X_train, train_y, likelihood):
        super(MultitaskGPModel, self).__init__(X_train, train_y, likelihood)
        self.mean_module = gpytorch.means.MultitaskMean(
            gpytorch.means.ConstantMean(), num_tasks=100
        )
        self.covar_module = gpytorch.kernels.MultitaskKernel(
            gpytorch.kernels.RBFKernel() + gpytorch.kernels.MaternKernel(), num_tasks=100, rank=1
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x)

models = []
likelihoods = []
for correlator_idx in range(N_INTEGRALS):
    likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=100)
    likelihoods.append(likelihood)
    models.append(MultitaskGPModel(X_train, y_train[:, correlator_idx, :], likelihood))

model = gpytorch.models.IndependentModelList(*models)
likelihood = gpytorch.likelihoods.LikelihoodList(*likelihoods)

mll = SumMarginalLogLikelihood(likelihood, model)

The training loop is as follows (and works for the 13 models, each trained on 1350 samples of 3 features):

import torch

X_train = X_train.to(DEVICE)
y_train = y_train.to(DEVICE)

# Find optimal model hyperparameters
model.to(DEVICE)
likelihood.to(DEVICE)

model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  # Includes GaussianLikelihood parameters

training_iter = 50
losses = []
for i in range(training_iter):
    optimizer.zero_grad()
    output = model(*model.train_inputs)
    loss = -mll(output, model.train_targets)

    print(f"Iter {i + 1}/{training_iter} - Loss: {loss.item()}")
    losses.append(loss.item())

    loss.backward()
    optimizer.step()

But I run into GPU memory issues when running inference on the test set (which is just 150 vectors of 3 values):

from time import perf_counter

X_test = X_test.to(DEVICE)
y_test = y_test.to(DEVICE)

model.eval()
likelihood.eval()

t_start = perf_counter()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    X_test = [X_test for _ in range(N_INTEGRALS)]  # one copy of the (150, 3) test inputs per model
    predictions = likelihood(*model(*X_test))

t_stop = perf_counter()
print(f"Inference time: {t_stop - t_start:.4f}s")

And I get the usual error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 67.89 GiB (GPU 0; 4.00 GiB total capacity; 145.50 MiB already allocated; 2.55 GiB free; 228.00 MiB reserved in total by PyTorch).

However, I do not know why this is happening.

Balandat commented 1 year ago

By passing a 150 x 3 tensor to each of the models, you're evaluating the joint posterior across all 150 points. And since this is a multi-task model with 100 outputs, the covariance matrix is actually of size (150 * 100) x (150 * 100), so pretty large, especially if you have 13 of these flying around (together with the train-train covariances computed on the models). If you're OK with just the marginal at each data point, you can pass in X_test.unsqueeze(-2), which will result in a (batched) posterior covariance of shape 150 x 100 x 100.
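
A minimal sketch of that marginal-only evaluation, reusing the IndependentModelList setup from the snippets above (the variable names here are placeholders):

with torch.no_grad(), gpytorch.settings.fast_pred_var():
    # (150, 3) -> (150, 1, 3): a batch of 150 single-point inputs, so no 15000 x 15000 joint covariance
    X_test_batched = X_test.unsqueeze(-2)
    preds = likelihood(*model(*[X_test_batched for _ in range(N_INTEGRALS)]))
    # each prediction now has mean of shape (150, 1, 100) and covariance of shape (150, 100, 100)
    means = [p.mean.squeeze(-2) for p in preds]  # 13 tensors of shape (150, 100)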

Also, a side note: You're using a rank=1 cross-task covariance matrix for 100 tasks, which does seem rather ambitious - is this intended?

Nacho888 commented 1 year ago

Thanks for your answer! I did some experiments with the inference part, but I end up with the same memory issues, so that does not seem to be helping. On CPU it also tries to allocate 72.9 GB of RAM, so nothing changes there either. I guess I could train individual models and run 13 separate inferences, but I suspect I would end up running into the same issues.

I have tried all of the approaches mentioned in this thread, but the only one that seems to be working is the multi-task model that maps the 3 values in X-space to the 100 points I am trying to predict (13 times, as each of the vectors corresponds to a different function). That is why I decided to train 13 individual models that do this mapping and group them together in the IndependentModelList. It does seem to be learning something (hopefully), as the training loss converges quickly and to a reasonable value. However, I cannot verify my results, since I cannot test the generalization performance: the testing loop is not feasible to compute.

PS: maybe I am not understanding the 'rank' parameter, but leaving the default value (which I think is zero) gives me worse training results (slower convergence and higher loss) than rank=1.

Turakar commented 1 year ago

What do you mean by the test loop not being feasible? Unlike during training, you could use batches there to reduce memory usage.
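
For example, a rough sketch of such batched prediction over the model list (the chunk size is an arbitrary placeholder to be tuned to the available memory):

chunk_size = 16  # placeholder; tune to fit GPU memory
chunk_means = [[] for _ in range(N_INTEGRALS)]

with torch.no_grad(), gpytorch.settings.fast_pred_var():
    for X_chunk in X_test.split(chunk_size):  # X_test of shape (150, 3)
        preds = likelihood(*model(*[X_chunk] * N_INTEGRALS))
        for task_idx, p in enumerate(preds):
            chunk_means[task_idx].append(p.mean)  # (chunk_size, 100) per chunk

means = [torch.cat(parts) for parts in chunk_means]  # 13 tensors of shape (150, 100)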

Nacho888 commented 1 year ago

Sorry, by 'testing loop' I just meant the inference process. However, I have managed to infer one data point at a time, which is really slow but better than nothing. I suspect this is not by any means the best way to exploit the full potential of this library.

gpleiss commented 1 year ago

@Nacho888 This is a use case (having >1000 outputs) that GPyTorch is not optimized for. I'm sure there are plenty of smart solutions that could be implemented for better scalability, but none are currently available in GPyTorch.

Balandat commented 1 year ago

PS: maybe I am not understanding the 'rank' parameter, but leaving the default value (which I think is zero) gives me worse training results (slower convergence and higher loss) than rank=1.

So rank here is just the rank of the low-rank approximation used for the cross-task covariance matrix. The higher the rank, the more expressive this approximation is, but also the more parameters that need to be inferred. You could test empirically what gives you the best out-of-sample prediction performance.
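
For illustration, the rank only changes the kernel construction; roughly, the cross-task covariance is parameterized as B = F F^T + diag(v) with F of shape (num_tasks, rank). The values below are just examples to compare, not recommendations:

import gpytorch

# rank=1: F is (100, 1), few parameters, very restrictive cross-task structure
covar_rank1 = gpytorch.kernels.MultitaskKernel(
    gpytorch.kernels.RBFKernel(), num_tasks=100, rank=1
)

# rank=5: F is (100, 5), more expressive but more hyperparameters to infer
covar_rank5 = gpytorch.kernels.MultitaskKernel(
    gpytorch.kernels.RBFKernel(), num_tasks=100, rank=5
)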

I have managed to infer one data point at a time, which is really slow but better than nothing

You can probably batch together a bunch of data points for prediction so you don't have to loop through all of them individually. Make sure that you're passing these in with a batch shape so that you're not computing the joint distribution across points; i.e. something like prediction_batch_size x 1 x d rather than prediction_batch_size x d.
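
Combining the chunking and the batch-shape suggestions, the input shaping might look roughly like this (names and sizes are placeholders):

prediction_batch_size = 16  # placeholder
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    for X_chunk in X_test.split(prediction_batch_size):
        X_chunk = X_chunk.unsqueeze(-2)  # (prediction_batch_size, 1, d) instead of (prediction_batch_size, d)
        preds = likelihood(*model(*[X_chunk] * N_INTEGRALS))
        # each element of preds now has marginal covariances of shape (prediction_batch_size, 100, 100)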