c-lyu closed this issue 1 year ago
Taking a glance at PiecewisePolynomialKernel, I'm pretty sure the issue is here: https://github.com/cornellius-gp/gpytorch/blob/c35d094a8ba5fc2ccd28cd5a8c7bb7e958bb0752/gpytorch/kernels/piecewise_polynomial_kernel.py#L83
We are instantiating a torch.tensor(0.0) without setting it to the same dtype or device as r. I think the fix should be as simple as using torch.tensor(0.0, dtype=r.dtype, device=r.device).
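Roughly, the change would look like this (a sketch only, not the exact kernel code; the helper name is just for illustration):

import torch

def zero_like_r(r):
    # Illustrative sketch: torch.tensor(0.0) always lands on CPU with the
    # default dtype, so the constant should instead follow r's dtype and device.
    return torch.tensor(0.0, dtype=r.dtype, device=r.device)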
Can you try making that change to the source code and, if it works, submit a bug fix PR?
Thank you for the answer, but unfortunately this fix doesn't work. I have also verified that the issue is unrelated to MultiDeviceKernel and lies only with PiecewisePolynomialKernel, since the same error arises when using a single GPU (a minimal code example is shown below).
According to the error message, the device mismatch happens during backpropagation rather than the forward pass. Strangely, the inputs and outputs of the model, as well as the loss, are all on the correct device, as can be seen in the logs.
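For reference, PyTorch's anomaly detection can make the backward error also report the forward operation that produced the offending tensor; a sketch using the same variables as the snippet below:

import torch

# Sketch only: wrap the training step in anomaly detection so the
# device-mismatch error raised in backward() also prints the traceback
# of the forward op that created the failing tensor.
with torch.autograd.detect_anomaly():
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()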
Code snippet
import torch
import gpytorch
import os
import numpy as np
import urllib.request
from scipy.io import loadmat

dataset = 'protein'
if not os.path.isfile(f'../../datasets/UCI/{dataset}.mat'):
    print(f'Downloading \'{dataset}\' UCI dataset...')
    urllib.request.urlretrieve('https://drive.google.com/uc?export=download&id=1nRb8e7qooozXkNghC5eQS0JeywSXGX2S',
                               f'../../datasets/UCI/{dataset}.mat')

data = torch.Tensor(loadmat(f'../../datasets/UCI/{dataset}.mat')['data'])
n_train = 4000
train_x, train_y = data[:n_train, :-1], data[:n_train, -1]

output_device = torch.device('cuda:0')
train_x, train_y = train_x.contiguous().to(output_device), train_y.contiguous().to(output_device)

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.PiecewisePolynomialKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        print(f"mean_x.device: {mean_x.device} - {mean_x.size()}")
        print(f"covar_x.device: {covar_x.device} - {covar_x.size()}")
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
model = ExactGPModel(train_x, train_y, likelihood).to(output_device)

model.train()
likelihood.train()

optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

print(f"train device: x: {train_x.device}, y: {train_y.device}")

optimizer.zero_grad()
output = model(train_x)
loss = -mll(output, train_y)
print(f"loss.device: {loss.device}")
loss.backward()
optimizer.step()
Log output
train device: x: cuda:0, y: cuda:0
mean_x.device: cuda:0 - torch.Size([4000])
covar_x.device: cuda:0 - torch.Size([4000, 4000])
loss.device: cuda:0
Error message
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Input In [7], in <cell line: 4>()
2 output = model(train_x)
3 loss = -mll(output, train_y)
----> 4 loss.backward()
5 optimizer.step()
7 print(loss.item())
File ~/anaconda3/envs/pyg/lib/python3.8/site-packages/torch/_tensor.py:396, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
387 if has_torch_function_unary(self):
388 return handle_torch_function(
389 Tensor.backward,
390 (self,),
(...)
394 create_graph=create_graph,
395 inputs=inputs)
--> 396 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File ~/anaconda3/envs/pyg/lib/python3.8/site-packages/torch/autograd/__init__.py:173, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
168 retain_graph = create_graph
170 # The reason we repeat same the comment below is that
171 # some Python versions print out the first line of a multi-line function
172 # calls in the traceback and some print out the last line
--> 173 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
174 tensors, grad_tensors_, retain_graph, create_graph, inputs,
175 allow_unreachable=True, accumulate_grad=True)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Hmm I can take a look later. Mind renaming the issue to reflect that this isn't a MultiDeviceKernel issue?
Sure, I have renamed the issue so it refers to PiecewisePolynomialKernel.
🐛 Bug
I was experimenting with the Exact GP multi-GPU tutorial here. However, when the base kernel was changed from the RBF kernel to the piecewise polynomial kernel, an error showed up saying that tensors are not on the same device.
To reproduce
Code snippet to reproduce
Stack trace/error message
System information
Additional context
I further experimented with the training size, and a similar issue showed up with n_train = 100 when using the RBF kernel. Please see the error message below.