Rashfu opened this issue 1 year ago
Have you tried using a smaller learning rate? I don't think that this is a bug in GPyTorch, since we are using completely identical code for CPU and GPU.
Thank you for your reply.
I have tried smaller learning rates, starting from 0.3 and gradually decreasing to 0.01, but it still results in NaN values. As the learning rate decreases, the model fails to learn anything.
I have simplified my question. Now the code is completely identical for CPU and GPU; I only added the following lines:
```python
if torch.cuda.is_available():
    train_x = train_x.cuda()
    train_y = train_y.cuda()
    model = model.cuda()
    likelihood = likelihood.cuda()
```
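As a quick sanity check (just a sketch, using the model and likelihood defined in the full script below), one can confirm that the data, model, and likelihood all end up on the same device:

```python
# Sanity check: everything should report the same CUDA device, e.g. cuda:0
print(train_x.device, train_y.device)
print(next(model.parameters()).device)
print(next(likelihood.parameters()).device)
```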
The code runs fine on CPU, but on GPU, it throws a NumericalWarning: CG terminated. Doesn't this mean that GPyTorch has a bug?
I have attached a simple sample file; I hope you have time to take a look at it: test.zip
Many thanks!
I'm not going to open up your zip sample file. If you can post a small reproducible example in the chat here, then I will take a look.
Here is a simple example (using a GP for super-resolution). I use the pixel coordinates XY of the image as `train_x` and the RGB values as `train_y`. Everything works fine when I remove the `.cuda()` lines.
```python
import numpy as np
import torch
import gpytorch
from torchvision import transforms

# ori_image: the 60 × 60 source image (loaded elsewhere, e.g. with PIL)
image_tensor = transforms.ToTensor()(ori_image)
image_tensor = image_tensor.unsqueeze(0)
b, _, h, w = image_tensor.shape

# Pixel coordinates (scaled by 2) as the 2-D inputs
x = np.arange(w) * 2
y = np.arange(h) * 2
X, Y = np.meshgrid(x, y)
sample_x = torch.from_numpy(np.stack([X, Y], axis=-1).reshape(-1, 2))

# Flatten the image to (num_pixels, 3) RGB targets
sample_img = image_tensor.squeeze(0).reshape(3, -1).transpose(0, 1)

# One independent GP per colour channel, so the batch size is 3
batch_shape = sample_img.shape[-1]
train_x = sample_x.unsqueeze(0).repeat((batch_shape, 1, 1))
train_y = sample_img.transpose(0, 1)

# train_x: [3, 3600, 2], train_y: [3, 3600]
print('train_x shape:', train_x.shape, 'train_y shape:', train_y.shape)


class BatchGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_inputs, train_targets, likelihood, batch_shape, use_ard=False):
        super(BatchGPModel, self).__init__(train_inputs, train_targets, likelihood)
        ard_num_dims = train_inputs.shape[-1] if use_ard else None
        self.shape = torch.Size([batch_shape])
        self.mean_module = gpytorch.means.ConstantMean(
            batch_shape=self.shape,
            constant_constraint=gpytorch.constraints.Interval(0.0, 1.0),
        )
        self.base_kernel = gpytorch.kernels.RBFKernel(batch_shape=self.shape, ard_num_dims=ard_num_dims)
        self.covar_module = gpytorch.kernels.ScaleKernel(self.base_kernel, batch_shape=self.shape)

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)


# Initialize the likelihood and model; the batch shape matches the dimension of y
# (an RGB image has 3 channels)
likelihood = gpytorch.likelihoods.GaussianLikelihood(batch_shape=torch.Size([batch_shape]))
model = BatchGPModel(train_x, train_y, likelihood, batch_shape=batch_shape, use_ard=True)

if torch.cuda.is_available():
    train_x = train_x.cuda()
    train_y = train_y.cuda()
    model = model.cuda()
    likelihood = likelihood.cuda()

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the Adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(50):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y).sum()
    loss.backward()
    optimizer.step()
    print('Iter %d/%d - Loss: %.3f mean0: %.3f mean1: %.3f mean2: %.3f noise0: %.3f noise1: %.3f noise2: %.3f' % (
        i + 1, 50, loss.item(),
        model.mean_module.constant[0].item(),
        model.mean_module.constant[1].item(),
        model.mean_module.constant[2].item(),
        model.likelihood.noise[0].item(),
        model.likelihood.noise[1].item(),
        model.likelihood.noise[2].item()
    ))
```
With the `.cuda()` lines included, I get the following warnings during training:
```
/home/dell/anaconda3/envs/gpytorch/lib/python3.8/site-packages/linear_operator/utils/linear_cg.py:338: NumericalWarning: CG terminated in 1000 iterations with average residual norm 81.5545425415039 which is larger than the tolerance of 1 specified by linear_operator.settings.cg_tolerance. If performance is affected, consider raising the maximum number of CG iterations by running code in a linear_operator.settings.max_cg_iterations(value) context.
  warnings.warn(
Iter 1/50 - Loss: 2.826 mean0: 0.525 mean1: 0.475 mean2: 0.475 noise0: 0.644 noise1: 0.744 noise2: 0.744
/home/dell/anaconda3/envs/gpytorch/lib/python3.8/site-packages/linear_operator/utils/linear_cg.py:338: NumericalWarning: CG terminated in 1000 iterations with average residual norm 14.917732238769531 which is larger than the tolerance of 1 specified by linear_operator.settings.cg_tolerance. If performance is affected, consider raising the maximum number of CG iterations by running code in a linear_operator.settings.max_cg_iterations(value) context.
  warnings.warn(
Iter 2/50 - Loss: 2.931 mean0: 0.550 mean1: 0.469 mean2: 0.466 noise0: 0.607 noise1: 0.779 noise2: 0.780
/home/dell/anaconda3/envs/gpytorch/lib/python3.8/site-packages/linear_operator/utils/linear_cg.py:338: NumericalWarning: CG terminated in 1000 iterations with average residual norm 89.217041015625 which is larger than the tolerance of 1 specified by linear_operator.settings.cg_tolerance. If performance is affected, consider raising the maximum number of CG iterations by running code in a linear_operator.settings.max_cg_iterations(value) context.
  warnings.warn(
Iter 3/50 - Loss: 2.931 mean0: 0.574 mean1: 0.475 mean2: 0.465 noise0: 0.581 noise1: 0.772 noise2: 0.809
/home/dell/anaconda3/envs/gpytorch/lib/python3.8/site-packages/linear_operator/utils/linear_cg.py:338: NumericalWarning: CG terminated in 1000 iterations with average residual norm 456.7450866699219 which is larger than the tolerance of 1 specified by linear_operator.settings.cg_tolerance. If performance is affected, consider raising the maximum number of CG iterations by running code in a linear_operator.settings.max_cg_iterations(value) context.
  warnings.warn(
Iter 4/50 - Loss: 2.898 mean0: 0.598 mean1: 0.486 mean2: 0.469 noise0: 0.553 noise1: 0.768 noise2: 0.832
/home/dell/anaconda3/envs/gpytorch/lib/python3.8/site-packages/linear_operator/utils/linear_cg.py:338: NumericalWarning: CG terminated in 1000 iterations with average residual norm 36.63666915893555 which is larger than the tolerance of 1 specified by linear_operator.settings.cg_tolerance. If performance is affected, consider raising the maximum number of CG iterations by running code in a linear_operator.settings.max_cg_iterations(value) context.
  warnings.warn(
Iter 5/50 - Loss: 2.851 mean0: 0.621 mean1: 0.497 mean2: 0.476 noise0: 0.527 noise1: 0.765 noise2: 0.853
```
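As the warning itself suggests, one option is to raise the CG iteration cap, or to bypass CG entirely and use a dense Cholesky solve for the marginal log likelihood. A minimal sketch wrapping the training loop above in the corresponding gpytorch.settings context managers (the value 4000 is illustrative, and this may only mask the underlying instability rather than fix it):

```python
import gpytorch

# Raise the CG iteration cap, as the NumericalWarning suggests (value is illustrative).
with gpytorch.settings.max_cg_iterations(4000):
    for i in range(50):
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y).sum()
        loss.backward()
        optimizer.step()

# Alternative: skip CG and use a dense Cholesky factorization for the log likelihood,
# which is feasible at a few thousand points per batch.
# with gpytorch.settings.fast_computations(log_prob=False):
#     ...same training loop as above...
```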
Question Description
I am attempting to fit an exact GP regression on a dataset of ~10000 points. The `train_x` is 3 × 7740 × 2 (repeated from the base shape 7740 × 2) and the `train_y` is 3 × 7740, where 3 is the batch shape. Specifically, the input consists of 2-dimensional plane positions XY with decimal values ranging from -14 to -18, and the output is normalized RGB colors with 3 dimensions, ranging from 0 to 1. The three regression tasks from XY to RGB are independent of each other.

When following the Batch GP Regression tutorial:

- Training on the CPU: the code does not throw any errors, but it fails to converge and slows down as it runs. However, when I multiply the input `train_x` by 100, the Batch GP converges quickly and performs well.
- Training on the GPU: errors such as a `NaN loss` and the `NumericalWarning: CG terminated` may occur. I have tried multiplying the input `train_x` by 100 and normalizing it with Min-Max scaling, but neither helped. When I set the data and model to double precision, the `NaN loss` disappeared, but training became very slow (as expected for double precision) and still did not converge to a good solution. (A sketch of these preprocessing variants is included below.)

Is there an issue with my input data? This looks like numerical instability; I guess something goes wrong when computing the log likelihood here.
The data can be downloaded from the attached zip file: data.zip
Thanks in advance!
Here are the details about the data, code, and error on the GPU (attached as collapsed sections: Data Example, Code, Error Message).