cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch

[Bug] GPU Out of Memory with ProductKernel(GridInterpKernel()) #747

Open wjmaddox opened 5 years ago

wjmaddox commented 5 years ago

🐛 Bug

I'm dealing with an issue where memory usage on the CPU is relatively small, but GPU memory usage is considerably larger. The problem seems to be made worse by the relatively large number of trace samples I need in my application.

To reproduce

Code snippet to reproduce

import math

import torch
import gpytorch

# generate training data on an n x n grid over [0, 1]^2
n = 300
train_x = torch.zeros(pow(n, 2), 2)
for i in range(n):
    for j in range(n):
        train_x[i * n + j][0] = float(i) / (n-1)
        train_x[i * n + j][1] = float(j) / (n-1)
# true function is sin(2 * pi * (x0 + x1)) plus small Gaussian noise
train_y = torch.sin((train_x[:, 0] + train_x[:, 1]) * (2 * math.pi)) + torch.randn_like(train_x[:, 0]).mul(0.01)

# define SKI model
class GPRegressionModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(GPRegressionModel, self).__init__(train_x, train_y, likelihood)

        # SKI requires a grid size per input dimension
        grid_size = [250, 250]

        self.mean_module = gpytorch.means.ConstantMean()
        base = gpytorch.kernels.RBFKernel

        # one SKI kernel per input dimension, each interpolating on its own 1-d grid
        grid_kernels = [
            gpytorch.kernels.GridInterpolationKernel(
                base(),
                num_dims=1,  # each kernel sees only its single active dimension
                active_dims=(d,),
                grid_size=grid_size[d],
            )
            for d in range(train_x.size(-1))
        ]

        self.covar_module = gpytorch.kernels.ProductKernel(*grid_kernels)

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# running on cpu is fine
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPRegressionModel(train_x, train_y, likelihood)

likelihood.train()
model.train()

with gpytorch.settings.num_trace_samples(100), torch.no_grad():
    print(likelihood(model(train_x)).log_prob(train_y))

# put data on cuda
use_cuda = torch.cuda.is_available()
print('Cuda is available', use_cuda)
if use_cuda:
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
    train_x, train_y = train_x.cuda(), train_y.cuda()

# attempt to run on gpu
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPRegressionModel(train_x, train_y, likelihood)

likelihood.train()
model.train()

with gpytorch.settings.num_trace_samples(100), torch.no_grad():
    print(likelihood(model(train_x)).log_prob(train_y))

Stack trace/error message

~/anaconda3/envs/home/lib/python3.7/site-packages/gpytorch-0.3.2-py3.7.egg/gpytorch/lazy/lazy_tensor.py in _solve(self, rhs, preconditioner, num_tridiag)
    628             max_iter=settings.max_cg_iterations.value(),
    629             max_tridiag_iter=settings.max_lanczos_quadrature_iterations.value(),
--> 630             preconditioner=preconditioner,
    631         )
    632 

~/anaconda3/envs/home/lib/python3.7/site-packages/gpytorch-0.3.2-py3.7.egg/gpytorch/utils/linear_cg.py in linear_cg(matmul_closure, rhs, n_tridiag, tolerance, eps, stop_updating_after, max_iter, max_tridiag_iter, initial_guess, preconditioner)
    202         # Get next alpha
    203         # alpha_{k} = (residual_{k-1}^T precon_residual{k-1}) / (p_vec_{k-1}^T mat p_vec_{k-1})
--> 204         mvms = matmul_closure(curr_conjugate_vec)
    205         if precond:
    206             torch.mul(curr_conjugate_vec, mvms, out=mul_storage)

~/anaconda3/envs/home/lib/python3.7/site-packages/gpytorch-0.3.2-py3.7.egg/gpytorch/lazy/added_diag_lazy_tensor.py in _matmul(self, rhs)
     38 
     39     def _matmul(self, rhs):
---> 40         return torch.addcmul(self._lazy_tensor._matmul(rhs), self._diag_tensor._diag.unsqueeze(-1), rhs)
     41 
     42     def add_diag(self, added_diag):

~/anaconda3/envs/home/lib/python3.7/site-packages/gpytorch-0.3.2-py3.7.egg/gpytorch/lazy/mul_lazy_tensor.py in _matmul(self, rhs)
     55             # Now implement the formula (A . B) v = diag(A D_v B)
     56             left_res = left_res.view(*output_batch_shape, n, rank * m)
---> 57             left_res = self.right_lazy_tensor._matmul(left_res)
     58             left_res = left_res.view(*output_batch_shape, n, rank, m)
     59             res = left_res.mul_(left_root.unsqueeze(-1)).sum(-2)

~/anaconda3/envs/home/lib/python3.7/site-packages/gpytorch-0.3.2-py3.7.egg/gpytorch/lazy/root_lazy_tensor.py in _matmul(self, rhs)
     52 
     53     def _matmul(self, rhs):
---> 54         return self.root._matmul(self.root._t_matmul(rhs))
     55 
     56     def _t_matmul(self, rhs):

~/anaconda3/envs/home/lib/python3.7/site-packages/gpytorch-0.3.2-py3.7.egg/gpytorch/lazy/non_lazy_tensor.py in _matmul(self, rhs)
     38 
     39     def _matmul(self, rhs):
---> 40         return torch.matmul(self.tensor, rhs)
     41 
     42     def _prod_batch(self, dim):

RuntimeError: CUDA out of memory. Tried to allocate 3.39 GiB (GPU 0; 10.73 GiB total capacity; 3.71 GiB already allocated; 2.86 GiB free; 3.32 GiB cached)

Expected Behavior

https://github.com/cornellius-gp/gpytorch/blob/4c71e54f519fe1c51db8d0609c5a88532039f17b/gpytorch/lazy/mul_lazy_tensor.py#L49

directly evaluates the root tensor. This creates a much larger matrix than desired on the following line (90k x 10100 in this example), though it's possible this is necessary for the evaluation.
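
As a rough sanity check (float32 assumed; the 90k x 10100 shape is taken from the report above), the size of that evaluated matrix lines up with the failed allocation in the trace:

rows, cols = 90_000, 10_100
print(rows * cols * 4 / 2 ** 30)  # ~3.39 GiB, matching "Tried to allocate 3.39 GiB"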

Additional context

A possible resolution is to instead call GridInterpolationKernel(ProductKernel(RBFKernel())).
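
A minimal sketch of how that might be wired into the model above (my assumption, not confirmed in this thread: per-dimension RBF kernels via active_dims so the parameters stay separate, with a single shared 2-d interpolation grid):

self.covar_module = gpytorch.kernels.GridInterpolationKernel(
    gpytorch.kernels.ProductKernel(
        gpytorch.kernels.RBFKernel(active_dims=(0,)),  # hypothetical per-dimension base kernels
        gpytorch.kernels.RBFKernel(active_dims=(1,)),
    ),
    num_dims=2,
    grid_size=250,
)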

gpleiss commented 5 years ago

You should use a ProductStructureKernel rather than a ProductKernel. I think it should be more memory efficient.

See https://github.com/cornellius-gp/gpytorch/blob/master/examples/05_Scalable_GP_Regression_Multidimensional/Scalable_Kernel_Interpolation_for_Products_CUDA.ipynb
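
For concreteness, a minimal sketch following the pattern in that notebook (the grid_size of 250 is carried over from the repro above, not taken from the notebook):

self.covar_module = gpytorch.kernels.ProductStructureKernel(
    gpytorch.kernels.ScaleKernel(
        gpytorch.kernels.GridInterpolationKernel(
            gpytorch.kernels.RBFKernel(), grid_size=250, num_dims=1
        )
    ),
    num_dims=2,  # apply the 1-d SKI kernel to each dimension and take the product
)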

wjmaddox commented 5 years ago

If possible, I'd actually like to have separate parameters for each dimension of the kernel. This seems to rule out ProductStructureKernel, which, as I understand it, shares parameters across dimensions.

gpleiss commented 5 years ago

You can have different parameters for different dimensions by setting ard_num_dims on the base kernel.
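
For illustration, a quick sketch of what that gives you:

kernel = gpytorch.kernels.RBFKernel(ard_num_dims=2)
print(kernel.lengthscale.shape)  # torch.Size([1, 2]): one lengthscale per input dimension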

wjmaddox commented 5 years ago

Thanks, I'll look into trying that for my use case.

wjmaddox commented 5 years ago

I also seem to be getting memory errors, now on the CPU, at a grid size of 250 when switching the covariance module out for the following:

self.covar_module = gpytorch.kernels.ProductStructureKernel(
    gpytorch.kernels.GridKernel(
        gpytorch.kernels.RBFKernel(ard_num_dims=2), grid=train_x
    ),
    num_dims=2,
)

Stack trace

~/Documents/Code/gpytorch/gpytorch/kernels/kernel.py in _sq_dist(self, x1, x2, postprocess, x1_eq_x2)
     38             x1_ = torch.cat([-2. * x1, x1_norm, x1_pad], dim=-1)
     39             x2_ = torch.cat([x2, x2_pad, x2_norm], dim=-1)
---> 40             res = x1_.matmul(x2_.transpose(-2, -1))
     41 
     42             if x1_eq_x2:

RuntimeError: [enforce fail at CPUAllocator.cpp:56] posix_memalign(&data, gAlignment, nbytes) == 0. 12 vs 0
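
If I'm reading the trace right (an assumption, not confirmed in this thread: _sq_dist is materializing a dense pairwise-distance matrix over all grid points, and grid=train_x passes the full training set in as the grid), the allocation is simply too large for CPU RAM:

m = train_x.size(0)  # with grid=train_x, the grid has one point per training example
print(m * m * 4 / 2 ** 30)  # ~30.2 GiB for one dense m x m float32 matrix when m = 90,000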