m-julian closed this issue 1 year ago
Also related to the bug above: if I set KERNEL_CHECKPOINT_SIZE = 0, which should mean that no partitioning is used (as in the tutorial), then I get the following error:
Traceback (most recent call last):
File "..................../multi_gpu_kernel/example_multi_gpu_gpytorch/example_multi_gpu.py", line 110, in <module>
model, likelihood = train(train_x, train_y,
File "..................../multi_gpu_kernel/example_multi_gpu_gpytorch/example_multi_gpu.py", line 90, in train
loss.backward()
File "...................../.venv/venv_gpytorch/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "...................../.venv/venv_gpytorch/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "...................../.venv/venv_gpytorch/lib/python3.9/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "...................../.venv/venv_gpytorch/lib/python3.9/site-packages/gpytorch/functions/_pivoted_cholesky.py", line 107, in backward
Krows = apply_permutation(matrix, full_permutation, short_permutation)
File "...................../.venv/venv_gpytorch/lib/python3.9/site-packages/gpytorch/utils/permutation.py", line 79, in apply_permutation
return delazify(matrix.__getitem__((*batch_idx, left_permutation.unsqueeze(-1), right_permutation.unsqueeze(-2))))
File "...................../.venv/venv_gpytorch/lib/python3.9/site-packages/gpytorch/lazy/lazy_tensor.py", line 2268, in __getitem__
res = self._get_indices(row_index, col_index, *batch_indices)
File "...................../.venv/venv_gpytorch/lib/python3.9/site-packages/gpytorch/lazy/cat_lazy_tensor.py", line 184, in _get_indices
return torch.cat(res_list).view(target_shape).to(self.device)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_cat)
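For reference, this is roughly how the checkpoint size is used during training; a minimal sketch following the tutorial's pattern, where the model, likelihood, mll and optimizer setup is assumed rather than taken from my exact script:

import gpytorch

# KERNEL_CHECKPOINT_SIZE = 0 should disable kernel partitioning entirely,
# matching the "no partitioning" case from the tutorial.
KERNEL_CHECKPOINT_SIZE = 0

def train_step(model, likelihood, mll, optimizer, train_x, train_y):
    model.train()
    likelihood.train()
    with gpytorch.beta_features.checkpoint_kernel(KERNEL_CHECKPOINT_SIZE):
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y)
        loss.backward()  # the device-mismatch RuntimeError above is raised here
        optimizer.step()
    return loss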
One more issue related to this: using a larger preconditioner size also breaks, because the rows returned by apply_permutation can be on different devices, so row.gather then raises a RuntimeError. Using a very small preconditioner size (I tried 1 and 2) did not give the error below, as the returned rows happened to be on the same device.
Traceback (most recent call last):
File "/...................../multi_gpu_kernel/example_multi_gpu_gpytorch/example_multi_gpu.py", line 110, in <module>
model, likelihood = train(train_x, train_y,
File "/...................../multi_gpu_kernel/example_multi_gpu_gpytorch/example_multi_gpu.py", line 89, in train
loss = closure()
File "/...................../multi_gpu_kernel/example_multi_gpu_gpytorch/example_multi_gpu.py", line 86, in closure
loss = -mll(output, train_y)
File "...................../.venv/venv_gpytorch/lib/python3.9/site-packages/gpytorch/module.py", line 30, in __call__
outputs = self.forward(*inputs, **kwargs)
File "...................../.venv/venv_gpytorch/lib/python3.9/site-packages/gpytorch/mlls/exact_marginal_log_likelihood.py", line 62, in forward
res = output.log_prob(target)
File "...................../.venv/venv_gpytorch/lib/python3.9/site-packages/gpytorch/distributions/multivariate_normal.py", line 169, in log_prob
inv_quad, logdet = covar.inv_quad_logdet(inv_quad_rhs=diff.unsqueeze(-1), logdet=True)
File "...................../.venv/venv_gpytorch/lib/python3.9/site-packages/gpytorch/lazy/lazy_tensor.py", line 1338, in inv_quad_logdet
preconditioner, precond_lt, logdet_p = self._preconditioner()
File "...................../.venv/venv_gpytorch/lib/python3.9/site-packages/gpytorch/lazy/added_diag_lazy_tensor.py", line 100, in _preconditioner
self._piv_chol_self = self._lazy_tensor.pivoted_cholesky(rank=max_iter)
File "...................../.venv/venv_gpytorch/lib/python3.9/site-packages/gpytorch/lazy/lazy_tensor.py", line 1538, in pivoted_cholesky
res, pivots = func(self.representation_tree(), rank, error_tol, *self.representation())
File "...................../.venv/venv_gpytorch/lib/python3.9/site-packages/gpytorch/functions/_pivoted_cholesky.py", line 70, in forward
L_m_new = row.gather(-1, pi_i)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_gather)
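For reference, the preconditioner size is set through the gpytorch.settings.max_preconditioner_size context manager; a minimal sketch, where the model, mll and training data are assumed to be set up as in the tutorial and 100 is just an example of a larger value:

import gpytorch

def compute_loss(model, mll, train_x, train_y, precond_size=100):
    # The preconditioner (pivoted Cholesky) is built lazily inside
    # inv_quad_logdet when the marginal log likelihood is evaluated; this is
    # where the row.gather device mismatch appears for larger sizes, while
    # sizes of 1 or 2 happened to keep all returned rows on one device.
    with gpytorch.settings.max_preconditioner_size(precond_size):
        output = model(train_x)
        return -mll(output, train_y)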
Closing because checkpointing is now deprecated (as of v1.11)
🐛 Bug
I am replicating the Exact GP Regression with Multiple GPUs and Kernel Partitioning notebook. However, it seems that making the kernel partition size larger than the number of training data points (I have reduced the training set size from the example notebook) results in an error when accessing gradients.
To reproduce
Code snippet to reproduce
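The original snippet is not captured in this copy of the issue; as a stand-in, here is a minimal sketch of the setup described above, based on the tutorial code. The data, the exact sizes and the training loop details are assumptions, not the original script:

import math
import torch
import gpytorch

n_devices = torch.cuda.device_count()
output_device = torch.device('cuda:0')

# Assumed sizes: a training set reduced from the tutorial, with the kernel
# partition (checkpoint) size deliberately larger than the number of points.
N_TRAIN = 4000
KERNEL_CHECKPOINT_SIZE = 10000

train_x = torch.linspace(0, 1, N_TRAIN, device=output_device)
train_y = torch.sin(train_x * (2 * math.pi)) + 0.1 * torch.randn_like(train_x)

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, n_devices):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        base_covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
        # Spread the kernel computation across all visible GPUs, as in the tutorial.
        self.covar_module = gpytorch.kernels.MultiDeviceKernel(
            base_covar_module, device_ids=range(n_devices), output_device=output_device
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
model = ExactGPModel(train_x, train_y, likelihood, n_devices).to(output_device)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

model.train()
likelihood.train()
with gpytorch.beta_features.checkpoint_kernel(KERNEL_CHECKPOINT_SIZE):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()  # with checkpoint size > N_TRAIN, gradient access fails here
    optimizer.step()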
Stack trace/error message
Expected Behavior
sub_x1.grad should not be NoneType; it should instead contain the gradient, and the model should train successfully. When I change the variable to KERNEL_CHECKPOINT_SIZE = 3000 (so less than the training set size), then the model trains and the correct output is produced.
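As an illustration of the expected behaviour, here is a minimal, standalone PyTorch example, unrelated to the GP model above: after backward(), a leaf tensor that requires gradients carries a .grad of the same shape rather than None, and the same should hold for the partitioned sub-tensor sub_x1.

import torch

x = torch.linspace(0, 1, 10, requires_grad=True)
loss = (x ** 2).sum()
loss.backward()

# This is what I would expect for sub_x1 as well: .grad populated, not None.
assert x.grad is not None
assert x.grad.shape == x.shape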
System information