Bibyutatsu opened this issue 4 years ago
I am also getting a RuntimeError while running the tutorial, but only when I set checkpoint_size=10000. When I set checkpoint_size=0, the training completes without any issues.
This is the exact traceback in my case:
model, likelihood = train(train_x, train_y,
                          n_devices=n_devices, output_device=output_device,
                          checkpoint_size=10000,
                          preconditioner_size=100,
                          n_training_iter=20)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-16-dba074377e42> in <module>
3 checkpoint_size=10000,
4 preconditioner_size=100,
----> 5 n_training_iter=20)
<ipython-input-14-4da03f9af78d> in train(train_x, train_y, n_devices, output_device, checkpoint_size, preconditioner_size, n_training_iter)
43
44 loss = closure()
---> 45 loss.backward()
46
47 for i in range(n_training_iter):
/opt/conda/envs/torchenv/lib/python3.7/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
196 products. Defaults to ``False``.
197 """
--> 198 torch.autograd.backward(self, gradient, retain_graph, create_graph)
199
200 def register_hook(self, hook):
/opt/conda/envs/torchenv/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
98 Variable._execution_engine.run_backward(
99 tensors, grad_tensors, retain_graph, create_graph,
--> 100 allow_unreachable=True) # allow_unreachable flag
101
102
RuntimeError: start (1250) + length (1250) exceeds dimension size (1250).
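For context, the train helper here comes from GPyTorch's multi-GPU tutorial, and the failing loss.backward() (line 45 in the traceback) sits inside the kernel-partitioning context. Below is a condensed sketch of that structure, reconstructed from the public tutorial rather than the reporter's exact code (the tutorial uses a custom FullBatchLBFGS optimizer, simplified here to Adam):

import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, n_devices, output_device):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        base_kernel = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
        # MultiDeviceKernel splits the kernel computation across n_devices GPUs
        self.covar_module = gpytorch.kernels.MultiDeviceKernel(
            base_kernel, device_ids=range(n_devices), output_device=output_device
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

def train(train_x, train_y, n_devices, output_device,
          checkpoint_size, preconditioner_size, n_training_iter):
    likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
    model = ExactGPModel(train_x, train_y, likelihood,
                         n_devices, output_device).to(output_device)
    model.train()
    likelihood.train()
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

    # checkpoint_size > 0 evaluates the kernel matrix in row-partitions of that
    # size (trading compute for memory); checkpoint_size=0 disables partitioning.
    with gpytorch.beta_features.checkpoint_kernel(checkpoint_size), \
         gpytorch.settings.max_preconditioner_size(preconditioner_size):
        for _ in range(n_training_iter):
            optimizer.zero_grad()
            loss = -mll(model(train_x), train_y)
            loss.backward()  # <- the RuntimeError above surfaces here
            optimizer.step()
    return model, likelihood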
Are you also using 8 devices in the gpytorch tutorial?
Hi KeAWang, yes, I am using 8 devices.
It's a bit hard for me to reproduce this issue at the moment as I don't have access to 8 GPUs. Are you able to reproduce this on, say, 1 GPU with checkpointing?
Hi KeAWang, no, on 1 GPU it works correctly; this error only occurs when I use multiple GPUs.
🐛 Bug
Hi, I wanted to train SingleTaskGP on multiple GPUs, as I have 8 cards on my node. I searched and found GPyTorch's MultiDeviceKernel, which can be used to accomplish this, but I couldn't find anything similar in the BoTorch modules. So I changed the covar_module of the SingleTaskGP to use this specific kernel, and I hit the bug: I am unable to train my BoTorch SingleTaskGP on multiple GPUs with GPyTorch's MultiDeviceKernel.

To reproduce
Code snippet to reproduce
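A minimal sketch of the setup described above (hypothetical data shapes; the default SingleTaskGP kernel wrapped in GPyTorch's MultiDeviceKernel, not necessarily the reporter's exact code):

import torch
import gpytorch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_model
from gpytorch.mlls import ExactMarginalLogLikelihood

n_devices = torch.cuda.device_count()  # 8 on the reporter's node
output_device = torch.device("cuda:0")

# Hypothetical training data; shapes are illustrative only
train_X = torch.rand(10000, 6, device=output_device, dtype=torch.double)
train_Y = torch.sin(train_X.sum(dim=-1, keepdim=True))

model = SingleTaskGP(train_X, train_Y)
# Wrap the model's default kernel so its evaluation is spread across all GPUs
model.covar_module = gpytorch.kernels.MultiDeviceKernel(
    base_kernel=model.covar_module,
    device_ids=range(n_devices),
    output_device=output_device,
).to(output_device)
model = model.to(output_device)

mll = ExactMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_model(mll)  # training fails here when multiple GPUs are used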
Stack trace/error message
Expected Behavior
I expected all of the GPUs to be used to train the model, so that training scales across multiple GPUs for faster execution and sampling.
System information
Please complete the following information: