When running the train_video.py example using the option --cuda I get the following error:
RuntimeError: CUDA error: an illegal memory access was encountered
To Reproduce
Steps to reproduce the behavior:
When running:
python train_video.py -d data --batch-size 16 -lr 1e-4 --cuda --save
I get the following error:
Traceback (most recent call last):
File "/home/user/videocompression/main.py", line 473, in <module>
main(sys.argv[1:])
File "/home/user/videocompression/main.py", line 442, in main
train_one_epoch(
File "/home/user/videocompression/main.py", line 236, in train_one_epoch
out_criterion["loss"].backward()
File "/home/user/.conda/envs/compression/lib/python3.9/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/user/.conda/envs/compression/lib/python3.9/site-packages/torch/autograd/__init__.py", line 145, in backward
Variable._execution_engine.run_backward(
File "/home/user/.conda/envs/compression/lib/python3.9/site-packages/torch/autograd/function.py", line 89, in apply
return self._forward_cls.backward(self, *args) # type: ignore
File "/home/user/videocompression/compressai/compressai/layers/layers.py", line 293, in backward
grad_input[input < 0] = grad_sub[input < 0]
RuntimeError: CUDA error: an illegal memory access was encountered
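Note that CUDA reports errors asynchronously, so the line shown in the traceback is not necessarily where the illegal access actually happens. A common way to localize it is to re-run the same command with synchronous kernel launches:

```shell
# Force synchronous CUDA launches so the Python traceback points at the
# kernel that actually faults, rather than a later, unrelated op:
CUDA_LAUNCH_BLOCKING=1 python train_video.py -d data --batch-size 16 -lr 1e-4 --cuda --save
```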
Expected behavior
The training completes without any issue.
Environment
Please copy and paste the output from python3 -m torch.utils.collect_env
By carefully debugging the script, I found that the bug might be caused by the flow variable created at video/google.py#L378.
Indeed, accessing this variable triggers a CUDNN_STATUS_NOT_INITIALIZED error, and replacing it with a torch.rand_like(flow) tensor resolves the crash.
As a workaround, I temporarily replaced that line as follows:
def forward_prediction(self, x_ref, motion_info):
    # FIXME: chunk() here raises "cuDNN error: CUDNN_STATUS_NOT_INITIALIZED";
    # possibly a cuDNN version problem.
    # flow, scale_field = motion_info.chunk(2, dim=1)
    # Possible workaround: slice the channels explicitly instead.
    flow, scale_field = motion_info[:, :2, :, :], motion_info[:, -1, :, :].unsqueeze(1)
    volume = self.gaussian_volume(x_ref, self.sigma0, self.num_levels)
    x_pred = self.warp_volume(volume, flow, scale_field)  # volume, [gx, gy], gz
    return x_pred
This workaround seems to work (training starts and completes with no further errors), even though I am not 100% sure that the two operations are equivalent.
I think this problem may be related to the cuDNN installation on the server I am using for the experiments. I wonder whether anyone else is experiencing a similar issue.
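For what it's worth, the two chunking approaches can be compared directly on a dummy tensor. This is a quick sanity check, assuming motion_info has 3 channels (which the 2-channel flow plus 1-channel scale field suggests); the shapes here are made up for illustration:

```python
import torch

# Hypothetical motion_info of shape (N, C=3, H, W).
motion_info = torch.randn(2, 3, 8, 8)

# Original line: chunk(2, dim=1) on 3 channels yields chunks of
# ceil(3/2) = 2 and 1 channels.
flow_a, scale_a = motion_info.chunk(2, dim=1)

# Workaround: explicit slicing, keeping the channel dim on the scale field.
flow_b = motion_info[:, :2, :, :]
scale_b = motion_info[:, -1, :, :].unsqueeze(1)

print(torch.equal(flow_a, flow_b), torch.equal(scale_a, scale_b))  # True True
```

So for 3-channel input the two formulations produce identical tensors; they would only diverge if the channel count changed.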