Nicny opened this issue 5 years ago
I'm running into the same problem. Have you solved it?
Same issue: `RuntimeError: CUDA error: an illegal memory access was encountered`. It can happen right after the model is built, before training starts, or after a single epoch has finished. Is there any solution?
I read somewhere that this is a problem with cuDNN. As a temporary workaround, use a batch size that evenly divides the number of training samples.
E.g. the training set has 9840 samples, so use a batch size such as 16, 80, etc.
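If you cannot pick a divisible batch size, a related workaround is to drop the ragged final batch instead. This is only a sketch assuming the training loop uses a standard `DataLoader`; `train_set` below is a dummy dataset standing in for the real one (9840 samples, as in the example above):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset with the same length as the training set above.
train_set = TensorDataset(torch.zeros(9840, 3))

# drop_last=True discards the final partial batch, so every batch the
# model sees has exactly batch_size samples even when batch_size does
# not divide the dataset size (9840 / 64 leaves a remainder of 48).
loader = DataLoader(train_set, batch_size=64, drop_last=True)

batch_sizes = {batch[0].shape[0] for batch in loader}
assert batch_sizes == {64}          # every batch is full-sized
assert len(loader) == 9840 // 64    # 153 full batches; the last 48 samples are dropped
```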
edit: Found the root cause. It is due to the slicing performed in the code. Here is a fix you can try:
1) Replace these lines of code in model.py:

```python
KNN_Graph = torch.zeros(B, N, self.K, self.layers[0]).cuda()
for idx, x in enumerate(X):
    self.KNN_Graph[idx] = self.createSingleKNNGraph(x)
```
with:
```python
a = torch.pow(X, 2).sum(dim=-1, keepdim=True).repeat(1, 1, N)
b = torch.pow(X, 2).sum(dim=-1, keepdim=True).repeat(1, 1, N).transpose(2, 1)
dist_mat = a + b
# baddbmm is out-of-place, so the result must be assigned
# (or use the in-place variant baddbmm_):
dist_mat = dist_mat.baddbmm(X, X.transpose(2, 1), beta=1, alpha=-2)
dist_mat_sorted, sorted_indices = torch.sort(dist_mat, dim=-1)
knn_indexes = sorted_indices[:, :, 1:self.K + 1]
KNN_Graph = X[:, knn_indexes]
KNN_Graph = KNN_Graph[torch.arange(B), torch.arange(B), :, :, :]
```
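For anyone who wants to convince themselves the batched version is correct before patching model.py, here is a self-contained toy check of the same construction. The shapes `B, N, C, K` are made up for the demo, and it runs on CPU; it verifies the broadcasted squared distances and the diagonal-gather step against a naive per-batch computation:

```python
import torch

torch.manual_seed(0)
B, N, C, K = 2, 6, 3, 2          # batches, points, coordinates, neighbours
X = torch.randn(B, N, C)

# ||x_i||^2 + ||x_j||^2 broadcast to a (B, N, N) grid.
a = torch.pow(X, 2).sum(dim=-1, keepdim=True).repeat(1, 1, N)
b = a.transpose(2, 1)
# Subtract 2 * X @ X^T to obtain squared Euclidean distances.
dist_mat = (a + b).baddbmm(X, X.transpose(2, 1), beta=1, alpha=-2)

# Column 0 of each sorted row is the point itself, so skip it.
_, sorted_indices = torch.sort(dist_mat, dim=-1)
knn_indexes = sorted_indices[:, :, 1:K + 1]              # (B, N, K)

# X[:, knn_indexes] has shape (B, B, N, K, C); taking the diagonal over
# the first two axes keeps each batch's own neighbours -> (B, N, K, C).
KNN_Graph = X[:, knn_indexes][torch.arange(B), torch.arange(B)]

# Sanity checks against a naive per-batch computation.
for bb in range(B):
    naive = ((X[bb][:, None, :] - X[bb][None, :, :]) ** 2).sum(dim=-1)
    assert torch.allclose(dist_mat[bb], naive, atol=1e-5)
    assert torch.equal(KNN_Graph[bb], X[bb][knn_indexes[bb]])
assert KNN_Graph.shape == (B, N, K, C)
```

The key point is that everything is a single batched tensor op, so no per-sample slice assignment into a preallocated CUDA tensor is needed.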
2) Replace:

```python
x_in = torch.cat([x1, x2], dim=3)
```

with:

```python
x_in = torch.cat([x1, x2], dim=-1)
```
When I run train.py, something goes wrong:

```
    loss.backward(retain_graph=True)
  File "/home/zhang/miniconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/zhang/miniconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: merge_sort: failed to synchronize: an illegal memory access was encountered
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCTensorCopy.c line=70 error=77 : an illegal memory access was encountered
```
Environment: PyTorch 0.4.0, CUDA 9.0, cuDNN 7.4, Python 3.6