Nicny opened this issue 5 years ago
I'm running into the same problem. Have you solved it?
Same issue: `RuntimeError: CUDA error: an illegal memory access was encountered`. It can happen right after the model is built, before training starts, or after a single epoch has finished. Is there any solution?
I read somewhere that this is a problem with cuDNN. As a temporary workaround, use a batch size that evenly divides the number of training samples.
E.g. the training set has 9840 samples, so use a batch size such as 16, 80, etc.
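If you cannot pick a divisible batch size, a related workaround is to drop the ragged final batch instead. This is only a sketch assuming the training loop uses a standard `DataLoader`; `train_set` below is a dummy dataset standing in for the real one (9840 samples, as in the example above):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset with the same length as the training set above.
train_set = TensorDataset(torch.zeros(9840, 3))

# drop_last=True discards the final partial batch, so every batch the
# model sees has exactly batch_size samples even when batch_size does
# not divide the dataset size (9840 / 64 leaves a remainder of 48).
loader = DataLoader(train_set, batch_size=64, drop_last=True)

batch_sizes = {batch[0].shape[0] for batch in loader}
assert batch_sizes == {64}          # every batch is full-sized
assert len(loader) == 9840 // 64    # 153 full batches; the last 48 samples are dropped
```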
edit: Found the root cause. It is due to the slicing performed in the code. Here is a fix you can try:
1) Replace these lines of code in model.py:

```python
KNN_Graph = torch.zeros(B, N, self.K, self.layers[0]).cuda()
for idx, x in enumerate(X):
    self.KNN_Graph[idx] = self.createSingleKNNGraph(x)
```
with:
```python
a = torch.pow(X, 2).sum(dim=-1, keepdim=True).repeat(1, 1, N)
b = torch.pow(X, 2).sum(dim=-1, keepdim=True).repeat(1, 1, N).transpose(2, 1)
dist_mat = a + b
# baddbmm is out-of-place, so the result must be assigned
# (or use the in-place variant baddbmm_):
dist_mat = dist_mat.baddbmm(X, X.transpose(2, 1), beta=1, alpha=-2)
dist_mat_sorted, sorted_indices = torch.sort(dist_mat, dim=-1)
knn_indexes = sorted_indices[:, :, 1:self.K + 1]
KNN_Graph = X[:, knn_indexes]
KNN_Graph = KNN_Graph[torch.arange(B), torch.arange(B), :, :, :]
```
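For anyone who wants to convince themselves the batched version is correct before patching model.py, here is a self-contained toy check of the same construction. The shapes `B, N, C, K` are made up for the demo, and it runs on CPU; it verifies the broadcasted squared distances and the diagonal-gather step against a naive per-batch computation:

```python
import torch

torch.manual_seed(0)
B, N, C, K = 2, 6, 3, 2          # batches, points, coordinates, neighbours
X = torch.randn(B, N, C)

# ||x_i||^2 + ||x_j||^2 broadcast to a (B, N, N) grid.
a = torch.pow(X, 2).sum(dim=-1, keepdim=True).repeat(1, 1, N)
b = a.transpose(2, 1)
# Subtract 2 * X @ X^T to obtain squared Euclidean distances.
dist_mat = (a + b).baddbmm(X, X.transpose(2, 1), beta=1, alpha=-2)

# Column 0 of each sorted row is the point itself, so skip it.
_, sorted_indices = torch.sort(dist_mat, dim=-1)
knn_indexes = sorted_indices[:, :, 1:K + 1]              # (B, N, K)

# X[:, knn_indexes] has shape (B, B, N, K, C); taking the diagonal over
# the first two axes keeps each batch's own neighbours -> (B, N, K, C).
KNN_Graph = X[:, knn_indexes][torch.arange(B), torch.arange(B)]

# Sanity checks against a naive per-batch computation.
for bb in range(B):
    naive = ((X[bb][:, None, :] - X[bb][None, :, :]) ** 2).sum(dim=-1)
    assert torch.allclose(dist_mat[bb], naive, atol=1e-5)
    assert torch.equal(KNN_Graph[bb], X[bb][knn_indexes[bb]])
assert KNN_Graph.shape == (B, N, K, C)
```

The key point is that everything is a single batched tensor op, so no per-sample slice assignment into a preallocated CUDA tensor is needed.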
2) Replace:

```python
x_in = torch.cat([x1, x2], dim=3)
```

with:

```python
x_in = torch.cat([x1, x2], dim=-1)
```
When I run train.py, something goes wrong:

```
    loss.backward(retain_graph=True)
  File "/home/zhang/miniconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/zhang/miniconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: merge_sort: failed to synchronize: an illegal memory access was encountered
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCTensorCopy.c line=70 error=77 : an illegal memory access was encountered
```
Environment: PyTorch 0.4.0, CUDA 9.0, cuDNN 7.4, Python 3.6