NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #112

Open franciscorubin opened 5 years ago

franciscorubin commented 5 years ago

I get the following error every time I try to do a forward call with apex:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-20-c83117740453> in <module>
      1 #%%pixie_debugger
      2 while True:
----> 3     train(verbose=False, optimize_memory=True, optimize_feature=False)
      4     with open('temp/memory.pkl', 'wb') as f:
      5         pickle.dump(net.memory_model.memory, f)

<ipython-input-19-7e6a3b51254d> in train(verbose, optimize_memory, optimize_feature)
     11         optimizer_both.zero_grad()
     12 
---> 13         similarities = net(batch_data)
     14 
     15         values, indices = similarities.max(1)

~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

<ipython-input-13-fa199304f042> in forward(self, images)
     23         queries = self.feature_model(images)
     24         #print(queries)
---> 25         similarities = self.memory_model(queries)
     26 #        print(sorted(similarities, reverse=True))
     27         return similarities

~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

<ipython-input-12-ea8dad5c6180> in forward(self, queries)
     44 
     45     def forward(self, queries):
---> 46         sim_vector = self.get_similarity_vectors(queries)
     47         return sim_vector

<ipython-input-12-ea8dad5c6180> in get_similarity_vectors(self, queries)
     39 
     40     def get_similarity_vectors(self, queries):
---> 41         similarity = self.apply_combined(queries, self.memory, self.head_model)
     42 #        print(similarity)
     43         return nn.functional.log_softmax(similarity * 10000) # multiply because of rounding errors

<ipython-input-12-ea8dad5c6180> in apply_combined(self, x, y, func)
     34         assert x.shape == y.shape
     35 
---> 36         res = func(x, y)
     37         res = res.view(n, m)
     38         return res

~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

~/Projects/Personal/Kaggle/humpwin/pancho111203/siamese/model.py in forward(self, x, y)
    131         out = nn.functional.relu(out, inplace=True)
    132         out = out.permute((0, 3, 1, 2))
--> 133         out = self.conv2(out)
    134         out = out.view(batch_size, n_features)
    135 

~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

~/miniconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py in forward(self, input)
    318     def forward(self, input):
    319         return F.conv2d(input, self.weight, self.bias, self.stride,
--> 320                         self.padding, self.dilation, self.groups)
    321 
    322 

~/miniconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py in wrapper(*args, **kwargs)
     24                                      args,
     25                                      kwargs)
---> 26         return orig_fn(*new_args, **kwargs)
     27     return wrapper
     28 

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

CUDNN logs: https://gist.github.com/pancho111203/3e91f0b46ab0be3b04f1edc9c1405684

mcarilli commented 5 years ago

This might be a cuDNN issue, especially if you're using cuDNN 7.2. To check which version PyTorch is using, run:

>>> import torch
>>> torch.backends.cudnn.version()

Upgrading your cudnn version may fix it: https://github.com/NVIDIA/apex/issues/78#issuecomment-440301134

Container options are

franciscorubin commented 5 years ago

I tried updating and unfortunately the error persists. The command you mentioned outputs 7401.
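For reference, the integer returned by `torch.backends.cudnn.version()` packs the version as major * 1000 + minor * 100 + patch for cuDNN 7.x, so 7401 reads as cuDNN 7.4.1. A minimal sketch of the decoding follows; the helper name `decode_cudnn_version` is hypothetical and not part of PyTorch, and the encoding is only assumed to hold for the older releases discussed here.

```python
def decode_cudnn_version(version: int) -> tuple:
    """Split a cuDNN version integer into (major, minor, patch).

    Assumes the encoding major * 1000 + minor * 100 + patch used by
    cuDNN 7.x; newer releases may use a different scheme.
    """
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


print(decode_cudnn_version(7401))  # (7, 4, 1)
```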

njean78 commented 5 years ago

Just having a similar issue:

    318     def forward(self, input):
    319         return F.conv2d(input, self.weight, self.bias, self.stride,
--> 320                         self.padding, self.dilation, self.groups)
    321 
    322 

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I'm running on Windows 10 with cuDNN 7.4.2 and CUDA 10. Are the others who have this problem running on Windows or on Linux?

P.S.: I am using an NVIDIA TITAN RTX.

franciscorubin commented 5 years ago

@njean78 I am running Ubuntu 16.04 (Linux), so it looks like the error is OS-independent.

njean78 commented 5 years ago

Solved my issue by installing the PyTorch build for CUDA 10 (got it from https://pytorch.org/). I was probably using the one for CUDA 9...

mcarilli commented 5 years ago

> I tried updating and unfortunately the error persists. The command you mentioned outputs 7401.

@pancho111203 Since you've got cuda 10 on bare metal (meaning your system has the cuda 10 driver) you should be using Pytorch for cuda 10. When you say "I tried updating" do you mean you only updated cudnn, or did you try running in one of the cuda 10 containers I mentioned?
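One way to see which CUDA version the installed PyTorch binary targets is to print the build information from Python. This is only a sketch: the function name `describe_torch_build` and the dictionary keys are made up for illustration, while the `torch` attributes it reads are standard ones.

```python
import importlib.util


def describe_torch_build() -> dict:
    """Collect the versions that matter for a CUDA/cuDNN mismatch."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this environment.
        return {"torch": None}

    import torch

    info = {
        "torch": torch.__version__,
        # CUDA version the binary was built against (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        "cudnn": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return info


for key, value in describe_torch_build().items():
    print(f"{key}: {value}")
```

If `built_for_cuda` shows 9.x on a machine whose driver was installed for CUDA 10 (the version `nvidia-smi` reports), that is the kind of mismatch described above.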

moyans commented 5 years ago

If you are running PyTorch in Docker, you should be aware of this issue: https://github.com/NVIDIA/tacotron2/issues/109

zhixuanli commented 5 years ago

I'm still having this problem: RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Can anyone give me some help? Thanks a lot!

zhixuanli commented 5 years ago

Actually, this only happens on GPU card 3; the other GPU cards work fine.

I only use one GPU at a time.
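To narrow down a failure that appears on only one card, a small per-device smoke test can help. The sketch below runs a tiny half-precision convolution on each visible GPU and records whether it succeeds. The function name `conv_smoke_test`, the layer sizes, and the input shape are arbitrary choices for illustration and are not taken from the model in this thread.

```python
import importlib.util


def conv_smoke_test() -> dict:
    """Run a tiny fp16 convolution on each visible GPU; map index -> status."""
    results = {}
    if importlib.util.find_spec("torch") is None:
        return results  # PyTorch is not installed

    import torch

    if not torch.cuda.is_available():
        return results  # no usable GPU

    for index in range(torch.cuda.device_count()):
        device = torch.device("cuda", index)
        try:
            conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device).half()
            x = torch.randn(4, 3, 32, 32, device=device).half()
            conv(x)
            # CUDA calls are asynchronous; synchronize so errors surface here.
            torch.cuda.synchronize(device)
            results[index] = "ok"
        except RuntimeError as err:
            results[index] = f"failed: {err}"
    return results


print(conv_smoke_test())
```

After a CUDA error, later results in the same process may be unreliable, so it is worth rerunning with `CUDA_VISIBLE_DEVICES` set to the suspect card alone to confirm.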

ptrblck commented 5 years ago

@zhixuanli Which GPUs are you using and do you have a reproducible code snippet? Was apex installed successfully?

larifreitas commented 2 years ago

Check the file path. It worked for me.