Eric-mingjie / rethinking-network-pruning

Rethinking the Value of Network Pruning (Pytorch) (ICLR 2019)
MIT License

RuntimeError: CUDNN_STATUS_EXECUTION_FAILED #28

Open jefersonf opened 4 years ago

jefersonf commented 4 years ago

I've installed the correct requirements. But after running this: python main.py --dataset cifar10 --arch vgg --depth 16

I'm getting the following error:

Traceback (most recent call last):
  File "main.py", line 166, in <module>
    train(epoch)
  File "main.py", line 125, in train
    output = model(data)
  File "/home/jeferson/repo/rethinking-network-pruning/repense/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jeferson/repo/rethinking-network-pruning/cifar/l1-norm-pruning/models/vgg.py", line 56, in forward
    x = self.feature(x)
  File "/home/jeferson/repo/rethinking-network-pruning/repense/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jeferson/repo/rethinking-network-pruning/repense/lib/python3.6/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/home/jeferson/repo/rethinking-network-pruning/repense/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jeferson/repo/rethinking-network-pruning/repense/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 282, in forward
    self.padding, self.dilation, self.groups)
  File "/home/jeferson/repo/rethinking-network-pruning/repense/lib/python3.6/site-packages/torch/nn/functional.py", line 90, in conv2d
    return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED

Am I doing something wrong?

Eric-mingjie commented 4 years ago

Are you using torch v0.3.1?

jefersonf commented 4 years ago

> Are you using torch v0.3.1?

Yes! I created a virtual environment and installed the required dependencies in it. When I tried to run the above command, I ran into a problem with CUDA_VERSION:

... requires CUDA_VERSION >= 9000 for
optimal performance and fast startup time, but your PyTorch was compiled
with CUDA_VERSION 8000. Please install the correct PyTorch binary
using instructions from http://pytorch.org 

So I went to pytorch.org and got the following PyTorch build: https://download.pytorch.org/whl/cu90/torch-0.3.1-cp36-cp36m-linux_x86_64.whl With it, the script starts to run, but then the above error occurs.
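
For anyone decoding the numbers in that message: CUDA_VERSION uses the CUDA toolkit's integer encoding, major * 1000 + minor * 10, so 8000 means CUDA 8.0 and 9000 means CUDA 9.0. A minimal sketch of the conversion follows; the helper name `decode_cuda_version` is hypothetical and is not part of PyTorch or this repository.

```python
def decode_cuda_version(code):
    """Convert an integer CUDA_VERSION (major * 1000 + minor * 10) to 'major.minor'."""
    major, rest = divmod(code, 1000)  # e.g. 9020 -> (9, 20)
    minor = rest // 10                # e.g. 20 -> 2
    return "{}.{}".format(major, minor)


# The two values from the message above, plus the CUDA 10.0 encoding.
for code in (8000, 9000, 10000):
    print(code, "->", decode_cuda_version(code))
```

Read this way, the message says the binary was compiled for CUDA 8.0 while something in the environment expected 9.0 or newer.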

Eric-mingjie commented 4 years ago

Not sure what your problem is. Does running resnet also cause the same problem? If so, then it may be something related to your environment.

jefersonf commented 4 years ago

> Not sure what your problem is. Does running resnet also cause the same problem? If so, then it may be something related to your environment.

The same thing happens. I think the problem is related to my CUDA version, which is currently 10.0.
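
One way to check a suspicion like this is to print what PyTorch reports about itself, in particular the CUDA version the binary was built against. The snippet below is only a sketch: the function name `describe_torch_env` is made up for illustration, and it returns a placeholder when PyTorch cannot be imported.

```python
def describe_torch_env():
    """Collect the version details PyTorch reports about itself."""
    try:
        import torch
    except ImportError:
        return {"torch": None}  # PyTorch is not installed in this environment

    cuda_ok = torch.cuda.is_available()
    return {
        "torch": torch.__version__,
        # CUDA version the binary was compiled against (None for CPU-only builds)
        "built_with_cuda": getattr(torch.version, "cuda", None),
        "cuda_available": cuda_ok,
        # cuDNN version is only queried when a GPU is usable
        "cudnn": torch.backends.cudnn.version() if cuda_ok else None,
    }


for key, value in describe_torch_env().items():
    print(key, "=", value)
```

The system-wide toolkit and driver versions are reported separately by `nvcc --version` and `nvidia-smi`, so comparing those with the output above shows whether the two sides disagree.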

Eric-mingjie commented 4 years ago

Maybe the CUDA version is the cause.

jefersonf commented 4 years ago

When I use torch v0.4.0, despite some ~errors~ warnings like the ones shown below, training apparently starts normally.

rethinking-network-pruning/cifar/l1-norm-pruning$ python main.py --dataset cifar10 --arch vgg --depth 16
Files already downloaded and verified
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=844 error=11 : invalid argument
main.py:127: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  avg_loss += loss.data[0]
main.py:135: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  100. * batch_idx / len(train_loader), loss.data[0]))
Train Epoch: 0 [0/50000 (0.0%)] Loss: 2.304134
Train Epoch: 0 [6400/50000 (12.8%)] Loss: 2.072424
Train Epoch: 0 [12800/50000 (25.6%)]    Loss: 1.590183
Train Epoch: 0 [19200/50000 (38.4%)]    Loss: 1.610837
Train Epoch: 0 [25600/50000 (51.2%)]    Loss: 1.439363
Train Epoch: 0 [32000/50000 (63.9%)]    Loss: 1.510963
Train Epoch: 0 [38400/50000 (76.7%)]    Loss: 1.700880
Train Epoch: 0 [44800/50000 (89.5%)]    Loss: 1.284431
main.py:144: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  data, target = Variable(data, volatile=True), Variable(target)
main.py:146: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  test_loss += F.cross_entropy(output, target, size_average=False).data[0] # sum up batch loss

Test set: Average loss: 1.1723, Accuracy: 5790/10000 (57.0%)

Train Epoch: 1 [0/50000 (0.0%)] Loss: 1.202878
Train Epoch: 1 [6400/50000 (12.8%)] Loss: 1.249835
Train Epoch: 1 [12800/50000 (25.6%)]    Loss: 1.410513
Train Epoch: 1 [19200/50000 (38.4%)]    Loss: 1.196959
...
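
The `UserWarning` lines in the log above come from `loss.data[0]`, which PyTorch 0.4 still accepts but, as the message says, later versions will reject. A small compatibility helper is one way to keep the script running on both 0.3.1 and 0.4+. This is a sketch, not the repository's code: it assumes `.item()` is absent on 0.3.x Variables, and `scalar_value` and the two stand-in classes are hypothetical names used so the snippet runs without PyTorch installed.

```python
def scalar_value(t):
    """Return a Python number from a one-element tensor-like object."""
    if hasattr(t, "item"):
        return t.item()      # PyTorch >= 0.4: the call the warning recommends
    return t.data[0]         # older style used in main.py (loss.data[0])


class NewStyleLoss:
    """Stand-in for a 0-dim loss tensor (PyTorch >= 0.4)."""
    def item(self):
        return 2.304134


class OldStyleLoss:
    """Stand-in for a 0.3.x Variable wrapping a one-element tensor."""
    data = [2.304134]


# Accumulate a running loss the way main.py does, using either style.
avg_loss = 0.0
avg_loss += scalar_value(NewStyleLoss())
avg_loss += scalar_value(OldStyleLoss())
print(avg_loss)
```

In `main.py` this would mean replacing `avg_loss += loss.data[0]` with `avg_loss += scalar_value(loss)`. The `volatile` warning is a separate matter; its own message points to wrapping evaluation in `with torch.no_grad():`.
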

Eric-mingjie commented 4 years ago

Okay. Then it may be that CUDA 10.0 is more suitable for torch v0.4.0.

songheony commented 4 years ago

If you want to use CUDA >= 9.2, the code should be changed for PyTorch 0.4.1. My converted version of the source code is at https://github.com/songheony/rethinking-network-pruning and I've tested it with PyTorch 1.2 and CUDA 10.

jefersonf commented 4 years ago

> If you want to use CUDA >= 9.2, the code should be changed for PyTorch 0.4.1. My converted version of the source code is at https://github.com/songheony/rethinking-network-pruning and I've tested it with PyTorch 1.2 and CUDA 10.

I'll check it out!