Spandan-Madan / Pytorch_fine_tuning_Tutorial

A short tutorial on performing fine-tuning or transfer learning in PyTorch.

error with GPU #2

Closed: HIN0209 closed this issue 7 years ago

HIN0209 commented 7 years ago

Hello,

It is a great post. I used my custom 2-class dataset and it worked well with resnet50 on the CPU. However, with 1 or 2 GPU(s), I got the following error. Please advise.

Setting: Ubuntu 14.04, Python 3.5.4, PyTorch 0.2 via conda (same error with 0.1.12_2 via pip), CUDA 8.0.61, NVIDIA GeForce GTX 1080 Ti x2


[printed input batch elided]
[torch.FloatTensor of size 16x3x224x224]

1 0 1 1 1 1 1 1 0 1 1 0 0 0 1 0 [torch.LongTensor of size 16]

Traceback (most recent call last):
  File "main.py", line 260, in <module>
    num_epochs=100)
  File "main.py", line 177, in train_model
    outputs = model(inputs)
  File "/home/owner/anaconda3/envs/pytorch_ssd/lib/python3.5/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/owner/anaconda3/envs/pytorch_ssd/lib/python3.5/site-packages/torchvision/models/resnet.py", line 139, in forward
    x = self.conv1(x)
  File "/home/owner/anaconda3/envs/pytorch_ssd/lib/python3.5/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/owner/anaconda3/envs/pytorch_ssd/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 237, in forward
    self.padding, self.dilation, self.groups)
  File "/home/owner/anaconda3/envs/pytorch_ssd/lib/python3.5/site-packages/torch/nn/functional.py", line 40, in conv2d
    return f(input, weight, bias)
TypeError: argument 0 is not a Variable

Spandan-Madan commented 7 years ago

One possible reason is that I wrote this on PyTorch 0.1.12, and you might be on 0.2, which was released recently. I tried looking up forums for the error you mentioned but didn't find much. Basically, the error is saying that argument 0, i.e. the input passed to conv2d, is not a PyTorch Variable.

First up, your input and target are not on CUDA, which doesn't make sense to me. Did you change the file or code? The error even mentions a file main.py, which does not exist in this repository. If you did change the code, I think it's breaking on these lines, and you can try debugging this part.
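For example, here is a minimal check you could drop in right before the forward pass (a sketch against the 0.x-era API; inputs is the loop variable in the tutorial's training function):

    from torch.autograd import Variable

    # What actually reaches the model? When use_gpu is True it should be a
    # Variable wrapping a torch.cuda.FloatTensor.
    print(type(inputs))                # expect torch.autograd.Variable, not a raw Tensor
    if isinstance(inputs, Variable):
        print(inputs.data.type())      # expect 'torch.cuda.FloatTensor'
    outputs = model(inputs)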

HIN0209 commented 7 years ago

Hello, thank you for your input. First, sorry for the confusion: I renamed the file to main.py just to save myself some typing, without changing its contents.

I tried several datasets that worked fine with the CPU but not with the GPU, so the issue is not the dataset. PyTorch 0.1.12.post2 (via pip) yielded the same result.

Possibly unrelated, but I just found that the function "torch.cuda.set_device(device)" is discouraged. http://pytorch.org/docs/master/cuda.html
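For reference, the alternative those docs point to is the torch.cuda.device context manager (a sketch, assuming the 0.x-era tensor API):

    import torch

    # Scope the GPU selection instead of setting it globally with set_device():
    with torch.cuda.device(1):
        x = torch.FloatTensor(16, 3, 224, 224).cuda()   # allocated on GPU 1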

Spandan-Madan commented 7 years ago

I'm sure set_device() isn't the problem in this case, as I routinely use it in my research work! But thanks for pointing this out; I'll switch to the suggested alternative. I'll look into this bug. I just tried the script on my GPU and it works, so I'll keep playing around and see if I run into any other error.

Since you haven't changed any code, as you said, I'm sure there is some bug. What you posted is the complete error trace, right? If not, can you post that too? Thanks!

Spandan-Madan commented 7 years ago

Another thing you should check is whether both the input and target are float64. Sometimes one of the two can be float32, and that mismatch can lead to errors. This is dataset-specific, so I can't check it myself; you'll have to try it.

In case there is a difference, change the type to float64 and see if the error persists.
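A quick way to check (a sketch against the old Tensor API; inputs/labels as in the training loop):

    # Print the concrete tensor types of one batch:
    print(inputs.type(), labels.type())   # e.g. 'torch.FloatTensor' 'torch.LongTensor'

    # Cast the input explicitly if needed; .double() is float64 in Torch terms.
    inputs = inputs.double()
    # Note: classification targets normally stay LongTensor, as the printed batch shows.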

HIN0209 commented 7 years ago

How quick!! Thanks again for the response.

(1) The full message from executing "python main_fine_tuning.py" is shown below.
(2) My CUDA works well, at least on a sample code: "PyTorch: Tensors" in https://github.com/jcjohnson/pytorch-examples
(3) nvidia-smi shows Driver Version: 375.74.
(4) Changing from float() to float64() showed the same error:

    if use_gpu:
        try:
            inputs, labels = Variable(inputs.float64().cuda()), Variable(labels.long().cuda())

-------------------------------------error message below--------------

Epoch 0/99

LR is set to 0.001

[printed input batch elided]
[torch.FloatTensor of size 16x3x224x224]

0 0 0 0 0 1 0 0 1 1 0 1 0 1 0 1 [torch.LongTensor of size 16]

Traceback (most recent call last):
  File "main_fine_tuning.py", line 260, in <module>
    num_epochs=100)
  File "main_fine_tuning.py", line 177, in train_model
    outputs = model(inputs)
  File "/home/owner/anaconda3/envs/pytorch_ssd/lib/python3.5/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/owner/anaconda3/envs/pytorch_ssd/lib/python3.5/site-packages/torchvision/models/resnet.py", line 139, in forward
    x = self.conv1(x)
  File "/home/owner/anaconda3/envs/pytorch_ssd/lib/python3.5/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/owner/anaconda3/envs/pytorch_ssd/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 237, in forward
    self.padding, self.dilation, self.groups)
  File "/home/owner/anaconda3/envs/pytorch_ssd/lib/python3.5/site-packages/torch/nn/functional.py", line 40, in conv2d
    return f(input, weight, bias)
TypeError: argument 0 is not a Variable

HIN0209 commented 7 years ago

Just a follow-up: I used part of the Dogs vs. Cats dataset from Kaggle.

HIN0209 commented 7 years ago

I solved it. Checking against another fine-tuning repo (below), I replaced the following lines from your original and it worked. Does it make sense??

https://github.com/meliketoy/fine-tuning.pytorch

Original (not working):

    if use_gpu:
        try:
            inputs, labels = Variable(inputs.float().cuda()), Variable(labels.long().cuda())

New (and working):

    if use_gpu:
        try:
            inputs, labels = Variable(inputs.cuda()), Variable(labels.cuda())
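For context, the printed batch above shows the loader already yields a torch.FloatTensor input and a torch.LongTensor label, and .cuda() preserves each tensor's dtype, so the explicit casts appear redundant. A minimal sketch of the working transfer, assuming the 0.x-era Variable API and the tutorial's loop names:

    from torch.autograd import Variable

    for inputs, labels in dataloader:   # FloatTensor 16x3x224x224, LongTensor 16
        if use_gpu:
            # .cuda() keeps the dtype; no extra .float()/.long() casts needed
            inputs, labels = Variable(inputs.cuda()), Variable(labels.cuda())
        else:
            inputs, labels = Variable(inputs), Variable(labels)
        outputs = model(inputs)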

HIN0209 commented 7 years ago

Now that I'm enjoying how fast the system runs on the GPU, one request if I may: how can the long, repetitive log messages be deleted or hidden? I miss more important info because of them. Thanks!

"loss backward done loss backward done optim loss done loss backward done loss backward ..."

Spandan-Madan commented 7 years ago

Glad that you solved it! I will add a note to the README that people can comment out the print statements if they want to use this script directly. Since this is a tutorial meant to help people understand what's happening, rather than a script to plug and play without understanding, I'll leave the print statements in :)
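If someone does want to quiet them without deleting lines, one light-touch option is a small guard (a sketch with a hypothetical VERBOSE flag, not something in the repo):

    VERBOSE = False   # hypothetical flag; set True to restore per-batch progress prints

    def log(msg):
        if VERBOSE:
            print(msg)

    # e.g. replace print('loss backward done') with:
    log('loss backward done')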

Closing the issue since it's solved! :)