ProGamerGov / neural-style-pt

PyTorch implementation of neural style transfer algorithm
MIT License

RuntimeError: CUDA error: invalid device ordinal with starry_stanford.sh #70

Open · gateway opened this issue 4 years ago

gateway commented 4 years ago

Hi, I'm trying to run the script above to see if my system can handle creating larger images with it.

I added -optimizer adam and am using the NIN model for lower-memory GPUs.

Here is my output, which eventually fails:

RuntimeError: CUDA error: invalid device ordinal
NIN Architecture Detected
Successfully loaded models/nin_imagenet.pth
conv1: 96 3 11 11
cccp1: 96 96 1 1
cccp2: 96 96 1 1
conv2: 256 96 5 5
cccp3: 256 256 1 1
cccp4: 256 256 1 1
conv3: 384 256 3 3
cccp5: 384 384 1 1
cccp6: 384 384 1 1
conv4-1024: 1024 384 3 3
cccp7-1024: 1024 1024 1 1
cccp8-1024: 1000 1024 1 1
Traceback (most recent call last):
  File "/home/gateway/work/neural-style-software/neural-style-pt/neural_style.py", line 468, in <module>
    main()
  File "/home/gateway/work/neural-style-software/neural-style-pt/neural_style.py", line 157, in main
    net = setup_multi_device(net)
  File "/home/gateway/work/neural-style-software/neural-style-pt/neural_style.py", line 328, in setup_multi_device
    new_net = ModelParallel(net, params.gpu, params.multidevice_strategy)
  File "/home/gateway/work/neural-style-software/neural-style-pt/CaffeLoader.py", line 110, in __init__
    self.chunks = self.chunks_to_devices(self.split_net(net, device_splits.split(',')))
  File "/home/gateway/work/neural-style-software/neural-style-pt/CaffeLoader.py", line 134, in chunks_to_devices
    chunk.to(self.device_list[i])
  File "/home/gateway/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 426, in to
    return self._apply(convert)
  File "/home/gateway/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 202, in _apply
    module._apply(fn)
  File "/home/gateway/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 224, in _apply
    param_applied = fn(param)
  File "/home/gateway/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 424, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal

https://github.com/ProGamerGov/neural-style-pt/blob/master/examples/scripts/starry_stanford.sh

Nvidia info

btw, I'm using GPU 1 since it has the most memory and isn't driving the primary display.

(base) gateway@gateway-media:~/work/neural-style-software/neural-style-pt$ nvidia-smi
Tue May  5 14:34:21 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   54C    P8     4W / 120W |    195MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:02:00.0 Off |                  N/A |
| 21%   56C    P8     6W / 180W |      2MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2161      G   /usr/lib/xorg/Xorg                           101MiB |
|    0      2661      G                                                 11MiB |
|    0      2856      G   /usr/bin/gnome-shell                          77MiB |
+-----------------------------------------------------------------------------+
(base) gateway@gateway-media:~/work/neural-style-software/neural-style-pt$ 

thoughts?

ProGamerGov commented 4 years ago

First, you should check that PyTorch sees your devices correctly and that CUDA works. Try running this in the Python interpreter and see what it shows:

import torch
torch.__version__ # Get PyTorch and CUDA version
torch.cuda.is_available() # Check that CUDA works
torch.cuda.device_count() # Check how many CUDA capable devices you have

# Print device human readable names
torch.cuda.get_device_name(0)
torch.cuda.get_device_name(1)
# Add more lines, e.g. get_device_name(2), get_device_name(3), if you have more devices.
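
If you have several devices, a short loop (just a convenience, using the same calls as above) prints them all:

import torch
# Print the index and human-readable name of every CUDA device PyTorch can see
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))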

If the devices exist and CUDA works, then it's probably just an issue with the ID you are using. CUDA can sometimes be a bit weird about how it assigns GPU IDs: https://stackoverflow.com/questions/13781738/how-does-cuda-assign-device-ids-to-gpus

You can fix the GPU device order by putting CUDA_DEVICE_ORDER=PCI_BUS_ID before the command:

CUDA_DEVICE_ORDER=PCI_BUS_ID python3 neural_style.py
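
To confirm what the variable does, you can print the device names it produces; with PCI bus ordering they should match the nvidia-smi order (a quick check, assuming PyTorch is installed in the same environment):

# Device 0 should now be whatever nvidia-smi lists as GPU 0
CUDA_DEVICE_ORDER=PCI_BUS_ID python3 -c "import torch; print(torch.cuda.get_device_name(0)); print(torch.cuda.get_device_name(1))"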

You can also use CUDA_VISIBLE_DEVICES before the command to make sure that PyTorch can only see the specified device:

# Only make GPU ID 1 visible to PyTorch
CUDA_VISIBLE_DEVICES=1 python3 neural_style.py
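
A similar one-liner shows which device ends up as GPU 0 when only one is visible (again, just a sanity check):

# Should print the name of the GPU that nvidia-smi lists as GPU 1
CUDA_VISIBLE_DEVICES=1 python3 -c "import torch; print(torch.cuda.get_device_name(0))"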

gateway commented 4 years ago

ahh, I never knew that about PyTorch; it seems that the device IDs are swapped compared to what nvidia-smi shows.

>>> torch.cuda.get_device_name(0)
'GeForce GTX 1080'
>>> torch.cuda.get_device_name(1)
'GeForce GTX 1060 6GB'

hmm, so in my case, maybe adding CUDA_DEVICE_ORDER=0 python3 neural_style.py would be the 1060, and CUDA_DEVICE_ORDER=1 python3 neural_style.py would be the 1080?

Should I make any changes to the GPU value in the script? Thanks for your timely response. btw, has anyone used your version of style transfer for video?

ProGamerGov commented 4 years ago

The invalid device ordinal error is normally given when you specify a non-existent GPU ID.

The GPU value in the script should be set to the PyTorch GPU ID that you want to use; PyTorch already shows the device you want to use as having an ID of 0. The order and GPU IDs available to PyTorch will change based on the CUDA environment variables you specify.

CUDA_DEVICE_ORDER=PCI_BUS_ID will swap the GPU order if the existing order is not based on the PCI Bus order.

CUDA_VISIBLE_DEVICES=1 will make GPU 0 in PyTorch be your second GPU.

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 will only give PyTorch the second GPU device based on the PCI bus order, but that second GPU will be listed as GPU 0, so you'll need to use -gpu 0.
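
For example, to run on just the GTX 1080 (which nvidia-smi lists as the second device on the PCI bus), the command would look something like the sketch below; you would still need to adjust any -gpu values inside starry_stanford.sh accordingly:

# Hide the 1060, expose only the 1080, and tell the script to use device 0
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 python3 neural_style.py -gpu 0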

btw has anyone used your version of style transfer for video?

Yes, but those individuals tend to use techniques like rotoscoping to avoid the flickering effect. I'm not knowledgeable enough yet to translate artistic-videos to PyTorch. But it should be easier for someone who better understands the video aspect of the code, as both artistic-videos and neural-style-pt are based on the same original code (neural-style).

ProGamerGov commented 4 years ago

Basically, this is what neural-style-pt does with GPU IDs (example in the Python interpreter):

import torch
a = torch.randn(3)
a.to('cpu') # Puts tensor 'a' on the CPU if it wasn't already
a.to('cuda:0') # Puts tensor 'a' on device 0
a.to('cuda:1') # Puts tensor 'a' on device 1

When I specify a valid GPU, I get something like this:

>>> a.to('cuda:0')
tensor([ 0.8459, -0.2027,  0.6153], device='cuda:0')

And when I specify a GPU that doesn't exist on my computer, I get this:

>>> a.to('cuda:1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: invalid device ordinal
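
If you want to guard against this in your own code, one option is to check torch.cuda.device_count() before moving anything. Here is a small, hypothetical helper (not part of neural-style-pt) that falls back to the CPU when the requested ID doesn't exist:

import torch

def pick_device(gpu_id):
    # Use the requested CUDA device only if it actually exists;
    # otherwise fall back to the CPU instead of raising
    # "RuntimeError: CUDA error: invalid device ordinal".
    if torch.cuda.is_available() and gpu_id < torch.cuda.device_count():
        return torch.device('cuda:{}'.format(gpu_id))
    return torch.device('cpu')

a = torch.randn(3)
a = a.to(pick_device(1))  # cuda:1 if available, otherwise cpu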