dearleiii / PIRM-2018-SISR-Challenge

Super Resolution
https://www.pirm2018.org/PIRM-SR.html

DataParallel for multiple GPUs #7

Closed dearleiii closed 6 years ago

dearleiii commented 6 years ago

Running `save_model.py` with two GPUs visible fails during CUDA initialization with an out-of-memory error:

```
leichen@gpu-compute6$ export CUDA_VISIBLE_DEVICES=0,1
leichen@gpu-compute6$ python3 save_model.py
Let's use: 2 GPUs!
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCTensorRandom.cu line=25 error=2 : out of memory
Traceback (most recent call last):
  File "save_model.py", line 34, in <module>
    approximator = nn.DataParallel(approximator)
  File "/home/home2/leichen/.local/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 102, in __init__
    _check_balance(self.device_ids)
  File "/home/home2/leichen/.local/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 17, in _check_balance
    dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
  File "/home/home2/leichen/.local/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 17, in <listcomp>
    dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
  File "/home/home2/leichen/.local/lib/python3.5/site-packages/torch/cuda/__init__.py", line 290, in get_device_properties
    init()  # will define _get_device_properties and _CudaDeviceProperties
  File "/home/home2/leichen/.local/lib/python3.5/site-packages/torch/cuda/__init__.py", line 143, in init
    _lazy_init()
  File "/home/home2/leichen/.local/lib/python3.5/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCTensorRandom.cu:25
```
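This OOM fires at CUDA context creation, before any model tensors are allocated, which suggests one of the visible GPUs is already full from other processes. A minimal sketch of one common workaround (not from this thread) is to hide the busy GPUs via `CUDA_VISIBLE_DEVICES` before the CUDA runtime initializes; the device IDs `"1,2"` below are an example and would need to match whatever `nvidia-smi` shows as free:

```python
import os

# Must run before the first CUDA call (safest: before importing torch),
# because the CUDA runtime reads this variable once at initialization.
# "1,2" is a placeholder; choose GPUs that nvidia-smi reports as idle.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

# Inside this process, cuda:0 now maps to physical GPU 1 and cuda:1 to
# physical GPU 2, so DataParallel's default device_ids use the idle GPUs.
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Setting the variable in the shell with `export` (as above) works the same way; the in-process version is just convenient when the script picks devices itself.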

dearleiii commented 6 years ago

The same out-of-memory error also occurs later, when moving the model with `approximator.to(device)`:

```
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "save_model.py", line 36, in <module>
    approximator.to(device)
  File "/home/home2/leichen/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 393, in to
    return self._apply(lambda t: t.to(device))
  File "/home/home2/leichen/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/home2/leichen/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/home2/leichen/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/home2/leichen/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 182, in _apply
    param.data = fn(param.data)
  File "/home/home2/leichen/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 393, in <lambda>
    return self._apply(lambda t: t.to(device))
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
leichen@gpu-compute6$
```

dearleiii commented 6 years ago

Solution code:

```python
device = torch.device("cuda:0")
gpu_list = list(range(0, torch.cuda.device_count()))
approximator = torch.nn.DataParallel(APXM_edsr(), device_ids=[0, 1, 2, 3, 4])
print("cuda.current_device=", torch.cuda.current_device())
print(approximator)
```
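For reference, `nn.DataParallel` replicates the module on each listed device and splits the input batch along dimension 0. A pure-Python sketch of that splitting step (a hypothetical helper named `scatter_batch`, mirroring tensor-chunking semantics, not the actual PyTorch implementation):

```python
def scatter_batch(batch, device_ids):
    """Split a batch into near-equal chunks, one per device (dim-0 split).

    Hypothetical helper mimicking how nn.DataParallel scatters inputs:
    chunk size is ceil(len(batch) / n_devices); trailing empty chunks
    are dropped when the batch is smaller than the device count.
    """
    n = len(device_ids)
    size = -(-len(batch) // n)  # ceil division
    chunks = [batch[i * size:(i + 1) * size] for i in range(n)]
    return [c for c in chunks if c]

# A batch of 10 samples across the 5 devices used above -> 5 chunks of 2.
print(scatter_batch(list(range(10)), device_ids=[0, 1, 2, 3, 4]))
# -> [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

This is why a per-GPU memory spike only appears on device 0 here: gradients and outputs are gathered back to the first device in `device_ids`, so it carries extra load even with an even batch split.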

dearleiii commented 6 years ago

Data parallelism across the GPUs is working: `nvidia-smi` shows the training process (PID 10332) running on GPUs 0-3, with the largest allocation on GPU 0.

```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1356    C   /usr/bin/cuda_sensor                            60MiB |
|    0      9171    C   python3                                        305MiB |
|    0     10332    C   python3                                       2225MiB |
|    0     29879    C   python3                                         82MiB |
|    0     29955    C   python3                                        305MiB |
|    1      1356    C   /usr/bin/cuda_sensor                            60MiB |
|    1     10140    C   python3                                        305MiB |
|    1     10332    C   python3                                        688MiB |
|    2      1356    C   /usr/bin/cuda_sensor                            60MiB |
|    2     10332    C   python3                                        688MiB |
|    3      1356    C   /usr/bin/cuda_sensor                            60MiB |
|    3     10332    C   python3                                        143MiB |
|    4      1356    C   /usr/bin/cuda_sensor                            60MiB |
|    5      1356    C   /usr/bin/cuda_sensor                            60MiB |
|    6      1356    C   /usr/bin/cuda_sensor                            60MiB |
|    7      1356    C   /usr/bin/cuda_sensor                            60MiB |
+-----------------------------------------------------------------------------+
```
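To sanity-check the spread programmatically, the process table can be reduced to a per-GPU total. A small sketch (the `per_gpu_memory` helper is hypothetical and assumes the process-table row layout shown above):

```python
def per_gpu_memory(smi_text):
    """Sum process memory (MiB) per GPU index from an nvidia-smi process table."""
    totals = {}
    for line in smi_text.splitlines():
        parts = line.strip().strip("|").split()
        # A process row looks like: GPU  PID  Type  Process-name  Usage
        if len(parts) >= 5 and parts[0].isdigit() and parts[-1].endswith("MiB"):
            gpu = int(parts[0])
            totals[gpu] = totals.get(gpu, 0) + int(parts[-1][:-3])
    return totals

# Two GPUs' worth of rows from the table above:
sample = """\
|    1      1356    C   /usr/bin/cuda_sensor                60MiB |
|    1     10332    C   python3                            688MiB |
|    2      1356    C   /usr/bin/cuda_sensor                60MiB |
"""
print(per_gpu_memory(sample))  # -> {1: 748, 2: 60}
```

Header, separator, and border rows are skipped automatically because their first token is not a bare GPU index.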