JiaRenChang / PSMNet

Pyramid Stereo Matching Network (CVPR2018)
MIT License

CUDA runtime error, how to reduce load on GPUs? #115

Closed: kjgonzalez closed this issue 5 years ago

kjgonzalez commented 5 years ago

I'm having trouble getting the code running because I keep hitting the same error. The raw input/output is as follows:

# python main.py --maxdisp 192 --model stackhourglass --datapath /shared_data/SCENEFLOW_FREIBURG/ --epochs 10
python version: 2
Number of model parameters: 5224768
This is 1-th epoch
0.001
[several long "Unexpected end of /proc/mounts line ..." warnings from the Docker overlay filesystem omitted; they are unrelated to the crash below]
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7fe6f7730190>> ignored
Traceback (most recent call last):
  File "main.py", line 184, in <module>
    main()
  File "main.py", line 152, in main
    loss = train(imgL_crop,imgR_crop, disp_crop_L)
  File "main.py", line 95, in train
    output1, output2, output3 = model(imgL,imgR)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 114, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

I have trained other networks on this system successfully, so I suspect it has to do with how the work is divided and placed onto the GPUs. I'm new to this part, so: how should I choose batch_size, num_workers, and other parameters so that training needs less memory at once? Sorry, I'm not entirely sure what I'm asking; I just want to avoid the code crashing with out-of-memory errors (see the sketch below).

edit: thanks in advance
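
A quick way to see how much headroom each card has before picking a batch size is to query the CUDA memory counters; a minimal sketch, assuming the torch.cuda API of the 0.4.x line used in this thread:

import torch

# Print per-GPU totals and what this process has currently allocated,
# so batch_size can be chosen against the actual headroom.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    total_gb = props.total_memory / 1024.0 ** 3
    used_gb = torch.cuda.memory_allocated(i) / 1024.0 ** 3
    print("GPU %d (%s): %.1f GB total, %.2f GB allocated by this process"
          % (i, props.name, total_gb, used_gb))

Note that batch_size is the main lever on GPU memory, while num_workers only controls data-loading processes and affects host RAM/CPU, not the GPUs.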

JiaRenChang commented 5 years ago

@m3rcury6 Our network was trained with a batch size of 12 on four 12 GB GPUs. A pair of images of size 256x512 consumes about 4 GB of GPU memory. You can reduce the batch size depending on your device; my suggested range is 4~16.
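
To put those numbers against a two-GPU machine, a rough back-of-envelope sketch (the 4 GB/pair figure is taken from the comment above; the usable memory per card is an assumption to adjust to your hardware):

# Rough arithmetic, not a measurement: with nn.DataParallel the batch is
# split across GPUs, so each card holds roughly batch_size / n_gpus pairs.
mem_per_pair_gb = 4.0   # ~4 GB per 256x512 pair, per the comment above
n_gpus = 2              # the setup reported in this issue
card_mem_gb = 11.0      # assumed usable memory per card (adjust to your GPUs)

pairs_per_card = int(card_mem_gb // mem_per_pair_gb)   # pairs that fit on one card
print("largest batch likely to fit: %d" % (pairs_per_card * n_gpus))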

kjgonzalez commented 5 years ago

Neither 4 nor 16 worked for me. I'm also trying to play with the number of workers, but nothing helps. I have two GPUs and CUDA 9.0 installed, if that helps. Also, I've suppressed the user warnings by specifying "align_corners=False" in "stackhourglass.py" and "submodule.py"; I'm not sure that matches the original intent. The warning was:

/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py:1749: UserWarning: 
Default upsampling behavior when mode=trilinear is changed to align_corners=False 
since 0.4.0. Please specify align_corners=True if the old behavior is desired. See 
the documentation of nn.Upsample for details.
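
For reference, that warning comes from the trilinear upsampling of the cost volume. Passing the flag explicitly silences it either way; the warning text itself says align_corners=True reproduces the pre-0.4 behaviour, so which value matches the released weights depends on which PyTorch version the authors trained with. A standalone sketch with a dummy cost volume (not the exact call from stackhourglass.py):

import torch
import torch.nn.functional as F

cost = torch.randn(1, 1, 48, 64, 128)       # dummy 5-D cost volume (N, C, D, H, W)
up = F.upsample(cost, size=(192, 256, 512), mode='trilinear',
                align_corners=False)        # explicit flag, so no UserWarning
print(up.shape)                             # torch.Size([1, 1, 192, 256, 512])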

Anyway, I changed this while trying to train:

TrainImgLoader = torch.utils.data.DataLoader(
    DA.myImageFloder(all_left_img, all_right_img, all_left_disp, True),
    batch_size=4, shuffle=True, num_workers=0, drop_last=False)

TestImgLoader = torch.utils.data.DataLoader(
    DA.myImageFloder(test_left_img, test_right_img, test_left_disp, False),
    batch_size=4, shuffle=False, num_workers=0, drop_last=False)

and I get this error:

# python main.py --maxdisp 192 --model stackhourglass --datapath /shared_data/SCENEFLOW_FREIBURG/ --epochs 10
Number of model parameters: 5224768
This is 1-th epoch
0.001
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "main.py", line 193, in <module>
    main()
  File "main.py", line 161, in main
    loss = train(imgL_crop,imgR_crop, disp_crop_L)
  File "main.py", line 104, in train
    output1, output2, output3 = model(imgL,imgR)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/aserver/attempt2_psmnet/PSMNet/models/stackhourglass.py", line 152, in forward
    pred3 = F.softmax(cost3,dim=1)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 862, in softmax
    return torch._C._nn.softmax(input, dim)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

Is there a way to run this network with some kind of "low end" settings?
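
Not an official mode of the repo, but the usual levers, in roughly descending order of effect: run on a single card so nn.DataParallel (which main.py wraps the model in, per the traceback) has nothing to replicate, drop batch_size to 1 and num_workers to 0 in the two DataLoaders, and only then look at shrinking the training crop inside the dataloader. A sketch of the single-GPU part; the environment variable must be set before CUDA is initialised:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only the first card to PyTorch

import torch                               # import after setting the variable
print("visible GPUs: %d" % torch.cuda.device_count())   # should now report 1

# With one GPU visible, batch_size=1 / num_workers=0 in TrainImgLoader and
# TestImgLoader gives the smallest footprint this script can run with.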

kjgonzalez commented 5 years ago

Hey, just wanted to post an update. I had accidentally installed torchvision 0.2.2, which I finally realized might have been affecting training, so I installed 0.2.0 instead. Additionally, I set batch_size to 4 and num_workers to 0 in main.py. I then tried everything again and it seems to have worked! Thanks for the replies; these look like mistakes on my side.
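
For anyone landing here later, a quick environment sanity check before re-launching training; the torchvision 0.2.0 pin simply mirrors what worked in this thread, not an official requirement:

import torch
import torchvision

print("torch: %s" % torch.__version__)
print("torchvision: %s" % torchvision.__version__)   # 0.2.0 worked here; 0.2.2 did not
print("CUDA available: %s" % torch.cuda.is_available())
print("visible GPUs: %d" % torch.cuda.device_count())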