Closed kjgonzalez closed 5 years ago
@m3rcury6 Our network was trained with batch size 12 on 4 x 12 GB GPUs. A pair of images of size 256x512 consumes about 4 GB of GPU memory. You can modify the batch size depending on your device; my suggestion is a batch size of 4~16.
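Plugging the numbers above into a quick estimate can help pick a batch size for a different machine. This is only a back-of-the-envelope sketch based on the ~4 GB-per-pair figure quoted here, with an assumed (not measured) reserve for model weights and other allocations:

```python
def max_batch_size(num_gpus, gb_per_gpu, gb_per_pair=4.0, reserve_gb=2.0):
    """Rough upper bound on batch size for data-parallel training.

    gb_per_pair and reserve_gb are assumptions, not values measured
    from the PSMNet code: ~4 GB per 256x512 image pair, minus some
    headroom per GPU for weights and miscellaneous allocations.
    """
    usable = num_gpus * (gb_per_gpu - reserve_gb)
    return int(usable // gb_per_pair)

# With 4 x 12 GB GPUs this lands near the batch size of 12 quoted above:
print(max_batch_size(num_gpus=4, gb_per_gpu=12))  # -> 10
# Two 12 GB GPUs suggest staying at the low end of the 4~16 range:
print(max_batch_size(num_gpus=2, gb_per_gpu=12))  # -> 5
```

The estimate is deliberately conservative; the real limit depends on crop size, maxdisp, and framework overhead, so treat it as a starting point and adjust empirically.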
Neither 4 nor 16 worked for me. I'm also trying to play with the number of workers, but nothing helps. I have two GPUs and CUDA 9.0 installed, if that helps. Also, I've suppressed the user warnings by specifying "align_corners=False" in "stackhourglass.py" and "submodule.py", though I'm not sure that matches the original intent. The warning was:
/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py:1749: UserWarning:
Default upsampling behavior when mode=trilinear is changed to align_corners=False
since 0.4.0. Please specify align_corners=True if the old behavior is desired. See
the documentation of nn.Upsample for details.
Anyway, I changed this while trying to train:
TrainImgLoader = torch.utils.data.DataLoader(
    DA.myImageFloder(all_left_img, all_right_img, all_left_disp, True),
    batch_size=4, shuffle=True, num_workers=0, drop_last=False)
TestImgLoader = torch.utils.data.DataLoader(
    DA.myImageFloder(test_left_img, test_right_img, test_left_disp, False),
    batch_size=4, shuffle=False, num_workers=0, drop_last=False)
and I get this error:
# python main.py --maxdisp 192 --model stackhourglass --datapath /shared_data/SCENEFLOW_FREIBURG/ --epochs 10
Number of model parameters: 5224768
This is 1-th epoch
0.001
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "main.py", line 193, in <module>
    main()
  File "main.py", line 161, in main
    loss = train(imgL_crop,imgR_crop, disp_crop_L)
  File "main.py", line 104, in train
    output1, output2, output3 = model(imgL,imgR)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/aserver/attempt2_psmnet/PSMNet/models/stackhourglass.py", line 152, in forward
    pred3 = F.softmax(cost3,dim=1)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 862, in softmax
    return torch._C._nn.softmax(input, dim)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
Is there a way to run this network with some kind of "low end" settings?
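On "low end" settings: besides batch size, memory in the stacked-hourglass model scales roughly linearly with the training crop size and with --maxdisp, because it builds a 4D cost volume of shape B x 2C x D/4 x H/4 x W/4 at quarter resolution. A back-of-the-envelope sketch (the shape follows the model's cost-volume construction; the absolute numbers are a lower bound, since activations of the 3D convolutions take considerably more memory):

```python
def cost_volume_bytes(batch, feat_channels, maxdisp, height, width,
                      bytes_per_float=4):
    """Size of a PSMNet-style 4D cost volume in bytes.

    The volume concatenates left/right features (2 * feat_channels)
    over maxdisp/4 disparity levels at 1/4 spatial resolution,
    stored as float32. Activation memory of the 3D convs that
    process this volume is NOT included.
    """
    return (batch * 2 * feat_channels * (maxdisp // 4)
            * (height // 4) * (width // 4) * bytes_per_float)

# One 256x512 pair at maxdisp 192 with 32-channel features:
mb = cost_volume_bytes(1, 32, 192, 256, 512) / 2**20
print(round(mb))  # -> 96 (MB, for the raw cost volume alone)
```

So halving the crop height, the crop width, or --maxdisp each cuts this term proportionally, which is why shrinking the training crop in the data loader is a common fallback on smaller GPUs.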
Hey, just wanted to post an update. I had accidentally installed torchvision 0.2.2, which I finally realized might have been affecting training, so I installed 0.2.0 instead. Additionally, I set batch_size to 4 and num_workers to 0 in main.py. I tried everything over again, and it seems to have worked! Thanks for the replies; it looks like the mistakes were on my side.
I'm having trouble getting the code running because I keep hitting the same error. The raw input/output is shown above.
I have trained other networks on this system successfully before, so I suspect it has to do with how the work is divided and placed onto the GPUs. I'm new to this part, so how can I choose batch_size, num_workers, and other parameters to make this training require less memory at once? Sorry, I'm not entirely sure what I'm asking; I just want to avoid this code crashing with out-of-memory errors.
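One generic way to choose batch_size is to try candidates from large to small, catch PyTorch's out-of-memory RuntimeError, and keep the first size that survives a full training step. This is a sketch of the pattern, not code from this repo: train_step is a hypothetical stand-in for one iteration of train() in main.py, and in real use you would rebuild the DataLoader for each candidate and call torch.cuda.empty_cache() between attempts.

```python
def find_max_batch(train_step, candidates=(16, 12, 8, 4, 2, 1)):
    """Return the largest batch size whose training step succeeds.

    train_step(batch_size) should run one forward/backward pass and
    raise RuntimeError containing 'out of memory' on CUDA OOM (the
    message PyTorch uses). Other RuntimeErrors are re-raised.
    """
    for bs in candidates:
        try:
            train_step(bs)
            return bs
        except RuntimeError as e:
            if 'out of memory' not in str(e):
                raise
    return None

# Toy stand-in: pretend anything above batch size 4 runs out of memory.
def fake_step(bs):
    if bs > 4:
        raise RuntimeError('cuda runtime error (2) : out of memory')

print(find_max_batch(fake_step))  # -> 4
```

num_workers only affects CPU-side data loading, so it won't change GPU memory use; it's the batch size and input dimensions that matter for the OOM.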
Edit: thanks in advance.