Closed by sagrawal1993 5 years ago
Reduce the batch size
Thanks @haneSier. I reduced the batch size. It was 16, but even after reducing it to 1 I get the same error.
I have 12 GB of GPU memory, and the GPU is a Tesla K80.
You can check the state of shared memory with the command “df -h /dev/shm”. Increasing the size of shared memory may solve the problem. If not, try setting num_workers=0 (in params.py).
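If it is more convenient to check this from inside the training script, the same numbers can be read with the Python standard library. This is only a minimal sketch: `shm_usage` and `format_gib` are hypothetical helper names, not part of this repository, and the path `/dev/shm` is assumed to exist (it usually does on Linux).

```python
import os
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in bytes for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.total, usage.used, usage.free


def format_gib(n_bytes):
    """Format a byte count as GiB with two decimals."""
    return "{:.2f} GiB".format(n_bytes / 2 ** 30)


if __name__ == "__main__":
    shm_path = "/dev/shm"  # assumption: standard Linux shared-memory mount
    if os.path.exists(shm_path):
        total, used, free = shm_usage(shm_path)
        print("shared memory: total={}, used={}, free={}".format(
            format_gib(total), format_gib(used), format_gib(free)))
    else:
        print("{} not found on this system".format(shm_path))
```

If the free value is close to zero while the data loader workers are running, shared memory is a plausible suspect. If it stays mostly free, the problem is probably elsewhere.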
Thanks @Sierkinhane. I checked the shared memory: only a few MB are in use, and 30 GB is available. I also set workers=0 in params.py, but I still have the same problem.
@sagrawal1993 Hello, have you solved the problem? I am having the same trouble. Could you give me some suggestions?
Same problem. Do you have a solution yet?
Detailed error description:
Traceback (most recent call last):
+ 0x1ad2 (0x7f50f61f8ad2 in /home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/warpctc_pytorch-0.1-py3.5-linux-x86_64.egg/warpctc_pytorc$
/_warp_ctc/__warp_ctc.cpython-35m-x86_64-linux-gnu.so)
  File "crnn_main.py", line 193, in <module>
    training()
  File "crnn_main.py", line 110, in training
    cost = trainBatch(crnn, criterion, optimizer, train_iter)
  File "crnn_main.py", line 96, in trainBatch
    cost = criterion(preds, text, preds_size, length) / batch_size
  File "/home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/warpctc_pytorch-0.1-py3.5-linux-x86_64.egg/warpctc_pytorch/__init__.py", line 82, in forward
    self.length_average, self.blank)
  File "/home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/warpctc_pytorch-0.1-py3.5-linux-x86_64.egg/warpctc_pytorch/__init__.py", line 32, in forward
    blank)
  File "/home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/torch/utils/ffi/__init__.py", line 202, in safe_call
    result = torch._C._safe_call(*args, **kwargs)
torch.FatalError: CUDA error: out of memory (allocate at /pytorch/aten/src/THC/THCCachingAllocator.cpp:510)
frame #0: THCudaMalloc + 0x79 (0x7f50f7b32e99 in /home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/torch/lib/libcaffe2_gpu.so)
frame #1: gpu_ctc + 0x134 (0x7f50f61f92a4 in /home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/warpctc_pytorch-0.1-py3.5-linux-x86_64.egg/warpctc_pytorch/_warp_ctc/$ _warp_ctc.cpython-35m-x86_64-linux-gnu.so)
frame #2:
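Since the allocation fails inside `gpu_ctc` even with a batch size of 1, it may be worth ruling out malformed inputs to the criterion before looking further at memory. This is an assumption, not something confirmed in this thread. The sketch below is a plain-Python sanity check on the values passed to the loss; `check_ctc_inputs` is a hypothetical helper, and the argument layout (concatenated 1-D labels, one activation length and one label length per sample, blank index 0) is assumed to match what `crnn_main.py` passes as `text`, `preds_size`, and `length`.

```python
def check_ctc_inputs(seq_len, batch_size, num_classes,
                     labels, act_lens, label_lens, blank=0):
    """Return a list of human-readable problems found in CTC loss inputs.

    seq_len, batch_size, num_classes: dimensions of the (T, N, C) activations.
    labels: concatenated target indices for the whole batch (1-D list).
    act_lens: number of valid time steps per sample.
    label_lens: number of target labels per sample.
    """
    problems = []
    if len(act_lens) != batch_size:
        problems.append("act_lens has {} entries, expected {}".format(
            len(act_lens), batch_size))
    if len(label_lens) != batch_size:
        problems.append("label_lens has {} entries, expected {}".format(
            len(label_lens), batch_size))
    if sum(label_lens) != len(labels):
        problems.append("sum(label_lens)={} but len(labels)={}".format(
            sum(label_lens), len(labels)))
    if any(n <= 0 or n > seq_len for n in act_lens):
        problems.append("act_lens must be in 1..{}".format(seq_len))
    if any(n < 0 for n in label_lens):
        problems.append("label_lens must be non-negative")
    if any(l < 0 or l >= num_classes or l == blank for l in labels):
        problems.append("labels must be in 0..{} and must not equal blank={}".format(
            num_classes - 1, blank))
    return problems
```

In the training loop this could be called just before the criterion, for example with `preds.size(0)`, `batch_size`, `preds.size(2)`, `text.tolist()`, `preds_size.tolist()`, and `length.tolist()`, assuming `preds` has shape (T, N, C). An empty list means none of these checks failed. It does not prove the inputs are the cause or rule out a genuine memory shortage.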