Sierkinhane / CRNN_Chinese_Characters_Rec

(CRNN) Chinese Characters Recognition.

RuntimeError: CUDA out of memory occurs while the GPU memory is empty #34

Closed sagrawal1993 closed 5 years ago

sagrawal1993 commented 5 years ago

Detailed error description:

Traceback (most recent call last):
  File "crnn_main.py", line 193, in <module>
    training()
  File "crnn_main.py", line 110, in training
    cost = trainBatch(crnn, criterion, optimizer, train_iter)
  File "crnn_main.py", line 96, in trainBatch
    cost = criterion(preds, text, preds_size, length) / batch_size
  File "/home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/warpctc_pytorch-0.1-py3.5-linux-x86_64.egg/warpctc_pytorch/__init__.py", line 82, in forward
    self.length_average, self.blank)
  File "/home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/warpctc_pytorch-0.1-py3.5-linux-x86_64.egg/warpctc_pytorch/__init__.py", line 32, in forward
    blank)
  File "/home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/torch/utils/ffi/__init__.py", line 202, in safe_call
    result = torch._C._safe_call(*args, **kwargs)
torch.FatalError: CUDA error: out of memory (allocate at /pytorch/aten/src/THC/THCCachingAllocator.cpp:510)
frame #0: THCudaMalloc + 0x79 (0x7f50f7b32e99 in /home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/torch/lib/libcaffe2_gpu.so)
frame #1: gpu_ctc + 0x134 (0x7f50f61f92a4 in /home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/warpctc_pytorch-0.1-py3.5-linux-x86_64.egg/warpctc_pytorch/_warp_ctc/__warp_ctc.cpython-35m-x86_64-linux-gnu.so)
frame #2: + 0x1ad2 (0x7f50f61f8ad2 in /home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/warpctc_pytorch-0.1-py3.5-linux-x86_64.egg/warpctc_pytorch/_warp_ctc/__warp_ctc.cpython-35m-x86_64-linux-gnu.so)

frame #5: THPModule_safeCall(_object*, _object*, _object*) + 0x4c (0x7f511e7a67cc in /home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #8: python() [0x5401ef]
frame #11: python() [0x4ec358]
frame #14: THPFunction_apply(_object*, _object*) + 0x38f (0x7f511eb9383f in /home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #18: python() [0x4ec3f7]
frame #22: python() [0x4ec2e3]
frame #24: python() [0x4fbfce]
frame #26: python() [0x574db6]
frame #31: python() [0x53fc97]
frame #33: python() [0x60cb42]
frame #38: __libc_start_main + 0xf0 (0x7f513430a830 in /lib/x86_64-linux-gnu/libc.so.6)

Exception ignored in: >
Traceback (most recent call last):
  File "/home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
    self._shutdown_workers()
  File "/home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
    self.worker_result_queue.get()
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 345, in get
    return ForkingPickler.loads(res)
  File "/home/ubuntu/suraj/TrainModel/venv/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

I am using:
cuda: 8.0
python: 3.5
pytorch: 0.4.1

I am getting the error only while using CUDA; it runs fine on the CPU.
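A minimal sketch (not code from the repo) of how to confirm, from inside the same virtualenv, that the failing process really has the whole GPU free: the PyTorch counters show what this process has allocated, and the nvidia-smi query shows what the driver sees across all processes.

```python
import subprocess
import torch

# Confirm the interpreter that runs crnn_main.py actually sees the GPU.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # Memory currently held by PyTorch's caching allocator, in bytes.
    print("Allocated by PyTorch:", torch.cuda.memory_allocated())
    print("Cached by PyTorch:", torch.cuda.memory_cached())

# Memory as reported by the driver, independent of this process.
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv"]
).decode())
```

If nvidia-smi reports memory in use while the PyTorch counters are near zero, some other process (for example a stale worker left over from an earlier crashed run) may be holding it.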
haneSier commented 5 years ago

Reduce the batch size
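To see why that helps here: the tensor that crnn_main.py hands to warp-ctc (the criterion(preds, text, preds_size, length) call in the traceback) is shaped (time steps, batch, classes), and gpu_ctc, the frame where THCudaMalloc fails, allocates a GPU workspace that grows with that batch dimension. A rough, self-contained sketch of that call with illustrative sizes only (the 41 time steps and 5990-class alphabet are assumptions, not necessarily the repo's values):

```python
import torch
from warpctc_pytorch import CTCLoss

T, N, C = 41, 16, 5990            # time steps, batch size, alphabet size (illustrative)
criterion = CTCLoss()

preds = torch.randn(T, N, C).cuda()      # network output, lives on the GPU
text = torch.IntTensor([1, 2, 3] * N)    # concatenated labels, kept on the CPU
preds_size = torch.IntTensor([T] * N)    # output length for each sample
length = torch.IntTensor([3] * N)        # label length for each sample

# gpu_ctc allocates its GPU workspace for these activations, so shrinking the
# batch size N shrinks the allocation that is failing in the traceback above.
cost = criterion(preds, text, preds_size, length) / N
print(cost)
```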

sagrawal1993 commented 5 years ago

Thanks @haneSier. I reduced the batch size: it was 16, and even after reducing it to 1 I still get the same error.

I have 12 GB of GPU memory and the GPU is a Tesla K80.

Sierkinhane commented 5 years ago

You can check the state of shared memory with the command "df -h /dev/shm". Increasing the size of shared memory may solve the problem. If not, try setting num_workers=0 (in params.py).
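A minimal sketch of both suggestions, with a dummy dataset standing in for the repo's real one (none of the names below come from params.py):

```python
import subprocess
import torch
from torch.utils.data import DataLoader, TensorDataset

# The shared-memory check suggested above.
print(subprocess.check_output(["df", "-h", "/dev/shm"]).decode())

# Dummy data standing in for the real image/label dataset used by crnn_main.py.
dataset = TensorDataset(torch.randn(64, 1, 32, 100),
                        torch.zeros(64, dtype=torch.long))

# With num_workers=0 (workers=0 in params.py) batches are loaded in the main
# process, so nothing is passed through /dev/shm and the worker-related
# ConnectionRefusedError in the traceback above cannot occur.
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)

for images, labels in loader:
    pass  # a training step would go here
```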

sagrawal1993 commented 5 years ago

Thanks @Sierkinhane. I checked the shared memory: only a few MB are used out of the 30 GB available. I also set workers=0 in params.py, but I still have the same problem.

loralyc commented 5 years ago

@sagrawal1993 Hello, have you solved the problem yet? I am in the same trouble as you. Could you give me some suggestions?

kaixin-bai commented 5 years ago

Same problem, do you have a solution already?