fyu / drn

Dilated Residual Networks
https://www.vis.xyz/pub/drn
BSD 3-Clause "New" or "Revised" License
1.1k stars 219 forks

RuntimeError: CUDA error: out of memory for "DRN-D-105" while testing #49

Open bahetibhakti opened 5 years ago

bahetibhakti commented 5 years ago

Is anyone able to test the code for the "DRN-D-105" architecture on test data? I am able to train and validate, but testing fails with "RuntimeError: CUDA error: out of memory", even with a small crop size of 256×256 and a batch size of 1. I checked resources while testing and both GPU memory and system RAM are free enough. I am using an NVIDIA P100 GPU with 16 GB of memory.

Any thought?

(bhakti) user@user:/mnt/komal/bhakti/anue$ python3 segment.py test -d dataset/ -c 26 --arch drn_d_105 --resume model_best.pth.tar --phase test --batch-size 1 -j2
segment.py test -d dataset/ -c 26 --arch drn_d_105 --resume model_best.pth.tar --phase test --batch-size 1 -j2
Namespace(arch='drn_d_105', batch_size=1, bn_sync=False, classes=26, cmd='test', crop_size=896, data_dir='dataset/', epochs=10, evaluate=False, list_dir=None, load_rel=None, lr=0.01, lr_mode='step', momentum=0.9, ms=False, phase='test', pretrained='', random_rotate=0, random_scale=0, resume='model_best.pth.tar', step=200, test_suffix='', weight_decay=0.0001, with_gt=False, workers=2)
classes : 26
batch_size : 1
pretrained :
momentum : 0.9
with_gt : False
phase : test
list_dir : None
lr_mode : step
weight_decay : 0.0001
epochs : 10
step : 200
bn_sync : False
ms : False
arch : drn_d_105
random_rotate : 0
random_scale : 0
workers : 2
crop_size : 896
lr : 0.01
load_rel : None
resume : model_best.pth.tar
evaluate : False
cmd : test
data_dir : dataset/
test_suffix :
[2019-09-14 19:14:23,173 segment.py:697 test_seg] => loading checkpoint 'model_best.pth.tar'
[2019-09-14 19:14:23,509 segment.py:703 test_seg] => loaded checkpoint 'model_best.pth.tar' (epoch 1)
segment.py:540: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  image_var = Variable(image, requires_grad=False, volatile=True)
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f15eff61160>>
Traceback (most recent call last):
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
    self._shutdown_workers()
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
    self.worker_result_queue.get()
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/queues.py", line 337, in get
    return ForkingPickler.loads(res)
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/reduction.py", line 181, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/reduction.py", line 152, in recvfds
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(bytes_size))
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "segment.py", line 789, in <module>
    main()
  File "segment.py", line 785, in main
    test_seg(args)
  File "segment.py", line 720, in test_seg
    has_gt=phase != 'test' or args.with_gt, output_dir=out_dir)
  File "segment.py", line 544, in test
    final = model(image_var)[0]
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "segment.py", line 142, in forward
    y = self.up(x)
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 691, in forward
    output_padding, self.groups, self.dilation)
RuntimeError: CUDA error: out of memory

jwzhi commented 4 years ago

Same issue.

lyxbyr commented 4 years ago

Is this a bug?

lyxbyr commented 4 years ago

Same issue. Were you able to solve it?

raven38 commented 4 years ago

This is because the crop_size argument is ignored while testing; it is only applied during training, so the network is run on the full-resolution image at test time. Please refer to https://github.com/fyu/drn/blob/d75db2ee7070426db7a9264ee61cf489f8cf178c/segment.py#L632-L640 and https://github.com/fyu/drn/blob/d75db2ee7070426db7a9264ee61cf489f8cf178c/segment.py#L360-L383. A possible workaround is sketched below.
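
Given that the test path feeds the full-resolution image to the network, one workaround is to downscale the input before the forward pass and upsample the logits back afterwards. This is only a sketch under that assumption; forward_downscaled and the 0.5 scale factor are illustrative, not part of the repo:

import torch
import torch.nn.functional as F

def forward_downscaled(model, image, scale=0.5):
    """Run inference at reduced resolution to fit in GPU memory (illustrative)."""
    with torch.no_grad():
        h, w = image.shape[-2:]
        # shrink the input so the activations fit on the GPU
        small = F.interpolate(image, scale_factor=scale,
                              mode='bilinear', align_corners=False)
        logits = model(small)[0]  # same indexing as segment.py's test()
        # upsample the predictions back to the original resolution
        return F.interpolate(logits, size=(h, w),
                             mode='bilinear', align_corners=False)

Note that downscaling trades accuracy for memory, so the predictions will be coarser than full-resolution inference.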

taesungp commented 4 years ago

It's because newer PyTorch versions removed volatile, which was used to disable gradient recording. The recommended replacement is torch.no_grad().

At the bottom of segment.py, wrap the call to main() in a with torch.no_grad(): block:

if __name__ == "__main__":
    with torch.no_grad():
        main()
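
For reference, the UserWarning in the log above points at segment.py line 540, where the deprecated volatile flag is set. A more local alternative (a sketch, not the repo's code; the helper name and arguments are assumptions) is to disable autograd only around the forward pass inside test():

import torch

def predict(model, image):
    # replaces: image_var = Variable(image, requires_grad=False, volatile=True)
    with torch.no_grad():           # no autograd graph is built, saving GPU memory
        final = model(image)[0]     # same call as segment.py line 544
    return final

Either way, the key point is that inference must run without gradient tracking; otherwise the activations of all layers are kept around for backprop and the full-resolution forward pass exhausts GPU memory.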