JiaquanYe / TableMASTER-mmocr

2nd solution of ICDAR 2021 Competition on Scientific Literature Parsing, Task B.
Apache License 2.0

RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 1; 31.75 GiB total capacity; 27.05 GiB already allocated; 18.75 MiB free; 27.92 GiB reserved in total by PyTorch) #10

Closed Ycxyue closed 1 year ago

Ycxyue commented 3 years ago

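After the first epoch finishes and validation starts, training fails with a CUDA out-of-memory error. The checkpoint for the first epoch is saved, so the error occurs when the validation set is loaded after the first training epoch. Validation uses batch_size=1; mmocr==0.2.0; mmdet==2.16.0; mmcv-full==1.3.12. If I set validation to false, all 17 epochs run to completion.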

JiaquanYe commented 3 years ago

Could you post the complete error report?

Ycxyue commented 3 years ago

Thank you very much! I am training the table structure extraction part first. The full error is as follows:

"""
2021-09-06 06:42:22,383 - mmocr - INFO - Epoch [1][13/13] lr: 4.933e-04, eta: 0:02:28, time: 0.457, data_time: 0.005, memory: 12909, loss_ce: 3.1415, horizon_bbox_loss: 0.4799, vertical_bbox_loss: 0.6473, loss: 4.2686, grad_norm: 58.6805
2021-09-06 06:42:22,418 - mmocr - INFO - Saving checkpoint at 1 epochs
[ ] 0/99, elapsed: 0s, ETA:
Traceback (most recent call last):
  File "./tools/train.py", line 228, in <module>
    main()
  File "./tools/train.py", line 224, in main
    meta=meta)
  File "form_recognition/TableMASTER-mmocr-master/mmocr/apis/train.py", line 156, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/root/anaconda3/envs/table_master/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/root/anaconda3/envs/table_master/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
  File "/root/anaconda3/envs/table_master/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "form_recognition/TableMASTER-mmocr-master/mmdetection-2.11.0/mmdet/core/evaluation/eval_hooks.py", line 146, in after_train_epoch
    results = single_gpu_test(runner.model, self.dataloader, show=False)
  File "form_recognition/TableMASTER-mmocr-master/mmdetection-2.11.0/mmdet/apis/test.py", line 27, in single_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/root/anaconda3/envs/table_master/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/table_master/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 42, in forward
    return super().forward(*inputs, **kwargs)
  File "/root/anaconda3/envs/table_master/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 159, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/root/anaconda3/envs/table_master/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/table_master/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "form_recognition/TableMASTER-mmocr-master/mmocr/models/textrecog/recognizer/base.py", line 107, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "form_recognition/TableMASTER-mmocr-master/mmocr/models/textrecog/recognizer/base.py", line 85, in forward_test
    return self.simple_test(imgs, img_metas, **kwargs)
  File "form_recognition/TableMASTER-mmocr-master/mmocr/models/textrecog/recognizer/table_master.py", line 132, in simple_test
    feat, out_enc, None, img_metas, train_mode=False)
  File "/root/anaconda3/envs/table_master/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "form_recognition/TableMASTER-mmocr-master/mmocr/models/textrecog/decoders/master_decoder.py", line 327, in forward
    return self.forward_test(feat, out_enc, img_metas)
  File "form_recognition/TableMASTER-mmocr-master/mmocr/models/textrecog/decoders/master_decoder.py", line 314, in forward_test
    output, bbox_output = self.greedy_forward(SOS, out_enc, src_mask)
  File "form_recognition/TableMASTER-mmocr-master/mmocr/models/textrecog/decoders/master_decoder.py", line 287, in greedy_forward
    out, bbox_output = self.decode(input, feature, None, target_mask)
  File "form_recognition/TableMASTER-mmocr-master/mmocr/models/textrecog/decoders/master_decoder.py", line 268, in decode
    x = layer(x, feature, src_mask, tgt_mask)
  File "/root/anaconda3/envs/table_master/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "form_recognition/TableMASTER-mmocr-master/mmocr/models/textrecog/decoders/master_decoder.py", line 112, in forward
    x = self.sublayer[1](x, lambda x: self.src_attn(x, feature, feature, src_mask))
  File "/root/anaconda3/envs/table_master/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "form_recognition/TableMASTER-mmocr-master/mmocr/models/textrecog/decoders/master_decoder.py", line 44, in forward
    return x + self.dropout(sublayer(self.norm(x)))
  File "form_recognition/TableMASTER-mmocr-master/mmocr/models/textrecog/decoders/master_decoder.py", line 112, in <lambda>
    x = self.sublayer[1](x, lambda x: self.src_attn(x, feature, feature, src_mask))
  File "/root/anaconda3/envs/table_master/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "form_recognition/TableMASTER-mmocr-master/mmocr/models/textrecog/decoders/master_decoder.py", line 93, in forward
    x, self.attn = self_attention(query, key, value, mask=mask, dropout=self.dropout)
  File "form_recognition/TableMASTER-mmocr-master/mmocr/models/textrecog/decoders/master_decoder.py", line 68, in self_attention
    p_attn = F.softmax(score, dim=-1)
  File "/root/anaconda3/envs/table_master/lib/python3.7/site-packages/torch/nn/functional.py", line 1512, in softmax
    ret = input.softmax(dim)
RuntimeError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 1; 31.75 GiB total capacity; 13.39 GiB already allocated; 4.75 MiB free; 13.83 GiB reserved in total by PyTorch)
"""

This error is raised after the first epoch. After changing 'validate=(not args.no_validate)' to 'validate=False', all 17 epochs ran to completion. I tested the table structure extraction on its own and the results were acceptable. Someone suggested this is an mmocr version problem; I have tried both 0.3.0 and 0.2.0, but both raise the error.
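
The figures in the out-of-memory message above can be cross-checked with a little arithmetic. The sketch below parses a message in that format and computes how much device memory is neither free nor reserved by the failing process's PyTorch allocator. The helper name `summarize_oom` and the `outside_pytorch` field are hypothetical, introduced only for illustration, and the regular expression assumes the exact message wording quoted in this thread.

```python
import re

# Matches the out-of-memory message format quoted in this thread.
# Assumption: other PyTorch versions may word the message differently.
_OOM_RE = re.compile(
    r"Tried to allocate (?P<request>[\d.]+) (?P<request_unit>[KMG]iB) "
    r"\(GPU (?P<gpu>\d+); "
    r"(?P<total>[\d.]+) (?P<total_unit>[KMG]iB) total capacity; "
    r"(?P<allocated>[\d.]+) (?P<allocated_unit>[KMG]iB) already allocated; "
    r"(?P<free>[\d.]+) (?P<free_unit>[KMG]iB) free; "
    r"(?P<reserved>[\d.]+) (?P<reserved_unit>[KMG]iB) reserved in total by PyTorch\)"
)

# Conversion factors from each unit to GiB.
_TO_GIB = {"KiB": 1 / 1024**2, "MiB": 1 / 1024, "GiB": 1.0}


def summarize_oom(message):
    """Hypothetical helper: parse an OOM message and return sizes in GiB."""
    match = _OOM_RE.search(message)
    if match is None:
        raise ValueError("message does not match the expected OOM format")
    sizes = {
        name: float(match.group(name)) * _TO_GIB[match.group(name + "_unit")]
        for name in ("request", "total", "allocated", "free", "reserved")
    }
    sizes["gpu"] = int(match.group("gpu"))
    # Device memory that is neither free nor reserved by this process's
    # PyTorch allocator (for example other processes or the CUDA context).
    sizes["outside_pytorch"] = sizes["total"] - sizes["reserved"] - sizes["free"]
    return sizes


message = (
    "RuntimeError: CUDA out of memory. Tried to allocate 22.00 MiB "
    "(GPU 1; 31.75 GiB total capacity; 13.39 GiB already allocated; "
    "4.75 MiB free; 13.83 GiB reserved in total by PyTorch)"
)
info = summarize_oom(message)
print(f"GPU {info['gpu']}: {info['outside_pytorch']:.2f} GiB outside this process's PyTorch allocator")
# prints: GPU 1: 17.92 GiB outside this process's PyTorch allocator
```

For the message in the traceback, roughly 17.9 GiB of the 31.75 GiB card is unaccounted for by the failing process. That may point to other workloads sharing GPU 1 when validation started, although the message alone cannot confirm where that memory went.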

FangLi1 commented 3 years ago

I faced the same issue. Can someone help me?
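
The workaround reported earlier in this thread, replacing `validate=(not args.no_validate)` with `validate=False`, can likely be reached without editing the source. The following is a minimal sketch of how such a switch usually works with `argparse`. The flag spelling `--no-validate` is an assumption inferred from the attribute name `args.no_validate`, and `parse_args` and `should_validate` are hypothetical stand-ins, not the repository's actual code.

```python
import argparse


def parse_args(argv=None):
    # Minimal stand-in for a training script's argument parser; only the
    # validation switch is modelled. The flag name is an assumption.
    parser = argparse.ArgumentParser(description="Training script (sketch)")
    parser.add_argument(
        "--no-validate",
        action="store_true",
        help="skip evaluation after each training epoch",
    )
    return parser.parse_args(argv)


def should_validate(args):
    # Mirrors the expression quoted in the thread:
    # validate=(not args.no_validate)
    return not args.no_validate


print(should_validate(parse_args([])))                 # True
print(should_validate(parse_args(["--no-validate"])))  # False
```

If `tools/train.py` defines the flag this way, passing it on the command line has the same effect as hard-coding `validate=False`. Either route only skips the evaluation step where the out-of-memory error occurs; it does not reduce the memory that evaluation needs.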