SysCV / transfiner

Mask Transfiner for High-Quality Instance Segmentation, CVPR 2022
https://www.vis.xyz/pub/transfiner
Apache License 2.0

The out of memory problem #16

Closed XiaoyuZHK closed 2 years ago

XiaoyuZHK commented 2 years ago

Hi, thanks for your work. I am unfamiliar with detectron2, hence this issue. I use the same GPU, an NVIDIA RTX 2080 Ti (a single card). However, whenever I start training I run out of memory. I tried reducing the batch size to 1, but the problem persisted; training only completes once I use `MIN_SIZE_TRAIN: (100,)` and `MAX_SIZE_TRAIN: 200`.

So I suspect something is wrong here, and I would like to know why this happens and how to fix it.

Dataset: COCO

The changed config (`mask_rcnn_R_50_FPN_1x_4gpu_transfiner.yaml` → `Base-RCNN-FPN-4gpu.yaml`):

```yaml
SOLVER:
  IMS_PER_BATCH: 1  # 8 # 16
  BASE_LR: 0.0025   # 0.02
  STEPS: (60000, 80000)
  MAX_ITER: 90000
INPUT:
  # MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
  MIN_SIZE_TRAIN: (200,)
  MAX_SIZE_TRAIN: 300
DATALOADER:
  NUM_WORKERS: 1
```
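For rough intuition (back-of-envelope arithmetic, not from the repo): backbone/FPN activation memory grows roughly in proportion to the number of input pixels, so shrinking `MIN_SIZE_TRAIN`/`MAX_SIZE_TRAIN` cuts memory roughly quadratically in the scale factor. A small sketch mimicking detectron2's shortest-edge resize policy:

```python
def resized_pixels(min_size, max_size, orig_w=640, orig_h=480):
    """Mimic ResizeShortestEdge: scale so the shorter side equals
    min_size, then cap the longer side at max_size."""
    scale = min_size / min(orig_w, orig_h)
    if max(orig_w, orig_h) * scale > max_size:
        scale = max_size / max(orig_w, orig_h)
    return round(orig_w * scale) * round(orig_h * scale)

default = resized_pixels(800, 1333)  # a default COCO training size
reduced = resized_pixels(200, 300)   # the setting used above
print(default / reduced)             # roughly 16x fewer pixels
```

So going from an 800-pixel to a 200-pixel shorter edge feeds the network about 16x fewer pixels per image, which is why it fits on an 11 GB card even when batch size 1 alone did not help.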

The log:

```
[04/27 13:03:34 d2.utils.events]: eta: 4:00:48 iter: 319 total_loss: 2.127 loss_cls: 0.1886 loss_box_reg: 0.1425 loss_mask: 0.3233 loss_mask_uncertain: 0.6131 loss_mask_refine: 0.4374 loss_semantic: 0.08179 loss_rpn_cls: 0.1416 loss_rpn_loc: 0.04266 time: 0.1666 data_time: 0.0013 lr: 0.0007992 max_mem: 8277M
[04/27 13:03:39 d2.utils.events]: eta: 4:01:09 iter: 339 total_loss: 2.429 loss_cls: 0.1466 loss_box_reg: 0.1322 loss_mask: 0.3433 loss_mask_uncertain: 0.588 loss_mask_refine: 0.4261 loss_semantic: 0.1083 loss_rpn_cls: 0.1618 loss_rpn_loc: 0.1142 time: 0.1714 data_time: 0.0011 lr: 0.00084915 max_mem: 8584M
[04/27 13:03:45 d2.utils.events]: eta: 4:03:05 iter: 359 total_loss: 1.859 loss_cls: 0.1425 loss_box_reg: 0.136 loss_mask: 0.2608 loss_mask_uncertain: 0.5863 loss_mask_refine: 0.408 loss_semantic: 0.08011 loss_rpn_cls: 0.131 loss_rpn_loc: 0.02407 time: 0.1784 data_time: 0.0012 lr: 0.0008991 max_mem: 8584M
[04/27 13:03:52 d2.utils.events]: eta: 4:04:25 iter: 379 total_loss: 2.212 loss_cls: 0.2625 loss_box_reg: 0.1868 loss_mask: 0.2859 loss_mask_uncertain: 0.576 loss_mask_refine: 0.4269 loss_semantic: 0.1137 loss_rpn_cls: 0.1371 loss_rpn_loc: 0.07165 time: 0.1867 data_time: 0.0014 lr: 0.00094905 max_mem: 8584M
ERROR [04/27 13:03:54 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/media/huang/5474B47974B45F82/zhk/transfiner/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/media/huang/5474B47974B45F82/zhk/transfiner/detectron2/engine/defaults.py", line 493, in run_step
    self._trainer.run_step()
  File "/media/huang/5474B47974B45F82/zhk/transfiner/detectron2/engine/train_loop.py", line 273, in run_step
    loss_dict = self.model(data)
  File "/home/huang/anaconda3/envs/transfier/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/huang/5474B47974B45F82/zhk/transfiner/detectron2/modeling/meta_arch/rcnn.py", line 172, in forward
    _, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
  File "/home/huang/anaconda3/envs/transfier/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/huang/5474B47974B45F82/zhk/transfiner/detectron2/modeling/roi_heads/roi_heads.py", line 521, in forward
    losses.update(self._forward_mask(features, proposals))
  File "/media/huang/5474B47974B45F82/zhk/transfiner/detectron2/modeling/roi_heads/roi_heads.py", line 677, in _forward_mask
    return self.mask_head(features, instances)
  File "/home/huang/anaconda3/envs/transfier/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/huang/5474B47974B45F82/zhk/transfiner/detectron2/modeling/roi_heads/mask_head.py", line 656, in forward
    x, x_uncertain, x_hr, x_hr_l, x_hr_ll, x_c, x_p2_s, encoder, instances, self.vis_period)
  File "/media/huang/5474B47974B45F82/zhk/transfiner/detectron2/modeling/roi_heads/mask_head.py", line 584, in mask_rcnn_loss
    select_box_feats_cat, select_box_feats_cat_pos).permute(1, 2, 0).unsqueeze(-1)
  File "/home/huang/anaconda3/envs/transfier/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/huang/5474B47974B45F82/zhk/transfiner/detectron2/modeling/roi_heads/mask_head.py", line 1056, in forward
    output = layer(output, pos) #encoder
  File "/home/huang/anaconda3/envs/transfier/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/huang/5474B47974B45F82/zhk/transfiner/detectron2/modeling/roi_heads/mask_head.py", line 1031, in forward
    src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
  File "/home/huang/anaconda3/envs/transfier/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/huang/anaconda3/envs/transfier/lib/python3.7/site-packages/torch/nn/modules/dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/home/huang/anaconda3/envs/transfier/lib/python3.7/site-packages/torch/nn/functional.py", line 1076, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: CUDA out of memory. Tried to allocate 194.00 MiB (GPU 0; 10.76 GiB total capacity; 8.30 GiB already allocated; 218.69 MiB free; 8.64 GiB reserved in total by PyTorch)
[04/27 13:03:54 d2.engine.hooks]: Overall training speed: 384 iterations in 0:01:13 (0.1903 s / it)
[04/27 13:03:54 d2.engine.hooks]: Total training time: 0:01:13 (0:00:00 on hooks)
[04/27 13:03:54 d2.utils.events]: eta: 4:05:02 iter: 386 total_loss: 2.723 loss_cls: 0.2625 loss_box_reg: 0.247 loss_mask: 0.305 loss_mask_uncertain: 0.5775 loss_mask_refine: 0.4294 loss_semantic: 0.1251 loss_rpn_cls: 0.179 loss_rpn_loc: 0.1717 time: 0.1897 data_time: 0.0014 lr: 0.00096404 max_mem: 8584M
Traceback (most recent call last):
  File "tools/train_net.py", line 169, in <module>
    args=(args,),
  File "/media/huang/5474B47974B45F82/zhk/transfiner/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "tools/train_net.py", line 157, in main
    return trainer.train()
  File "/media/huang/5474B47974B45F82/zhk/transfiner/detectron2/engine/defaults.py", line 483, in train
    super().train(self.start_iter, self.max_iter)
  ... (same frames as in the traceback above) ...
RuntimeError: CUDA out of memory. Tried to allocate 194.00 MiB (GPU 0; 10.76 GiB total capacity; 8.30 GiB already allocated; 218.69 MiB free; 8.64 GiB reserved in total by PyTorch)
```

Looking forward to your reply. :)

lkeab commented 2 years ago

If you have limited GPU memory during training, you can lower the limit parameter here and here, keeping the two values the same. For example, change it to 30.
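For intuition on why a small cap helps (an illustrative sketch with made-up names, not the repo's actual code): the OOM happens inside the refinement transformer encoder, whose self-attention buffers grow quadratically in the number of selected uncertain points, so capping that count directly bounds peak activation memory.

```python
# Hypothetical helper, not part of transfiner: counts the dominant
# activation buffers one transformer encoder layer allocates for N points.
def attention_activation_floats(n_points, n_heads=4, d_model=256):
    scores = n_heads * n_points ** 2  # N x N attention scores, per head
    qkv = 3 * n_points * d_model      # Q, K, V projections, N x d_model each
    return scores + qkv

# The quadratic score term dominates as the cap grows:
for n in (30, 60, 120):
    print(n, attention_activation_floats(n))
```

With a cap of 30 the quadratic term stays small; doubling the cap quadruples the score-matrix memory, which is why keeping the two limits low (and equal) avoids the OOM without touching the rest of the config.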

XiaoyuZHK commented 2 years ago

Thanks for your reply!! It works!! :>