LikeLy-Journey / SegmenTron

Supports PointRend, Fast_SCNN, HRNet, DeepLabv3_plus (Xception, ResNet, MobileNet), ContextNet, FPENet, DABNet, EDANet, ENet, ESPNetv2, RefineNet, UNet, DANet, DFANet, HardNet, LEDNet, OCNet, EncNet, DUNet, CGNet, CCNet, BiSeNet, PSPNet, ICNet, FCN, and DeepLab.
Apache License 2.0

CUDA out of memory if args.resume is set to a checkpoint #11

Closed. yikaiw closed this issue 4 years ago.

yikaiw commented 4 years ago

Very nice repo! I hope for more models (like RefineNet) and more datasets (like SUN RGB-D and NYUDv2). One problem: CUDA memory is fine if I train without loading a model. However, when training is stopped and I set args.resume to the current checkpoint to continue training, CUDA runs out of memory. The config file I use is cityscapes_deeplabv3_plus.yaml. I don't know why.
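(A common cause of this pattern, though only a guess here and not confirmed against SegmenTron's trainer, is calling torch.load without a map_location: the checkpoint tensors are then deserialized onto the GPU they were saved from, so the weights briefly exist twice on that device during resume. A minimal sketch of loading onto the CPU first, using a toy model and a hypothetical checkpoint filename:)

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy model standing in for DeepLabV3+; only the loading pattern matters here.
model = nn.Conv2d(3, 19, kernel_size=3, padding=1).to(device)

checkpoint_path = "211_demo.pth"                 # hypothetical filename for illustration
torch.save(model.state_dict(), checkpoint_path)  # pretend an earlier run wrote this

# Without map_location, torch.load restores every saved tensor onto the GPU it
# was serialized from, which can temporarily hold a second copy of the weights
# on that device. Mapping to CPU keeps the extra copy in host memory, and
# load_state_dict then copies the values into the already-allocated model.
state_dict = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(state_dict)  # adapt the key if the checkpoint wraps the weights in a dict
```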

LikeLy-Journey commented 4 years ago


What is your environment? I cannot reproduce your problem on my P40 machine with a batch size of 4 on a single GPU.
I will add RefineNet soon.

yikaiw commented 4 years ago

Thanks for your reply. I run the code on 8 RTX 2080 GPUs (8 * 10989 MiB). Due to the CUDA memory limit, I have to reduce the batch size to 2 per GPU. In two separate runs, the code always stops at the same iteration: Epoch: 212/400 || Iters: 35/185 || Lr: 0.010177 || Loss: 0.0906 || Cost Time: 10:07:39 || Estimated Time: 9:03:16. And if I resume from the last checkpoint 211.pt, CUDA runs out of memory. My command line is: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./tools/dist_train.sh configs/cityscapes_deeplabv3_plus.yaml 8
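(Since the failure only shows up in the 8-GPU distributed run, one thing worth ruling out, again an assumption rather than something confirmed in SegmenTron's code, is all eight ranks deserializing the checkpoint onto cuda:0 at once. A sketch of mapping the checkpoint to each process's own device, assuming the launcher exposes a local rank the way torch.distributed.launch-style scripts usually do:)

```python
import os
import torch

# LOCAL_RANK is assumed to be set by the distributed launcher; older setups
# pass --local_rank as a command-line argument instead.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

checkpoint_path = "211_demo.pth"                      # hypothetical path for illustration
torch.save({"demo": torch.zeros(1)}, checkpoint_path)  # stand-in file so the snippet runs

# Map the saved tensors to this rank's GPU (or "cpu") instead of the default,
# so eight processes do not all unpack the checkpoint onto cuda:0 at once.
state = torch.load(checkpoint_path, map_location=f"cuda:{local_rank}")

# Quick per-rank diagnostic of how much memory the load itself consumed.
print(f"rank {local_rank}: {torch.cuda.memory_allocated(local_rank) / 1024**2:.1f} MiB allocated")
```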

LikeLy-Journey commented 4 years ago


OK, I will try to reproduce it first. It may be difficult to reproduce your problem, because I do not have an 11 GB GPU machine.
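(One way to approximate an 11 GB card on a larger GPU, assuming a PyTorch release of 1.8 or later, which may be newer than what this thread used, is to cap the process's memory budget. A rough sketch:)

```python
import torch

# set_per_process_memory_fraction (PyTorch >= 1.8) caps this process's
# allocations at a fraction of the device's total memory, which lets a
# P40 roughly emulate an 11 GB RTX 2080 for out-of-memory testing.
total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
target_gib = 11.0  # approximate budget of the reporter's GPUs
torch.cuda.set_per_process_memory_fraction(min(1.0, target_gib / total_gib), device=0)
```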