Closed yikaiw closed 4 years ago
Very nice repo! I hope to see more models (like RefineNet) and more datasets (like SUNRGBD and NYUDV2). One problem: CUDA memory is fine when I train without loading a model. However, when training stops and I set args.resume to the latest checkpoint to continue training, CUDA runs out of memory. The config file I use is cityscapes_deeplabv3_plus.yaml. I don't know why.
What is your environment? I cannot reproduce your problem on my P40 machine with a batch size of 4 on a single GPU.
I will add RefineNet soon.
Thanks for your reply. I run the code on 8 RTX 2080 GPUs (8 * 10989MiB). Because of the limited CUDA memory, I had to reduce the batch size to 2 per GPU. In two separate runs, training stopped at the same iteration: Epoch: 212/400 || Iters: 35/185 || Lr: 0.010177 || Loss: 0.0906 || Cost Time: 10:07:39 || Estimated Time: 9:03:16. If I then resume from the last checkpoint, 211.pt, CUDA runs out of memory. My command line is: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./tools/dist_train.sh configs/cityscapes_deeplabv3_plus.yaml 8
OK, I will try to reproduce it first. It may be difficult to reproduce your problem, because I do not have a machine with 11 GB GPUs.
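One thing that may be worth checking, although the thread does not confirm it as the cause: by default torch.load restores each tensor to the device it was saved from, so in a multi-process run every worker can place a copy of the checkpoint on the same GPU. A minimal sketch of loading with map_location="cpu" follows. The checkpoint keys ("model", "optimizer", "epoch") and the helper names are assumptions for illustration, not this repo's actual format or API.

```python
import os
import tempfile

import torch
import torch.nn as nn


def load_checkpoint_on_cpu(path):
    # map_location="cpu" keeps every loaded tensor in host memory instead of
    # restoring it to the GPU it was saved from.
    return torch.load(path, map_location="cpu")


def resume(model, optimizer, path):
    # Hypothetical helper: the key names below are assumed, not the repo's.
    checkpoint = load_checkpoint_on_cpu(path)
    # load_state_dict copies values into the existing parameters, which
    # already live on whatever device the model was moved to.
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    start_epoch = checkpoint["epoch"] + 1
    # Drop the host copy before training continues.
    del checkpoint
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return start_epoch


def demo():
    # Tiny stand-in model so the sketch runs on a machine without a GPU.
    model = nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    # One step so the optimizer has momentum buffers to save.
    model(torch.randn(3, 4)).sum().backward()
    optimizer.step()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "211.pt")
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": 211,
            },
            path,
        )
        loaded = load_checkpoint_on_cpu(path)
        devices = {t.device.type for t in loaded["model"].values()}
        start_epoch = resume(model, optimizer, path)
    return devices, start_epoch
```

If the resume code in the repo calls torch.load without map_location, passing "cpu" there is a small change to try before reducing the batch size further.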