Hi, @juraev I also ran into this problem twice. When I used batch_size=16, training would freeze and get terminated after a few iterations. After I changed the batch_size to 32, the problem never showed up again. Also, if you have already trained for >5000 iterations, you can resume training from the saved network weights (.pth file) by adding "--resume" to the command line. Hope this helps.
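For reference, a resume command might look like the sketch below. The config file, output directory, and checkpoint name are placeholders, not values confirmed in this thread; adjust them to your setup. With detectron2-style training scripts, --resume picks up the last checkpoint recorded in the output directory, and MODEL.WEIGHTS is used as a fallback starting point.

    # resume training from a previously saved checkpoint (paths are placeholders)
    python train_net.py --num-gpus 8 \
        --config-file configs/diffinst.coco.res50.yaml \
        --resume \
        OUTPUT_DIR output/diffinst_coco_res50 \
        MODEL.WEIGHTS output/diffinst_coco_res50/model_0004999.pth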
Thanks @zhangxgu
I will give it a try and get back to you :)
Hi, @zhangxgu I met the same problem when I was training models on the LVIS dataset.
LVIS+Res50: training and inference finished successfully, but the run terminated during prediction evaluation:
[12/17 21:08:07] d2.evaluation.lvis_evaluation INFO: Evaluating predictions ...
[12/17 21:08:07] d2.evaluation.lvis_evaluation INFO: Evaluating with max detections per image = 300
(this is the end of log.txt)
LVIS+Res101/Swin: the training would freeze after a few epochs.
May I ask if you have any alternative solutions, since setting batch_size=32 requires a lot of GPU memory? Thank you.
@pinecho I have not run into the first problem before. Maybe you can set the number of predefined proposals to 300 with "MODEL.DiffusionInst.NUM_PROPOSALS 300" in the inference command line. For the second problem, I use an A100 with 80GB of memory for bs=32. Alternatively, you could try changing the random seed used by the dataloaders. I have not tried this myself, but in my case the training freeze always occurs at the same iteration, so changing the seed may help.
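For concreteness, the two overrides above could be passed on the command line roughly as sketched below. The config path, weights file, and seed value are placeholders rather than settings confirmed in this thread; the seed override assumes a detectron2-style config where SEED controls the random seeding of the dataloader workers.

    # evaluation with the number of proposals capped at 300 (paths are placeholders)
    python train_net.py --eval-only \
        --config-file configs/diffinst.lvis.res50.yaml \
        MODEL.WEIGHTS output/model_final.pth \
        MODEL.DiffusionInst.NUM_PROPOSALS 300

    # retraining with a different random seed (value chosen arbitrarily)
    python train_net.py --num-gpus 8 \
        --config-file configs/diffinst.lvis.res50.yaml \
        SEED 12345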
@zhangxgu Thanks for your swift reply!
Hello. When I train on the COCO dataset, the training freezes after some iterations and is terminated after reaching a timeout.
How can I resolve this issue?