Hi, @juraev I also ran into this problem twice. When I used batch_size=16, training would freeze and get terminated after a few iterations. After I changed the batch_size to 32, the problem never showed up again. Also, if you have already trained for >5000 iterations, you can resume training from the saved network weights (.pth file) by adding "--resume" to the command line. Hope this helps.
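For reference, a resume command might look like the sketch below. The config file, output directory, and checkpoint name are placeholders, not values confirmed in this thread; adjust them to your setup. With detectron2-style training scripts, --resume picks up the last checkpoint recorded in the output directory, and MODEL.WEIGHTS is used as a fallback starting point.

    # resume training from a previously saved checkpoint (paths are placeholders)
    python train_net.py --num-gpus 8 \
        --config-file configs/diffinst.coco.res50.yaml \
        --resume \
        OUTPUT_DIR output/diffinst_coco_res50 \
        MODEL.WEIGHTS output/diffinst_coco_res50/model_0004999.pth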
Thanks @zhangxgu
I will give it a try and get back to you :)
Hi, @zhangxgu I met the same problem when I was training models on the LVIS dataset.
LVIS+Res50: training and inference finished successfully, but the run terminated during prediction evaluation:
[12/17 21:08:07] d2.evaluation.lvis_evaluation INFO: Evaluating predictions ...
[12/17 21:08:07] d2.evaluation.lvis_evaluation INFO: Evaluating with max detections per image = 300
(this is the end of log.txt)
LVIS+Res101/Swin: the training would freeze after a few epochs.
May I ask if you have any alternative solutions, since setting batch_size=32 requires a lot of GPU memory? Thank you.
@pinecho I have not run into the first problem before. Maybe you can set the number of predefined proposals to 300 with "MODEL.DiffusionInst.NUM_PROPOSALS 300" in the inference command line. For the second problem, I use an A100 with 80GB of memory for bs=32. Alternatively, you could try changing the random seed used by the dataloaders. I have not tried this myself, but in my case the training freeze always occurs at the same iteration, so changing the seed may help.
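For concreteness, the two overrides above could be passed on the command line roughly as sketched below. The config path, weights file, and seed value are placeholders rather than settings confirmed in this thread; the seed override assumes a detectron2-style config where SEED controls the random seeding of the dataloader workers.

    # evaluation with the number of proposals capped at 300 (paths are placeholders)
    python train_net.py --eval-only \
        --config-file configs/diffinst.lvis.res50.yaml \
        MODEL.WEIGHTS output/model_final.pth \
        MODEL.DiffusionInst.NUM_PROPOSALS 300

    # retraining with a different random seed (value chosen arbitrarily)
    python train_net.py --num-gpus 8 \
        --config-file configs/diffinst.lvis.res50.yaml \
        SEED 12345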
@zhangxgu Thanks for your swift reply!
Hello. When I train on the COCO dataset, the training freezes after some iterations and is terminated after reaching a timeout.
How can I resolve this issue?