dbolya / yolact

A simple, fully convolutional model for real-time instance segmentation.
MIT License
4.98k stars, 1.33k forks

[YOLACT++] Moving average ignored on custom dataset #675

Open: Aykatia opened this issue 2 years ago

Aykatia commented 2 years ago

@dbolya I am training on a custom dataset, and the loss explodes after a few hundred iterations. More precisely, the class confidence loss and the box regression loss explode at iteration 450:

iteration 440: Loss B : 1.47 Loss C: 2.55 Total Loss: 6.099

iteration 450: Loss B : 4.7671611230863776e+22 Loss C: 5.5295239645005989e+22 Total Loss: 1.0296685176870276e+23
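Between the two logged iterations the total loss grows by a factor of roughly 1.7 × 10^22. To locate the first iteration where a run diverges (for example, when comparing the logs of the two SageMaker instances), a small stdlib-only scan like the one below may help. It is a sketch that assumes the log-line format exactly as transcribed in this issue, which may differ from YOLACT's real console output; `parse_line`, `first_divergence`, and the spike threshold of 100 are hypothetical choices for illustration, not part of YOLACT.

```python
import math
import re

# Matches the log lines as transcribed in this issue, e.g.
# "iteration 440: Loss B : 1.47 Loss C: 2.55 Total Loss: 6.099".
# Assumption: YOLACT's raw console output may use a different layout.
LINE_RE = re.compile(
    r"iteration\s+(\d+):\s*Loss B\s*:\s*(\S+)\s+Loss C\s*:\s*(\S+)"
    r"\s+Total Loss\s*:\s*(\S+)"
)

# The two log lines reported in this issue.
ISSUE_LOG = [
    "iteration 440: Loss B : 1.47 Loss C: 2.55 Total Loss: 6.099",
    "iteration 450: Loss B : 4.7671611230863776e+22 "
    "Loss C: 5.5295239645005989e+22 Total Loss: 1.0296685176870276e+23",
]


def parse_line(line):
    """Return (iteration, loss_b, loss_c, total), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    iteration = int(m.group(1))
    loss_b, loss_c, total = (float(g) for g in m.groups()[1:])
    return iteration, loss_b, loss_c, total


def first_divergence(lines, max_ratio=100.0):
    """Return the first iteration whose total loss is NaN/inf or more than
    max_ratio times the previously logged total; return None if no such line.
    max_ratio=100 is an arbitrary, illustrative threshold."""
    prev_total = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that are not loss reports
        iteration, _, _, total = parsed
        if not math.isfinite(total):
            return iteration
        if prev_total is not None and prev_total > 0 and total / prev_total > max_ratio:
            return iteration
        prev_total = total
    return None


if __name__ == "__main__":
    print(first_divergence(ISSUE_LOG))  # prints 450
```

Comparing against the previous logged value (rather than a fixed absolute cap) keeps the check independent of the loss scale of a particular dataset.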

Train dataset size: 19,000; validation dataset size: 5,000

Hyperparameters:
- batch size: 32
- max_size: 550
- validation_size: 5000
- max_iter: 10,000 (I edited train.py so that max_iter can be passed as a command-line argument)
- weights: resnet101_reducedfc.pth
- number of classes: 3
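As a rough sanity check on when the divergence happens, the epoch boundaries can be derived from the numbers above. This sketch assumes that the batch size of 32 is the total number of images per iteration, split evenly over the 4 GPUs of one instance, and that a partial final batch counts as one iteration (whether it is kept in practice depends on the data loader). The helper names are hypothetical. Under these assumptions one epoch is 594 iterations, so iteration 450 falls inside the first epoch, and max_iter = 10,000 corresponds to roughly 16.8 epochs.

```python
import math

# Values reported in this issue.
TRAIN_IMAGES = 19_000
BATCH_SIZE = 32       # assumed to be the total batch per iteration
NUM_GPUS = 4          # GPUs per SageMaker instance
MAX_ITER = 10_000


def iterations_per_epoch(num_images: int, batch_size: int) -> int:
    """Iterations needed to see every training image once
    (a partial last batch counts as one iteration)."""
    return math.ceil(num_images / batch_size)


def epoch_of(iteration: int, num_images: int, batch_size: int) -> int:
    """0-based epoch index that a given iteration falls in."""
    return iteration // iterations_per_epoch(num_images, batch_size)


def per_gpu_batch(batch_size: int, num_gpus: int) -> int:
    """Images per GPU, assuming the batch is split evenly across GPUs."""
    if batch_size % num_gpus:
        raise ValueError("batch size should be divisible by the number of GPUs")
    return batch_size // num_gpus


if __name__ == "__main__":
    print(iterations_per_epoch(TRAIN_IMAGES, BATCH_SIZE))  # 594
    print(epoch_of(450, TRAIN_IMAGES, BATCH_SIZE))         # 0 (first epoch)
    print(per_gpu_batch(BATCH_SIZE, NUM_GPUS))             # 8
    print(round(MAX_ITER / iterations_per_epoch(TRAIN_IMAGES, BATCH_SIZE), 1))  # 16.8
```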

I launched the training on two SageMaker instances, each with 4 GPUs. On one instance the loss behaves well, but on the other it explodes.

Is it normal for the loss to diverge? If not, what is wrong with my training?

Any help would be greatly appreciated; this is important to me. Thank you!

Marjanmoodi commented 2 years ago

@Aykatia Could you share any documentation for running this model (YOLACT) on SageMaker instances?