@dbolya I'm training on a custom dataset and the loss explodes after a few epochs. More precisely, the class confidence loss and the box regression loss explode starting at iteration 450.
iteration 440:
Loss B: 1.47
Loss C: 2.55
Total Loss: 6.099
iteration 450:
Loss B: 4.7671611230863776e+22
Loss C: 5.5295239645005989e+22
Total Loss: 1.0296685176870276e+23
Hyperparameters:
batch_size: 32
max_size: 550
validation_size: 5000
max_iter: 10 000 (I edited the script so that max_iter can be passed as an argument to train.py)
weights: resnet101_reducedfc.pth
number of classes: 3
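For reference, the max_iter edit mentioned above can be sketched with argparse (a minimal sketch; the flag name, default, and help text are my assumptions for illustration, not the actual train.py code):

```python
import argparse

# Hedged sketch: one way to expose max_iter as a command-line flag in
# train.py. The flag name and default below are assumptions, not the
# actual YOLACT code.
parser = argparse.ArgumentParser(description='Training script')
parser.add_argument('--max_iter', type=int, default=800000,
                    help='maximum number of training iterations')

# e.g. equivalent to running: python train.py --max_iter 10000
args = parser.parse_args(['--max_iter', '10000'])
print(args.max_iter)  # 10000
```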
I launch the training on two SageMaker instances, each with 4 GPUs. On one instance the loss looks fine, but on the other it explodes.
Is it normal for the loss to diverge? If not, what is the problem with my training?
Any help would be greatly appreciated; it's important, thank you.
Train dataset size: 19 000
Val dataset size: 5 000