Open HengLan opened 6 years ago
By the way, I trained it on Windows 10 using MatConvNet-25. I do not know if this will affect the training.
This was already pointed out in previous issues. The default learning-rate values are too high, which is what causes your problem; try lower values.
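For reference, the schedule printed in the training log below (0.001 warmup, 0.001 for 73 epochs, 0.0001 for 35 epochs) is assembled from a few scalar settings, and MatConvNet's trainers accept the learning rate as a per-epoch vector. Here is a minimal sketch of how a lowered schedule could be built, assuming the "steadyLR"/"gentleLR" names mentioned below and a hypothetical "warmupLR"; check ssd_pascal_train.m for the exact option names in your checkout:

```matlab
% Sketch of a lowered learning-rate schedule; variable names are assumptions,
% not necessarily the exact ones used in ssd_pascal_train.m.
warmupLR  = 1e-4 ;   % hypothetical warmup value (printed default: 1e-3)
steadyLR  = 1e-4 ;   % main phase (printed default: 1e-3)
gentleLR  = 1e-5 ;   % final phase (printed default: 1e-4)
numSteady = 73 ;     % epoch counts taken from the printed schedule
numGentle = 35 ;

% MatConvNet trainers index the learningRate option per epoch, so the
% full schedule is just a concatenated row vector:
opts.train.learningRate = [warmupLR, ...
                           steadyLR * ones(1, numSteady), ...
                           gentleLR * ones(1, numGentle)] ;
```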
Hi, I have the same problem: the loss values become NaN during training. I tried lower values for the "steadyLR" and "gentleLR" parameters, but the problem still exists.
Hi, @mnnejati ,
You could try changing steadyLR to 0.0001 and gentleLR to 0.00001 when training the SSD. I did so, and the training process seems normal so far.
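In ssd_pascal_train.m that change would look roughly like this (a sketch only, since the surrounding code may differ between checkouts; the defaults of 0.001 and 0.0001 are the values printed in the schedule below):

```matlab
% Lowered learning rates (defaults printed in the log: 1e-3 and 1e-4)
steadyLR = 1e-4 ;   % was 1e-3
gentleLR = 1e-5 ;   % was 1e-4
```

After the change, the printed schedule should show 0.0001 for 73 epochs and 0.00001 for 35 epochs instead of the higher defaults.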
Best,
Hi, albanie,
Thanks a lot for your work.
Recently I wanted to use SSD. To make sure SSD works fine, I first tried training it on the provided VOC data without changing anything. However, during training the loss becomes strange, as follows:
ssd_pascal_train
Experiment name: ssd-pascal-0712-vt-32-300-flip-patch-distort
Training set: 0712 Testing set: 07
Prune checkpoints: 0 GPU: 1 Batch size: 32
Train + val: 1 Flip: 1 Patches: 1 Zoom: 0 Distort: 1
Learning Rate Schedule: 0.001 0.001 (warmup) 0.001 for 73 epochs 0.0001 for 35 epochs
Run experiment with these parameters? y or n
y
Warning: The model appears to be a simplenn model. Using fromSimpleNN instead.
  In dagnn.DagNN.loadobj (line 19)
  In ssd_zoo (line 29)
  In ssd_init (line 28)
  In ssd_train (line 19)
  In ssd_pascal_train (line 213)
Warning: The most recent version of vl_nnloss normalizes the loss by the batch size. The current version does not. A workaround is being used, but consider updating MatConvNet.
  In cnn_train_autonn (line 32)
  In ssd_train (line 20)
  In ssd_pascal_train (line 213)
cnn_train_autonn: resetting GPU
ans =
CUDADevice with properties:
train: epoch 01: 1/538: 2.6 (2.6) Hz conf_loss: 19.484 loc_loss: 2.778 mbox_loss: 22.263
train: epoch 01: 2/538: 2.9 (3.2) Hz conf_loss: 16.702 loc_loss: 2.812 mbox_loss: 19.514
train: epoch 01: 3/538: 3.0 (3.1) Hz conf_loss: 16.259 loc_loss: 2.886 mbox_loss: 19.145
train: epoch 01: 4/538: 3.0 (3.2) Hz conf_loss: 15.540 loc_loss: 2.796 mbox_loss: 18.336
train: epoch 01: 5/538: 3.2 (3.2) Hz conf_loss: 15.562 loc_loss: 2.812 mbox_loss: 18.374
train: epoch 01: 6/538: 3.2 (3.1) Hz conf_loss: 16.118 loc_loss: 2.863 mbox_loss: 18.981
train: epoch 01: 7/538: 3.2 (3.1) Hz conf_loss: 19.285 loc_loss: 3.871 mbox_loss: 23.156
train: epoch 01: 8/538: 3.2 (3.3) Hz conf_loss: 1031.250 loc_loss: 111.065 mbox_loss: 1142.315
train: epoch 01: 9/538: 3.2 (3.2) Hz conf_loss: 3470643036412389161829400576.000 loc_loss: 614700336259389324632522752.000 mbox_loss: 4085343446458754781300129792.000
train: epoch 01: 10/538: 3.2 (3.1) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 11/538: 3.2 (3.3) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 12/538: 3.2 (3.1) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 13/538: 3.2 (3.2) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 14/538: 3.2 (3.3) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 15/538: 3.2 (3.2) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 16/538: 3.2 (3.2) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 17/538:
Operation terminated by user during Net/eval (line 136)
Could you give me some help to solve this problem, or explain why it happens?
Thanks