Open HengLan opened 6 years ago
By the way, I trained it on Windows 10 using MatConvNet-25. I do not know if this will affect the training.
This was already pointed out in previous issues. The default learning-rate values are too high, which is what causes your problem; try lower values.
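For reference, the schedule printed in the training log below (0.001 warmup, 0.001 for 73 epochs, 0.0001 for 35 epochs) is assembled from a few scalar settings, and MatConvNet's trainers accept the learning rate as a per-epoch vector. Here is a minimal sketch of how a lowered schedule could be built, assuming the "steadyLR"/"gentleLR" names mentioned below and a hypothetical "warmupLR"; check ssd_pascal_train.m for the exact option names in your checkout:

```matlab
% Sketch of a lowered learning-rate schedule; variable names are assumptions,
% not necessarily the exact ones used in ssd_pascal_train.m.
warmupLR  = 1e-4 ;   % hypothetical warmup value (printed default: 1e-3)
steadyLR  = 1e-4 ;   % main phase (printed default: 1e-3)
gentleLR  = 1e-5 ;   % final phase (printed default: 1e-4)
numSteady = 73 ;     % epoch counts taken from the printed schedule
numGentle = 35 ;

% MatConvNet trainers index the learningRate option per epoch, so the
% full schedule is just a concatenated row vector:
opts.train.learningRate = [warmupLR, ...
                           steadyLR * ones(1, numSteady), ...
                           gentleLR * ones(1, numGentle)] ;
```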
Hi, I have the same problem: the loss values become NaN during training. I tried lower values for the "steadyLR" and "gentleLR" parameters, but the problem still exists.
Hi, @mnnejati ,
You could try changing steadyLR to 0.0001 and gentleLR to 0.00001 when training the SSD. I did so, and the training process seems normal so far.
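In ssd_pascal_train.m that change would look roughly like this (a sketch only, since the surrounding code may differ between checkouts; the defaults of 0.001 and 0.0001 are the values printed in the schedule below):

```matlab
% Lowered learning rates (defaults printed in the log: 1e-3 and 1e-4)
steadyLR = 1e-4 ;   % was 1e-3
gentleLR = 1e-5 ;   % was 1e-4
```

After the change, the printed schedule should show 0.0001 for 73 epochs and 0.00001 for 35 epochs instead of the higher defaults.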
Best,
Hi, albanie,
Thanks a lot for your work.
Recently I wanted to use SSD. To make sure SSD works fine, I first tried training it on the provided VOC data without changing anything. However, during training the loss becomes strange, as follows:
ssd_pascal_train
Experiment name: ssd-pascal-0712-vt-32-300-flip-patch-distort
Training set: 0712 Testing set: 07
Prune checkpoints: 0 GPU: 1 Batch size: 32
Train + val: 1 Flip: 1 Patches: 1 Zoom: 0 Distort: 1
Learning Rate Schedule: 0.001 0.001 (warmup) 0.001 for 73 epochs 0.0001 for 35 epochs
Run experiment with these parameters? y or n
y
Warning: The model appears to be a simplenn model. Using fromSimpleNN instead.
  In dagnn.DagNN.loadobj (line 19)
  In ssd_zoo (line 29)
  In ssd_init (line 28)
  In ssd_train (line 19)
  In ssd_pascal_train (line 213)
Warning: The most recent version of vl_nnloss normalizes the loss by the batch size. The current version does not. A workaround is being used, but consider updating MatConvNet.
  In cnn_train_autonn (line 32)
  In ssd_train (line 20)
  In ssd_pascal_train (line 213)
cnn_train_autonn: resetting GPU
ans =
CUDADevice with properties:
train: epoch 01: 1/538: 2.6 (2.6) Hz conf_loss: 19.484 loc_loss: 2.778 mbox_loss: 22.263
train: epoch 01: 2/538: 2.9 (3.2) Hz conf_loss: 16.702 loc_loss: 2.812 mbox_loss: 19.514
train: epoch 01: 3/538: 3.0 (3.1) Hz conf_loss: 16.259 loc_loss: 2.886 mbox_loss: 19.145
train: epoch 01: 4/538: 3.0 (3.2) Hz conf_loss: 15.540 loc_loss: 2.796 mbox_loss: 18.336
train: epoch 01: 5/538: 3.2 (3.2) Hz conf_loss: 15.562 loc_loss: 2.812 mbox_loss: 18.374
train: epoch 01: 6/538: 3.2 (3.1) Hz conf_loss: 16.118 loc_loss: 2.863 mbox_loss: 18.981
train: epoch 01: 7/538: 3.2 (3.1) Hz conf_loss: 19.285 loc_loss: 3.871 mbox_loss: 23.156
train: epoch 01: 8/538: 3.2 (3.3) Hz conf_loss: 1031.250 loc_loss: 111.065 mbox_loss: 1142.315
train: epoch 01: 9/538: 3.2 (3.2) Hz conf_loss: 3470643036412389161829400576.000 loc_loss: 614700336259389324632522752.000 mbox_loss: 4085343446458754781300129792.000
train: epoch 01: 10/538: 3.2 (3.1) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 11/538: 3.2 (3.3) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 12/538: 3.2 (3.1) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 13/538: 3.2 (3.2) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 14/538: 3.2 (3.3) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 15/538: 3.2 (3.2) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 16/538: 3.2 (3.2) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 17/538:
Operation terminated by user during Net/eval (line 136)
Could you give me some help to solve this problem, or explain why it happens?
Thanks