dmlc / gluon-cv

Gluon CV Toolkit
http://gluon-cv.mxnet.io
Apache License 2.0

Train SSD512 on COCO: Low validation accuracy and high smoothL1 loss compared with the official log #513

Open zhangpzh opened 5 years ago

zhangpzh commented 5 years ago

Hello,

Abstract: I tried to train SSD 512 on the MS-COCO dataset. Comparing my local train.log with the official log https://raw.githubusercontent.com/dmlc/web-data/master/gluoncv/logs/detection/ssd_512_vgg16_atrous_coco_train.log, my reproduced run converges more slowly and also reaches a lower validation mAP.

Configuration: GPU: Titan V; gluoncv: 0.4.0; mxnet: 1.5.0

My training command: python train_ssd.py --data-shape=512 --dataset=coco --gpus=0,1,2,3,4,5,6,7 --num-workers=14 --batch-size=64

The official command: python3 train_ssd.py --gpus 0,1,2,3 -j 32 --network vgg16_atrous --data-shape 512 --dataset coco --lr 0.001 --lr-decay-epoch 160,200 --lr-decay 0.1 --epochs 240 (https://raw.githubusercontent.com/dmlc/web-data/master/gluoncv/logs/detection/ssd_512_vgg16_atrous_coco.sh)

Note that the lr, lr-decay, epochs, and the vgg16_atrous network are the default configuration, so my command uses the same values implicitly.

So the possible differences are python2 vs. python3 (where I found no impact), the number of workers (which should have no impact), and the GPU architecture?
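To make the environment comparison concrete, here is a small diagnostic sketch of my own (not part of the repo) that prints the interpreter, library versions, and GPU models, so both setups can be compared line by line:

```python
# Own diagnostic sketch: print the pieces of the environment that could differ.
import subprocess
import sys

import gluoncv
import mxnet as mx

print("python :", sys.version.split()[0])     # python2 vs. python3
print("mxnet  :", mx.__version__)             # 1.5.0 here
print("gluoncv:", gluoncv.__version__)        # 0.4.0 here
print("GPUs visible to MXNet:", mx.context.num_gpus())

# GPU model names come from the driver, not from MXNet.
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"]).decode())
```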

The log comparison is summarized as follows:

  1. official log:

Start training from [Epoch 0]
... ...
[Epoch 0][Batch 499], Speed: 81.676 samples/sec, CrossEntropy=6.699, SmoothL1=2.947
[Epoch 0][Batch 599], Speed: 79.501 samples/sec, CrossEntropy=6.529, SmoothL1=2.880
[Epoch 0][Batch 699], Speed: 73.161 samples/sec, CrossEntropy=6.406, SmoothL1=2.824
[Epoch 0][Batch 799], Speed: 80.424 samples/sec, CrossEntropy=6.300, SmoothL1=2.776
[Epoch 0][Batch 899], Speed: 78.268 samples/sec, CrossEntropy=6.210, SmoothL1=2.736
[Epoch 0][Batch 999], Speed: 77.649 samples/sec, CrossEntropy=6.136, SmoothL1=2.703
... ...


~~~~ MeanAP @ IoU=[0.50,0.95] ~~~~
=17.1
[Epoch 10][Batch 99], Speed: 48.707 samples/sec, CrossEntropy=3.166, SmoothL1=1.541
[Epoch 10][Batch 199], Speed: 84.537 samples/sec, CrossEntropy=3.188, SmoothL1=1.569
[Epoch 10][Batch 299], Speed: 78.452 samples/sec, CrossEntropy=3.186, SmoothL1=1.573
[Epoch 10][Batch 399], Speed: 84.219 samples/sec, CrossEntropy=3.189, SmoothL1=1.581
[Epoch 10][Batch 499], Speed: 76.266 samples/sec, CrossEntropy=3.186, SmoothL1=1.581

2. my log:

...
...
[Epoch 0][Batch 799], Speed: 141.291 samples/sec, CrossEntropy=6.083, SmoothL1=5.510
[Epoch 0][Batch 899], Speed: 50.317 samples/sec, CrossEntropy=5.995, SmoothL1=5.486
[Epoch 0][Batch 999], Speed: 135.737 samples/sec, CrossEntropy=5.919, SmoothL1=5.448
[Epoch 0][Batch 1099], Speed: 30.575 samples/sec, CrossEntropy=5.855, SmoothL1=5.425
[Epoch 0][Batch 1199], Speed: 23.837 samples/sec, CrossEntropy=5.800, SmoothL1=5.400
[Epoch 0][Batch 1299], Speed: 143.056 samples/sec, CrossEntropy=5.750, SmoothL1=5.382
[Epoch 0][Batch 1399], Speed: 152.799 samples/sec, CrossEntropy=5.704, SmoothL1=5.357
[Epoch 0][Batch 1499], Speed: 138.315 samples/sec, CrossEntropy=5.660, SmoothL1=5.346
[Epoch 0][Batch 1599], Speed: 145.956 samples/sec, CrossEntropy=5.619, SmoothL1=5.328
[Epoch 0][Batch 1699], Speed: 151.817 samples/sec, CrossEntropy=5.581, SmoothL1=5.301
[Epoch 0][Batch 1799], Speed: 151.723 samples/sec, CrossEntropy=5.545, SmoothL1=5.285
[Epoch 0] Training cost: 2128.206, CrossEntropy=5.535, SmoothL1=5.275
...
...
toothbrush=0.0
~~~~ MeanAP @ IoU=[0.50,0.95] ~~~~
=11.1
[Epoch 11][Batch 99], Speed: 15.255 samples/sec, CrossEntropy=3.814, SmoothL1=4.368
[Epoch 11][Batch 199], Speed: 139.160 samples/sec, CrossEntropy=3.806, SmoothL1=4.360
[Epoch 11][Batch 299], Speed: 76.672 samples/sec, CrossEntropy=3.798, SmoothL1=4.312
[Epoch 11][Batch 399], Speed: 136.141 samples/sec, CrossEntropy=3.794, SmoothL1=4.342
[Epoch 11][Batch 499], Speed: 139.142 samples/sec, CrossEntropy=3.790, SmoothL1=4.332
[Epoch 11][Batch 599], Speed: 136.717 samples/sec, CrossEntropy=3.790, SmoothL1=4.302

It can be clearly observed that, around epoch 10, my validation mAP is much lower than that of the official implementation (11.1 vs. 17.1), and the SmoothL1 loss appears to be to blame: in the official log it starts from a small base (3.x) and quickly decreases to 1.x, whereas mine starts from a larger base (6.x) and decreases much more slowly (still around 4.x).
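To compare the two runs more systematically than by eye, the logs can be parsed with a throwaway helper like the sketch below (my own script; it only assumes the "[Epoch e][Batch b] ... CrossEntropy=x, SmoothL1=y" line format quoted above and the two log file names):

```python
# Own helper: extract per-epoch CrossEntropy/SmoothL1 from train.log-style files.
import re

LINE = re.compile(
    r"\[Epoch (\d+)\]\[Batch (\d+)\].*CrossEntropy=([\d.]+), SmoothL1=([\d.]+)")

def last_losses_per_epoch(path):
    """Map epoch -> (CrossEntropy, SmoothL1) taken from the last logged batch."""
    out = {}
    with open(path) as f:
        for line in f:
            m = LINE.search(line)
            if m:
                out[int(m.group(1))] = (float(m.group(3)), float(m.group(4)))
    return out

mine = last_losses_per_epoch("train.log")
official = last_losses_per_epoch("ssd_512_vgg16_atrous_coco_train.log")

# Print the two curves side by side to see where they diverge.
for epoch in sorted(mine):
    print(epoch, "mine:", mine[epoch], "official:", official.get(epoch))
```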

As a stronger ablation, I also ran the official training command exactly as given.
(I also tried two sub-versions, with and without exporting MXNET_ENABLE_GPU_P2P=0.)
The resulting logs are still very similar to those of the previous 8-GPU run.

Looking forward to your help.

BTW, I pointed the script at my already-downloaded MS-COCO copy via the mscoco.py --download-dir option. Should I run an ablation that uses a dataset freshly downloaded by mscoco.py?
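(For reference, this is roughly how I sanity-check the pre-downloaded copy; the root path is just where my data lives, so treat it as a sketch rather than the exact layout the toolkit requires.)

```python
# Sanity-check sketch for a pre-downloaded COCO copy; adjust root to your setup.
from gluoncv import data

train_dataset = data.COCODetection(root='~/.mxnet/datasets/coco',
                                   splits=['instances_train2017'])
val_dataset = data.COCODetection(root='~/.mxnet/datasets/coco',
                                 splits=['instances_val2017'])

print('train images:', len(train_dataset))          # roughly 118k for COCO 2017
print('val images  :', len(val_dataset))            # roughly 5k
print('classes     :', len(train_dataset.classes))  # 80 detection classes
```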

@zhreshold 
zhreshold commented 5 years ago

@zhangpzh Changing the number of GPUs will affect the effective batch size on each GPU if SyncBN is used, which is not present in the SSD training example right now. I think you will get similar results using 4 GPUs with a batch size of 32, or 8 GPUs with a batch size of 64 and a slightly larger learning rate.
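As an illustration of what "slightly larger" could mean, the usual linear-scaling rule of thumb looks like the sketch below (just a heuristic, not something train_ssd.py applies automatically):

```python
# Linear-scaling rule of thumb: keep lr / total_batch_size roughly constant.
# Heuristic only; if the loss diverges, back the learning rate off again.
reference_batch_size = 32   # official run: 4 GPUs, default --batch-size 32
reference_lr = 0.001        # official --lr

my_batch_size = 64          # 8-GPU run with --batch-size 64
scaled_lr = reference_lr * my_batch_size / reference_batch_size
print(scaled_lr)            # 0.002, to be passed via --lr
```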

zhangpzh commented 5 years ago

I attempted to train with base learning rates larger than 0.001 (specifically, 0.002~0.005), but all of them produce NaN CrossEntropy and SmoothL1 loss values. This suggests that 0.001 might be the largest tolerable base learning rate. Unfortunately, as mentioned above, when training with a base lr of 0.001 the starting SmoothL1 loss is 6.x, not the 3.x seen in the official log.

As for the underlying reason, I suspect it might be one of the following:

(1) the pretrained model: my reproduced SSD512 is trained from ~/.mxnet/models/vgg16_atrous-4fa2e1ad.params, which is automatically downloaded from the official project site.

(2) the balancing weight \lambda between the category-classification loss and the box-regression loss may not have been 1.0 during the official training (sketched below)?
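For (2), this is a minimal sketch of what I mean, assuming train_ssd.py uses gluoncv.loss.SSDMultiBoxLoss and that its lambd argument is the \lambda in question (please correct me if that assumption is wrong):

```python
# Sketch of the weighted SSD objective: total ~ cross_entropy + lambd * smooth_l1.
# Assumes gluoncv.loss.SSDMultiBoxLoss, whose lambd argument defaults to 1.0.
from gluoncv.loss import SSDMultiBoxLoss

mbox_loss = SSDMultiBoxLoss(negative_mining_ratio=3, lambd=1.0)

# Inside the training loop the script would then do something equivalent to:
#   sum_loss, cls_loss, box_loss = mbox_loss(cls_preds, box_preds,
#                                            cls_targets, box_targets)
```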

Could you please give me more suggestions?

@zhreshold @pluskid

zhreshold commented 5 years ago

@zhangpzh Can you run exactly the same bash command as provided, to see whether you can reproduce the official results?

zhangpzh commented 5 years ago

Oh! Of course, I've already tried the same bash command, as I said in a previous comment -- "As a stronger ablation, I also ran the official training command exactly as given ...". The results are similar to those from my original command. (I didn't format the text in markdown, which made it hard to read. My carelessness.)

Beyond that, I also considered increasing the base learning rate on top of the official bash command (changing only --lr). Using exactly the same command, I could only obtain a validation mAP of 11.1 at epoch 10, while the official log shows 17.1. And I found that, for my runs, higher learning rates bring somewhat better convergence, tabulated as follows:

| Base lr | 0.001 | 0.0012 | 0.0014 | 0.0015 | ... | official (0.001) |
| --- | --- | --- | --- | --- | --- | --- |
| mAP @ epoch 10 | 11.1 | 13.5 | 14.0 | loss NaN | ... | 17.1 |

However, when the base learning rate exceeds 0.0014, both the CrossEntropy and SmoothL1 losses become NaN, which prevents us from approaching the official results.

By the way, the Faster R-CNN detector can be reproduced well; its results are nearly identical to the official ones.

@zhreshold

zhreshold commented 5 years ago

@zhangpzh Thanks for the report. I will investigate once I get some time.