zhangpzh opened 5 years ago
@zhangpzh Changing the number of GPUs will affect the effective batch size on each GPU if SyncBN is used, which is not enabled in the SSD training example right now. I think you will get similar results using 4 GPUs with a batch size of 32, or 8 GPUs with a batch size of 64 and a slightly larger learning rate.
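For illustration, a minimal sketch of the per-GPU batch and linear learning-rate scaling arithmetic behind this suggestion (the helper below is illustrative, not part of the GluonCV scripts):

```python
# Illustrative helper: scale the learning rate linearly with the total
# batch size when the GPU count / total batch changes.
def scaled_lr(base_lr, base_batch, new_batch):
    return base_lr * new_batch / base_batch

# Official recipe: 4 GPUs, total batch 32, lr 0.001 -> 8 samples per GPU.
# 8-GPU variant:   total batch 64                   -> still 8 samples per GPU,
# but the linear-scaling heuristic suggests roughly doubling the lr.
print(scaled_lr(0.001, 32, 64))  # 0.002
```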
I attempted to train with base learning rates larger than 0.001 (specifically, 0.002~0.005), but all of them produce NaN CE and SmoothL1 loss values. This suggests that 0.001 might be the largest tolerable base learning rate. Unfortunately, as mentioned above, when training with a base lr of 0.001, the starting SmoothL1 loss is 6.x rather than the 3.x shown in the official log.
As for the underlying reason, I suspect it may be one of the following:
(1) the pretrained model. My reproduced SSD512 model is trained from ~/.mxnet/models/vgg16_atrous-4fa2e1ad.params, which is automatically downloaded from the official project site.
(2) the balancing weight \lambda between the category-classification and box-regression loss terms, which may not have been 1.0 during the official training (see the sketch at the end of this comment)?
Could you please give me more suggestions?
@zhreshold @pluskid
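For point (2), a minimal sketch of where that balance lives, assuming GluonCV's SSDMultiBoxLoss and its lambd argument (the values below are the documented defaults; I am not claiming the official run changed them):

```python
from gluoncv.loss import SSDMultiBoxLoss

# lambd weights the box-regression (SmoothL1) term against the
# cross-entropy term in the combined SSD training loss:
#   total_loss = cls_loss + lambd * box_loss
# The documented default is lambd=1.0; a different value during the
# official training would change the relative SmoothL1 magnitudes.
mbox_loss = SSDMultiBoxLoss(negative_mining_ratio=3, rho=1.0, lambd=1.0)
```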
@zhangpzh Can you follow exactly the same bash command provided to see whether you can reproduce?
Oh, of course. I have already tried the same bash command, as I said in a previous comment ("For strong ablation, I also strictly execute the official training command ...."). The results are similar to those of my original command. (I didn't format those comments in Markdown, which made them hard to read. My carelessness.)
Beyond that, I also tried increasing the base learning rate on top of the official bash command (changing only --lr). Using exactly the same command, I could only obtain a validation mAP of 11.1 at epoch 10, while the official log shows 17.1. I also found that, for my runs, higher learning rates give better convergence, tabulated as follows:
| Base_lr | 0.001 | 0.0012 | 0.0014 | 0.0015 | ... | official 0.001 |
|---|---|---|---|---|---|---|
| mAP@epoch=10 | 11.1 | 13.5 | 14.0 | loss NaN | ... | 17.1 |
However, when the base learning rate exceeds 0.0014, both the CE and SmoothL1 losses become NaN, which prevents me from approaching the official results.
By the way, the Faster R-CNN detector reproduces well, with results nearly identical to the official ones.
@zhreshold
@zhangpzh Thanks for the report. I will investigate once I get some time.
Hello,
Abstract: I tried to train SSD512 on the MS-COCO dataset. Comparing my local train.log with the official log https://raw.githubusercontent.com/dmlc/web-data/master/gluoncv/logs/detection/ssd_512_vgg16_atrous_coco_train.log, both the convergence speed and the validation mAP of my reproduced version are lower.
Configuration: GPU: Titan V, gluoncv: 0.4.0, mxnet: 1.5.0
My training command: python train_ssd.py --data-shape=512 --dataset=coco --gpus=0,1,2,3,4,5,6,7 --num-workers=14 --batch-size=64
The official command: python3 train_ssd.py --gpus 0,1,2,3 -j 32 --network vgg16_atrous --data-shape 512 --dataset coco --lr 0.001 --lr-decay-epoch 160,200 --lr-decay 0.1 --epochs 240 (https://raw.githubusercontent.com/dmlc/web-data/master/gluoncv/logs/detection/ssd_512_vgg16_atrous_coco.sh)
Note that lr, lr-decay, epochs, and vgg16_atrous are the default configuration in my command.
So the possible differences are: python2 vs. python3 (which I found has no impact), the number of workers (which should have no impact), and the GPU architecture?
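For reference, a quick comparison of the two commands in terms of per-GPU and total batch size (this assumes --batch-size in train_ssd.py is the total batch split across the listed GPUs, which is my reading of the script):

```python
# Compare the two training recipes, assuming --batch-size is the total
# batch size split evenly across the listed GPUs.
official = {"gpus": 4, "batch_size": 32, "lr": 0.001}
mine     = {"gpus": 8, "batch_size": 64, "lr": 0.001}

for name, cfg in (("official", official), ("mine", mine)):
    per_gpu = cfg["batch_size"] // cfg["gpus"]
    print(f'{name}: {per_gpu} samples/GPU, total batch {cfg["batch_size"]}, lr {cfg["lr"]}')

# Both runs use 8 samples per GPU, but my total batch is twice as large
# at the same base lr, which may also contribute to the difference.
```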
The log comparison is digested as follows:
Start training from [Epoch 0]
... ...
[Epoch 0][Batch 499], Speed: 81.676 samples/sec, CrossEntropy=6.699, SmoothL1=2.947
[Epoch 0][Batch 599], Speed: 79.501 samples/sec, CrossEntropy=6.529, SmoothL1=2.880
[Epoch 0][Batch 699], Speed: 73.161 samples/sec, CrossEntropy=6.406, SmoothL1=2.824
[Epoch 0][Batch 799], Speed: 80.424 samples/sec, CrossEntropy=6.300, SmoothL1=2.776
[Epoch 0][Batch 899], Speed: 78.268 samples/sec, CrossEntropy=6.210, SmoothL1=2.736
[Epoch 0][Batch 999], Speed: 77.649 samples/sec, CrossEntropy=6.136, SmoothL1=2.703
... ...