changlin31 / DNA

(CVPR 2020) Block-wisely Supervised Neural Architecture Search with Knowledge Distillation

Mismatch Results of DNA_c #10

Closed · hongyuanyu closed 4 years ago

hongyuanyu commented 4 years ago

Hi,

Thanks for sharing the training code. I tried to retrain DNA_c with this config:

    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 ~/imagenet --model DNA_c \
        --epochs 500 --warmup-epochs 5 --batch-size 128 --lr 0.064 --opt rmsproptf --opt-eps 0.001 \
        --sched step --decay-epochs 3 --decay-rate 0.963 --color-jitter 0.06 --drop 0.2 -j 8 \
        --num-classes 1000 --model-ema

After 500 epochs of training, the best top-1 accuracy is 77.2%, which is 0.6% lower than the paper:

    *** Best metric: 77.19799990478515 (epoch 458)

jiefengpeng commented 4 years ago

Hi, hongyuanyu. We trained on 32x RTX 2080 Ti GPUs with a batch size of 64 per GPU and an optimizer step every 2 iterations, which guarantees an effective total batch size of 4,096 and an initial learning rate of 0.256, as EfficientNet suggests. A smaller batch size and initial learning rate might reduce the final performance. You can try an optimizer step every 4 iterations with a batch size of 128 per GPU and a learning rate of 0.256 to recover the large effective batch size.
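For clarity, here is a minimal PyTorch-style sketch of what "stepping the optimizer every N iterations" (gradient accumulation) means; it is not the repository's actual training loop, and the toy model, data, and names are only placeholders:

```python
# Gradient accumulation sketch: one optimizer step per `accum_steps` iterations,
# so the effective batch size is per_gpu_batch * num_gpus * accum_steps.
import torch
import torch.nn as nn

model = nn.Linear(128, 1000)          # tiny stand-in for DNA_c, not the real network
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.256, momentum=0.9, eps=0.001)
criterion = nn.CrossEntropyLoss()

accum_steps = 4   # e.g. 8 GPUs x 128 images/GPU x 4 iterations = 4,096 effective batch

optimizer.zero_grad()
for i in range(100):                  # stand-in for the data loader
    images, targets = torch.randn(128, 128), torch.randint(0, 1000, (128,))
    loss = criterion(model(images), targets)
    (loss / accum_steps).backward()   # scale so gradients average over the large batch
    if (i + 1) % accum_steps == 0:
        optimizer.step()              # one update per accum_steps forward/backward passes
        optimizer.zero_grad()
```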

changlin31 commented 4 years ago

Hi,

As for ImageNet retraining of the searched models, we used a protocol similar to EfficientNet [30], i.e., a batch size of 4,096, an RMSprop optimizer with momentum 0.9, and an initial learning rate of 0.256 that decays by a factor of 0.97 every 2.4 epochs.
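As a rough illustration of that schedule (an assumption about how the step decay is applied, not code from this repo):

```python
# Step-decay sketch: lr = 0.256 * 0.97^(floor(epoch / 2.4)).
def step_decay_lr(epoch, base_lr=0.256, decay_rate=0.97, decay_epochs=2.4):
    return base_lr * decay_rate ** (epoch // decay_epochs)

print(step_decay_lr(0))    # 0.256 at the start of training
print(step_decay_lr(350))  # ~0.003 late in training
```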

Our training config is:

    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 ~/imagenet --model DNA_c \
        --epochs 500 --warmup-epochs 5 --batch-size 64 --lr 0.256 --opt rmsproptf --opt-eps 0.001 \
        --sched step --decay-epochs 3 --decay-rate 0.963 --color-jitter 0.06 --drop 0.2 -j 8 \
        --num-classes 1000 --model-ema

run on 4 nodes, i.e., 32 GPUs, and we step the optimizer every 2 training iterations to simulate a large batch. We achieved the highest top-1 accuracy of 77.77% at epoch 351.

The difference is the total batch size: 32 × 2 × 64 = 4,096 vs. 8 × 128 = 1,024. We decreased the learning rate accordingly using the linear scaling rule: lr = 0.256 × 1024/4096 = 0.064 in the suggested setting. The smaller total batch size was intended to make reproduction easier, but we cannot guarantee the same performance with it.
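A small sketch of that arithmetic, tying the linear scaling rule to the accumulation steps needed on a single 8-GPU node (an illustration, not project code):

```python
# Linear scaling rule: scale lr in proportion to the total batch size, and
# recover the reference batch size of 4,096 by accumulating gradients.
base_lr, base_batch = 0.256, 4096

total_batch = 8 * 128                       # 8 GPUs x 128 images/GPU = 1,024
print(base_lr * total_batch / base_batch)   # 0.064, the lr in the suggested setting
print(base_batch // total_batch)            # 4: step the optimizer every 4 iterations
```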

You can try enlarging your total batch size, or stepping your optimizer less frequently (i.e., accumulating gradients over more iterations), as suggested by @jiefengpeng.

hongyuanyu commented 4 years ago

Thanks!