CanyonWind / Single-Path-One-Shot-NAS-MXNet

Single Path One-Shot NAS MXNet implementation with a full training and searching pipeline. Supports both Block and Channel Selection. Searched models better than those in the original paper are provided.

the loss of supernet can't converge #13

Closed by cavalleria 4 years ago

cavalleria commented 4 years ago

Hi, thanks for your excellent work! I am trying to reproduce your results, but when training the supernet the loss doesn't converge and the validation top-1 error doesn't decrease. My training script is:

python train_imagenet.py \
    --rec-train ~/facedata.mxnet.hot/rec2/train.rec --rec-train-idx ~/facedata.mxnet.hot/rec2/train.idx \
    --rec-val ~/facedata.mxnet.hot/rec2/val.rec --rec-val-idx ~/facedata.mxnet.hot/rec2/val.idx \
    --mode imperative --lr 0.65 --wd 0.00004 --lr-mode cosine --dtype float16 \
    --num-epochs 120 --batch-size 64 --num-gpus 1 -j 16 \
    --label-smoothing --no-wd --warmup-epochs 5 --use-rec \
    --model ShuffleNas \
    --epoch-start-cs 60 --cs-warm-up --use-se --last-conv-after-pooling --channels-layout OneShot \
    --save-dir params_shufflenas_supernet+ --logging-file ./logs/shufflenas_supernet+.log \
    --train-upper-constraints flops-330-params-5.0 --train-bottom-constraints flops-190-params-2.8 \
    --train-constraint-method evolution

Also, when I run the test it reports an error, so I changed select_all_channels=True at lines 435 and 440 of train_imagenet.py.

CanyonWind commented 4 years ago

Hi, thanks for your interest. Could you please give --train-constraint-method random a try? I found previously that using evolution constraints from the very beginning makes convergence hard. What I did before was to train the supernet without constraints / with random constraints for the first 30/60 epochs, then switch to evolution constraints for the rest. Please feel free to let me know whether it helps.
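
For reference, the staged schedule described above, written out as a small sketch. The helper name and the 30/60 epoch thresholds are assumptions for illustration, not code from train_imagenet.py:

# Sketch only: pick the --train-constraint-method value per epoch, following
# the staged schedule described above (no constraints -> random -> evolution).
def constraint_method_for_epoch(epoch, random_start=30, evolution_start=60):
    if epoch < random_start:
        return 'none'        # train the supernet freely at first
    if epoch < evolution_start:
        return 'random'      # randomly sampled FLOPs/params constraints
    return 'evolution'       # evolution-based constraints for the rest

# epochs 0-29 -> 'none', 30-59 -> 'random', 60-119 -> 'evolution'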

CanyonWind commented 4 years ago

I tried the evolution constraints for 2 epochs; please refer to the log below.

Namespace(batch_norm=False, batch_size=64, block_choices='0, 0, 3, 1, 1, 1, 0, 0, 2, 0, 2, 1, 1, 0, 2, 0, 2, 1, 3, 2', channel_choices='6, 5, 3, 5, 2, 6, 3, 4, 2, 5, 7, 5, 4, 6, 7, 4, 4, 5, 4, 3', channels_layout='OneShot', crop_ratio=0.875, cs_warm_up=False, data_dir='~/.mxnet/datasets/imagenet', dtype='float16', epoch_start_cs=0, flop_param_method='lookup_table', hard_weight=0.5, ignore_first_two_cs=False, input_size=224, label_smoothing=True, last_conv_after_pooling=True, last_gamma=False, log_interval=50, logging_file='./logs/shufflenas_supernet+_wc.log', lr=0.65, lr_decay=0.1, lr_decay_epoch='40,60', lr_decay_period=0, lr_mode='cosine', mixup=False, mixup_alpha=0.2, mixup_off_epoch=0, mode='imperative', model='ShuffleNas', momentum=0.9, no_wd=True, num_epochs=120, num_gpus=1, num_workers=16, rec_train='/home/alex/imagenet/rec/train.rec', rec_train_idx='/home/alex/imagenet/rec/train.idx', rec_val='/home/alex/imagenet/rec/val.rec', rec_val_idx='/home/alex/imagenet/rec/val.idx', reduced_dataset_scale=1, resume_epoch=0, resume_params='', resume_states='', save_dir='params_shufflenas_supernet+_wc', save_frequency=10, teacher=None, temperature=20, train_bottom_constraints='flops-190-params-2.8', train_constraint_method='evolution', train_upper_constraints='flops-330-params-5.0', use_all_blocks=False, use_all_channels=False, use_gn=False, use_pretrained=False, use_rec=True, use_se=True, warmup_epochs=5, warmup_lr=0.0, wd=4e-05)
Epoch[0] Batch [49] Speed: 267.692524 samples/sec   accuracy=0.000937   lr=0.000325
Epoch[0] Batch [99] Speed: 420.260783 samples/sec   accuracy=0.000625   lr=0.000649
Epoch[0] Batch [149]    Speed: 434.132753 samples/sec   accuracy=0.000625   lr=0.000974
Epoch[0] Batch [199]    Speed: 455.834203 samples/sec   accuracy=0.000625   lr=0.001299
Epoch[0] Batch [249]    Speed: 444.076933 samples/sec   accuracy=0.000812   lr=0.001624
Epoch[0] Batch [299]    Speed: 446.958638 samples/sec   accuracy=0.000937   lr=0.001948
Epoch[0] Batch [349]    Speed: 440.735658 samples/sec   accuracy=0.000937   lr=0.002273
Epoch[0] Batch [399]    Speed: 442.374003 samples/sec   accuracy=0.000937   lr=0.002598
Epoch[0] Batch [449]    Speed: 435.325226 samples/sec   accuracy=0.001007   lr=0.002922
Epoch[0] Batch [499]    Speed: 439.740531 samples/sec   accuracy=0.000969   lr=0.003247
Epoch[0] Batch [549]    Speed: 449.363078 samples/sec   accuracy=0.000966   lr=0.003572
Epoch[0] Batch [599]    Speed: 427.282463 samples/sec   accuracy=0.000964   lr=0.003897
Epoch[0] Batch [649]    Speed: 439.999006 samples/sec   accuracy=0.000937   lr=0.004221
Epoch[0] Batch [699]    Speed: 454.338982 samples/sec   accuracy=0.000915   lr=0.004546
Epoch[0] Batch [749]    Speed: 442.066367 samples/sec   accuracy=0.000854   lr=0.004871
Epoch[0] Batch [799]    Speed: 447.217162 samples/sec   accuracy=0.000879   lr=0.005195
Epoch[0] Batch [849]    Speed: 418.756385 samples/sec   accuracy=0.000864   lr=0.005520
Epoch[0] Batch [899]    Speed: 430.115587 samples/sec   accuracy=0.000868   lr=0.005845
Epoch[0] Batch [949]    Speed: 422.384265 samples/sec   accuracy=0.000872   lr=0.006170
Epoch[0] Batch [999]    Speed: 442.137708 samples/sec   accuracy=0.000937   lr=0.006494
...
Epoch[0] Batch [19799]  Speed: 434.382863 samples/sec   accuracy=0.010476   lr=0.128586
Epoch[0] Batch [19849]  Speed: 442.456485 samples/sec   accuracy=0.010524   lr=0.128910
Epoch[0] Batch [19899]  Speed: 431.092918 samples/sec   accuracy=0.010570   lr=0.129235
Epoch[0] Batch [19949]  Speed: 445.330133 samples/sec   accuracy=0.010624   lr=0.129560
Epoch[0] Batch [19999]  Speed: 444.129423 samples/sec   accuracy=0.010666   lr=0.129884
[Epoch 0] training: accuracy=0.010680
[Epoch 0] speed: 437 samples/sec    time cost: 3014.399720
[Epoch 0] validation: err-top1=0.966212 err-top5=0.888407
Epoch[1] Batch [49] Speed: 441.229930 samples/sec   accuracy=0.030937   lr=0.130326
Epoch[1] Batch [99] Speed: 431.210921 samples/sec   accuracy=0.029844   lr=0.130651
Epoch[1] Batch [149]    Speed: 451.693710 samples/sec   accuracy=0.028542   lr=0.130975
Epoch[1] Batch [199]    Speed: 453.126118 samples/sec   accuracy=0.027344   lr=0.131300
Epoch[1] Batch [249]    Speed: 439.301388 samples/sec   accuracy=0.027250   lr=0.131625
Epoch[1] Batch [299]    Speed: 452.420660 samples/sec   accuracy=0.028021   lr=0.131950
Epoch[1] Batch [349]    Speed: 456.589121 samples/sec   accuracy=0.028705   lr=0.132274
Epoch[1] Batch [399]    Speed: 441.290773 samples/sec   accuracy=0.028555   lr=0.132599
Epoch[1] Batch [449]    Speed: 443.353213 samples/sec   accuracy=0.028889   lr=0.132924
Epoch[1] Batch [499]    Speed: 455.609001 samples/sec   accuracy=0.029063   lr=0.133248
Epoch[1] Batch [549]    Speed: 435.873114 samples/sec   accuracy=0.029261   lr=0.133573
Epoch[1] Batch [599]    Speed: 435.406145 samples/sec   accuracy=0.028958   lr=0.133898
Epoch[1] Batch [649]    Speed: 432.422730 samples/sec   accuracy=0.028990   lr=0.134223
Epoch[1] Batch [699]    Speed: 445.527597 samples/sec   accuracy=0.028795   lr=0.134547
Epoch[1] Batch [749]    Speed: 445.781965 samples/sec   accuracy=0.028958   lr=0.134872
Epoch[1] Batch [799]    Speed: 437.717070 samples/sec   accuracy=0.029004   lr=0.135197
Epoch[1] Batch [849]    Speed: 450.319020 samples/sec   accuracy=0.028732   lr=0.135521
Epoch[1] Batch [899]    Speed: 446.804164 samples/sec   accuracy=0.028750   lr=0.135846
Epoch[1] Batch [949]    Speed: 448.955765 samples/sec   accuracy=0.028766   lr=0.136171
Epoch[1] Batch [999]    Speed: 429.807388 samples/sec   accuracy=0.028875   lr=0.136496
...
Epoch[1] Batch [19799]  Speed: 445.965960 samples/sec   accuracy=0.054782   lr=0.258587
Epoch[1] Batch [19849]  Speed: 439.394236 samples/sec   accuracy=0.054872   lr=0.258912
Epoch[1] Batch [19899]  Speed: 431.452251 samples/sec   accuracy=0.054946   lr=0.259236
Epoch[1] Batch [19949]  Speed: 445.569749 samples/sec   accuracy=0.054993   lr=0.259561
Epoch[1] Batch [19999]  Speed: 430.832503 samples/sec   accuracy=0.055055   lr=0.259886
[Epoch 1] training: accuracy=0.055072
[Epoch 1] speed: 442 samples/sec    time cost: 2977.685636
[Epoch 1] validation: err-top1=0.901088 err-top5=0.748339
Epoch[2] Batch [49] Speed: 435.165452 samples/sec   accuracy=0.089375   lr=0.260327
Epoch[2] Batch [99] Speed: 438.497906 samples/sec   accuracy=0.090313   lr=0.260652
Epoch[2] Batch [149]    Speed: 442.370125 samples/sec   accuracy=0.088438   lr=0.260977
Epoch[2] Batch [199]    Speed: 449.992227 samples/sec   accuracy=0.084687   lr=0.261301
Epoch[2] Batch [249]    Speed: 451.044595 samples/sec   accuracy=0.084187   lr=0.261626
Epoch[2] Batch [299]    Speed: 435.895423 samples/sec   accuracy=0.083646   lr=0.261951
Epoch[2] Batch [349]    Speed: 442.705869 samples/sec   accuracy=0.083571   lr=0.262276
Epoch[2] Batch [399]    Speed: 431.949651 samples/sec   accuracy=0.083086   lr=0.262600
Epoch[2] Batch [449]    Speed: 448.379354 samples/sec   accuracy=0.083403   lr=0.262925
Epoch[2] Batch [499]    Speed: 439.455696 samples/sec   accuracy=0.083531   lr=0.263250
Epoch[2] Batch [549]    Speed: 419.410924 samples/sec   accuracy=0.082812   lr=0.263574
Epoch[2] Batch [599]    Speed: 435.331664 samples/sec   accuracy=0.082474   lr=0.263899
Epoch[2] Batch [649]    Speed: 430.067405 samples/sec   accuracy=0.082187   lr=0.264224
Epoch[2] Batch [699]    Speed: 456.241039 samples/sec   accuracy=0.082388   lr=0.264549
Epoch[2] Batch [749]    Speed: 452.860384 samples/sec   accuracy=0.081917   lr=0.264873
Epoch[2] Batch [799]    Speed: 432.486923 samples/sec   accuracy=0.081738   lr=0.265198
Epoch[2] Batch [849]    Speed: 450.029449 samples/sec   accuracy=0.081801   lr=0.265523
Epoch[2] Batch [899]    Speed: 445.616156 samples/sec   accuracy=0.081233   lr=0.265847
Epoch[2] Batch [949]    Speed: 430.188969 samples/sec   accuracy=0.081299   lr=0.266172
Epoch[2] Batch [999]    Speed: 430.283522 samples/sec   accuracy=0.081641   lr=0.266497
...
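
As an aside on the args above: block_choices and channel_choices are comma-separated index strings, one entry per block of the 20-block supernet. A minimal parsing sketch (an assumed helper, not the repo's code):

# Sketch only: turn the choice strings from the args above into index lists.
def parse_choices(choice_str):
    return [int(tok) for tok in choice_str.split(',')]

block_choices = parse_choices('0, 0, 3, 1, 1, 1, 0, 0, 2, 0, 2, 1, 1, 0, 2, 0, 2, 1, 3, 2')
channel_choices = parse_choices('6, 5, 3, 5, 2, 6, 3, 4, 2, 5, 7, 5, 4, 6, 7, 4, 4, 5, 4, 3')
assert len(block_choices) == len(channel_choices) == 20   # one choice per block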
cavalleria commented 4 years ago

> Hi, thanks for your interest. Could you please give --train-constraint-method random a try? I found previously that using evolution constraints from the very beginning makes convergence hard. What I did before was to train the supernet without constraints / with random constraints for the first 30/60 epochs, then switch to evolution constraints for the rest. Please feel free to let me know whether it helps.

Thanks for your quick reply. I noticed that you set cs-warm-up = false and epoch-start-cs = 0, so I modified my training script according to your training log and ran 3 epochs; the accuracy and validation top-1 error now look normal. I have a few questions.

1. The README describes the supernet training details as follows:

> The reason why we did this in the supernet training is that during our experiments we found, for supernet without SE, doing Block Selection from beginning works well, nevertheless doing Channel Selection from the beginning will cause the network not converging at all. The Channel Selection range needs to be gradually enlarged otherwise it will crash with free-fall drop accuracy. And the range can only be allowed for (0.6 ~ 2.0). Smaller channel scales will make the network crashing too. For supernet with SE, Channel Selection with the full choices (0.2 ~ 2.0) can be used from the beginning and it converges. However, doing this seems like harming accuracy. Compared to the same se-supernet with Channel Selection warm-up, the Channel Selection from scratch model has been always left behind 10% training accuracy during the whole procedure.

   My understanding is that if use_se = true, channel selection can be used from the beginning and the supernet converges (epoch-start-cs = 0, cs-warm-up = false), but it stays about 10% behind in training accuracy compared to the same se-supernet with channel selection warm-up (epoch-start-cs = 0, cs-warm-up = true). Is that right? (See the sketch after this list.)

2. If I train the supernet with use-se = true, epoch-start-cs = 0 and cs-warm-up = true but it can't converge, should I follow --train-constraint-method none / random / evolution (epochs 0~30 / 30~60 / 60~120) to progressively train the supernet?

3. When I use 8 Titan X GPUs, should the learning rate be increased 8 times (8 * 0.65)? I also find the GPUs are often idling. Thanks!
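
For illustration of the channel-selection warm-up the quoted README passage describes, here is a minimal sketch of gradually enlarging the candidate channel-scale range over epochs. The candidate list, step size, and helper are assumptions for illustration, not the repo's implementation:

# Sketch only (assumptions, not the repo's code): gradually enlarge the set
# of channel-scale candidates that Channel Selection may sample from.
CANDIDATE_SCALES = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]  # assumed

def allowed_scales(epoch, epoch_start_cs=60, warmup_epochs=10):
    """Channel scales allowed at a given epoch.

    Before epoch_start_cs only the widest scale is kept (i.e. no channel
    selection); afterwards the lower bound is relaxed one step per epoch
    until the full 0.2 ~ 2.0 range is reachable. Step size is an assumption.
    """
    if epoch < epoch_start_cs:
        return CANDIDATE_SCALES[-1:]
    steps = min(epoch - epoch_start_cs, warmup_epochs)
    lowest = max(0, len(CANDIDATE_SCALES) - 1 - steps)
    return CANDIDATE_SCALES[lowest:]

# epoch 60 -> [2.0], epoch 65 -> [1.0 ... 2.0], epoch 70+ -> full 0.2 ~ 2.0 range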

CanyonWind commented 4 years ago
  1. Yes, you are right. At least that is what I observed in my (few) experiments.
  2. I'm not sure about this part, because the channel selection warm-up experiment was done quite a while ago. I usually just train the se-supernet with no warm-up now. BTW, --train-constraint-method none / random / evolution (epochs 0~30 / 30~60 / 60~120) is about making the evolution constraint work, not about channel selection warm-up. Nevertheless, you are welcome to give it a try.
  3. Yes, this is still a pain for me too... Multi-GPU support for supernet training is still problematic. However, I don't have much time to spend on it right now. Sorry about that.
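
As a side note on question 3 (not confirmed anywhere in this thread): a common heuristic is the linear scaling rule, i.e. scale the base learning rate with the effective batch size and keep a warm-up (the script already uses --warmup-epochs 5). A minimal sketch with assumed numbers; whether it helps here is untested:

# Linear LR scaling heuristic (an assumption, not advice from this thread).
base_lr = 0.65          # LR used with batch size 64 on a single GPU (from the script)
per_gpu_batch = 64
num_gpus = 8            # e.g. 8 Titan X GPUs

effective_batch = per_gpu_batch * num_gpus               # 512
scaled_lr = base_lr * effective_batch / per_gpu_batch    # 0.65 * 8 = 5.2
print(scaled_lr)  # an LR this large usually needs the warm-up epochs to stay stable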
CanyonWind commented 4 years ago

Closing the issue due to no further response. Please feel free to reopen if necessary.