Closed: lukeN86 closed this issue 3 years ago.
Hello Lukas,
Thanks for your interest! We are double-checking our code and the public release to see if there is any discrepancy. We will update you shortly, stay tuned. Thank you.
Hello Lukas,
Here are the initial results from our investigation. Basically, the network you're running is different from the one reported in the paper.
The current network in the GitHub release is:
Genotype(normal=[('skip_connect', 0), ('skip_connect', 1), ('sep_conv_3x3', 1), ('skip_connect', 1), ('max_pool_3x3', 2), ('skip_connect', 1), ('sep_conv_3x3', 1), ('skip_connect', 0), ('skip_connect', 3), ('sep_conv_5x5', 4), ('skip_connect', 3), ('max_pool_3x3', 0), ('skip_connect', 3), ('sep_conv_3x3', 1)], normal_concat=[5, 6, 7, 8], reduce=[('skip_connect', 0), ('skip_connect', 1), ('sep_conv_3x3', 1), ('skip_connect', 1), ('max_pool_3x3', 2), ('skip_connect', 1), ('sep_conv_3x3', 1), ('skip_connect', 0), ('skip_connect', 3), ('sep_conv_5x5', 4), ('skip_connect', 3), ('max_pool_3x3', 0), ('skip_connect', 3), ('sep_conv_3x3', 1)], reduce_concat=[5, 6, 7, 8])
The actual Model is: Genotype(normal=[('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 2), ('sep_conv_3x3', 0), ('sep_conv_3x3', 2), ('dil_conv_5x5', 0)], normal_concat=[2, 3, 4, 5], reduce=[('sep_conv_3x3', 1), ('max_pool_3x3', 0), ('max_pool_3x3', 1), ('skip_connect', 2), ('skip_connect', 2), ('max_pool_3x3', 0), ('skip_connect', 2), ('skip_connect', 3)], reduce_concat=[2, 3, 4, 5])
The main differences between the two models are: 1) the operation space: the new one has a "dil_conv_5x5" operation which does not appear in the current network; 2) the current model shares the same architecture between the normal cell and the reduce cell, whereas our new model has different architectures for the normal and reduce cells.
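A quick sanity check of both points, as a sketch (this assumes the two genotypes above have been bound to Python variables `current` and `actual`, which are not names from the repo):

```python
# 'current' and 'actual' are the two Genotype objects listed above (assumption).
current_ops = {op for op, _ in current.normal + current.reduce}
actual_ops = {op for op, _ in actual.normal + actual.reduce}

print(actual_ops - current_ops)          # {'dil_conv_5x5'}
print(current.normal == current.reduce)  # True: same cell reused for normal and reduce
print(actual.normal == actual.reduce)    # False: distinct normal and reduce cells
```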
Therefore, you can train our new model with the same script; simply change line 118 in train.py to
genotype = Genotype(normal=[('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 2), ('sep_conv_3x3', 0), ('sep_conv_3x3', 2), ('dil_conv_5x5', 0)], normal_concat=[2, 3, 4, 5], reduce=[('sep_conv_3x3', 1), ('max_pool_3x3', 0), ('max_pool_3x3', 1), ('skip_connect', 2), ('skip_connect', 2), ('max_pool_3x3', 0), ('skip_connect', 2), ('skip_connect', 3)], reduce_concat=[2, 3, 4, 5])
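For anyone pasting this line outside of train.py, here is a minimal sketch of the container it relies on (assuming the standard DARTS-style definition; the repo's genotypes module should already provide it):

```python
from collections import namedtuple

# A cell is a list of (operation, input node) pairs; *_concat lists the
# intermediate nodes whose outputs are concatenated into the cell output.
Genotype = namedtuple('Genotype', 'normal normal_concat reduce reduce_concat')
```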
We will change the release soon...
Thank you, I've started training with the above architecture. I will keep you posted.
save path: checkpoints/auto_aug-600-[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]-128-True-0.9999-0.2-24-16
Experiment dir : checkpoints/auto_aug-600-[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]-128-True-0.9999-0.2-24-16
02/05 09:33:13 AM gpu device = 0
02/05 09:33:13 AM args = Namespace(arch='[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]', auto_augment=True, auxiliary=True, auxiliary_weight=0.4, batch_size=32, cutout=False, cutout_length=16, data='../data', drop_path_prob=0.2, epochs=600, exp_path='exp/cifar10', gpu=0, grad_clip=5, init_ch=128, layers=24, lr=0.025, model_ema=True, model_ema_decay=0.9999, model_ema_force_cpu=False, model_path='saved_models', momentum=0.9, report_freq=50, save='checkpoints/auto_aug-600-[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]-128-True-0.9999-0.2-24-16', seed=0, track_ema=False, wd=0.0003)
[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]
Genotype(normal=[('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 2), ('sep_conv_3x3', 0), ('sep_conv_3x3', 2), ('dil_conv_5x5', 0)], normal_concat=[2, 3, 4, 5], reduce=[('sep_conv_3x3', 1), ('max_pool_3x3', 0), ('max_pool_3x3', 1), ('skip_connect', 2), ('skip_connect', 2), ('max_pool_3x3', 0), ('skip_connect', 2), ('skip_connect', 3)], reduce_concat=[2, 3, 4, 5])
train from the scratch
model init params values: tensor(100545.7500, device='cuda:0')
02/05 09:33:16 AM param size = 53.277834MB
do AutoAugment!
Files already downloaded and verified
cur_epoch is 0
Please note the updated architecture has a different reduce cell. We're also running it on our end, and will update shortly.
Hello,
I've run the training with the suggested architecture for 600 epochs, and the accuracy remained virtually the same. Perhaps I am missing something else too?
# train.py, around line 118: the genotype translated from args.arch is
# overridden with the hard-coded genotype suggested above.
net = eval(args.arch)
print(net)
code = gen_code_from_list(net, node_num=int((len(net) / 4)))
# genotype = translator([code, code], max_node=int((len(net) / 4)))
genotype = Genotype(
    normal=[('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 2),
            ('sep_conv_3x3', 0), ('sep_conv_3x3', 2), ('dil_conv_5x5', 0)], normal_concat=[2, 3, 4, 5],
    reduce=[('sep_conv_3x3', 1), ('max_pool_3x3', 0), ('max_pool_3x3', 1), ('skip_connect', 2), ('skip_connect', 2),
            ('max_pool_3x3', 0), ('skip_connect', 2), ('skip_connect', 3)], reduce_concat=[2, 3, 4, 5])
print(genotype)
Genotype(normal=[('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 2), ('sep_conv_3x3', 0), ('sep_conv_3x3', 2), ('dil_conv_5x5', 0)], normal_concat=[2, 3, 4, 5], reduce=[('sep_conv_3x3', 1), ('max_pool_3x3', 0), ('max_pool_3x3', 1), ('skip_connect', 2), ('skip_connect', 2), ('max_pool_3x3', 0), ('skip_connect', 2), ('skip_connect', 3)], reduce_concat=[2, 3, 4, 5])
train from the scratch
model init params values: tensor(100545.7500, device='cuda:0')
02/05 09:33:16 AM param size = 53.277834MB
do AutoAugment!
Files already downloaded and verified
cur_epoch is 0
02/05 09:33:18 AM epoch 0 lr 2.499966e-02
...
cur_epoch is 599
02/12 02:44:55 PM epoch 599 lr 0.000000e+00
...
02/12 03:02:58 PM valid_acc: 98.140000
current best acc is 98.25
02/12 03:02:58 PM best_acc: 98.250000
saved to: trained.pt
Thank you Lukas
Hang in there, we're getting similar results. We will need to fine-tune after the 600 epochs. We'll keep you posted soon.
Hello Luke,
I'm no longer affiliated with FB, which is why it's taking us a bit longer to reproduce these results, as we only have one V100 shared within the department. Please hang in there; we're working on it. We will update you once we make an improvement. Thanks.
Hello Luke,
We are working on fine-tuning our model. So far, we load the checkpoint and set drop_path_prob to 0.25. Then, we re-train the checkpoint for 100 more epochs (you can simply set epochs to 700). With that, the best accuracy of the model improves by around 0.15. We will update you once we make an improvement. Thank you.
@lukeN86 Since we have limited GPUs, it would be great if you could fine-tune the model with 98.5 top-1 on your side using the above procedure. Thank you.
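For anyone else trying this, a minimal sketch of what that fine-tuning step could look like. The checkpoint name trained.pt, the drop_path_prob attribute, and the train()/infer() helpers are assumptions based on the repo's train.py; this is not the authors' exact recipe:

```python
import torch

# Assumes model, criterion, train_queue, valid_queue and the train()/infer()
# helpers are built exactly as in train.py for the genotype above (assumption).
model.load_state_dict(torch.load('trained.pt'))  # weights from the 600-epoch run

optimizer = torch.optim.SGD(model.parameters(), lr=0.025,
                            momentum=0.9, weight_decay=3e-4)
# Cosine schedule stretched to 700 epochs, fast-forwarded past the first 600
# so the remaining 100 epochs continue at a small learning rate.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=700)
for _ in range(600):
    scheduler.step()

for epoch in range(600, 700):
    model.drop_path_prob = 0.25  # raised from the default 0.2 for fine-tuning
    train(train_queue, model, criterion, optimizer)
    valid_acc = infer(valid_queue, model, criterion)
    scheduler.step()
```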
A quick update: we have fine-tuned the model to 98.67 after following the above steps, and we are still working on it.
Okay, we have finally fine-tuned the model to 98.82. Please follow the procedure above; it really needs a lot of patience. Thanks.
Hi, I'm a bit confused about the final procedure of getting to 98.82: could you please write it here step-by-step?
@AwesomeLemon Can you please run the training code to see the result first?
Hi,
firstly, thank you for the great work and for sharing your code!
I'm trying to train the best architecture on CIFAR-10, yet I cannot seem to get the reported numbers, so I guess I must be missing something. I tried both 1500 epochs and the 600 epochs you suggested in another ticket, but the accuracy still seems lower than 99%. Is there perhaps another hyperparameter setting I need to adjust?
```
python train.py --auxiliary --batch_size=32 --init_ch=128 --layer=24 --arch='[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]' --model_ema --model-ema-decay 0.9999 --auto_augment --epochs 600
```

```
2021-01-28 03:44:52,831 gpu device = 0
2021-01-28 03:44:52,831 args = Namespace(arch='[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]', auto_augment=True, auxiliary=True, auxiliary_weight=0.4, batch_size=32, cutout=False, cutout_length=16, data='../data', drop_path_prob=0.2, epochs=600, exp_path='exp/cifar10', gpu=0, grad_clip=5, init_ch=128, layers=24, lr=0.025, model_ema=True, model_ema_decay=0.9999, model_ema_force_cpu=False, model_path='saved_models', momentum=0.9, report_freq=50, save='checkpoints/auto_aug-600-[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]-128-True-0.9999-0.2-24-16', seed=0, track_ema=False, wd=0.0003)
2021-01-28 03:44:56,852 param size = 44.591498MB
2021-01-28 03:44:59,182 epoch 0 lr 2.499966e-02
...
2021-02-03 00:52:40,405 valid_acc: 98.410000
2021-02-03 00:52:40,405 best_acc: 98.490000
2021-02-03 00:52:42,316 epoch 599 lr 0.000000e+00
```

or for 1500 epochs respectively:

```
2021-01-11 10:40:08,600 gpu device = 0
2021-01-11 10:40:08,600 args = Namespace(arch='[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]', auto_augment=True, auxiliary=True, auxiliary_weight=0.4, batch_size=32, cutout=False, cutout_length=16, data='../data', drop_path_prob=0.2, epochs=1500, exp_path='exp/cifar10', gpu=0, grad_clip=5, init_ch=128, layers=24, lr=0.025, model_ema=True, model_ema_decay=0.9999, model_ema_force_cpu=False, model_path='saved_models', momentum=0.9, report_freq=50, save='checkpoints/auto_aug-1500-[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]-128-True-0.9999-0.2-24-16', seed=0, track_ema=False, wd=0.0003)
2021-01-11 10:40:11,371 param size = 44.591498MB
2021-01-11 10:40:12,744 epoch 0 lr 2.499995e-02
2021-01-27 07:02:55,357 valid_acc: 98.250000
2021-01-27 07:02:55,358 best_acc: 98.310000
2021-01-27 07:02:57,400 epoch 1499 lr 5.483112e-08
```
Thank you! Lukas

Hi lukeN86, did you get the result of 98.49 with the ema_model?
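For context on that question: --model_ema keeps an exponential moving average of the weights, which is evaluated separately from the raw model. A minimal sketch of the idea with decay 0.9999; this is not the repo's exact ModelEma implementation:

```python
import copy
import torch

# 'model' is the network being trained (assumption).
ema_model = copy.deepcopy(model)  # shadow copy that accumulates averaged weights
decay = 0.9999

@torch.no_grad()
def update_ema(model, ema_model, decay):
    # ema_w <- decay * ema_w + (1 - decay) * w, called after every optimizer step
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)
```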