facebookresearch / LaMCTS

The release codes of LA-MCTS with its application to Neural Architecture Search.

Reproducing CIFAR10 experiment #10

Closed lukeN86 closed 3 years ago

lukeN86 commented 3 years ago

Hi,

first of all, thank you for the great work and for sharing your code!

I'm trying to train the best architecture on CIFAR-10, yet I cannot seem to reach the reported numbers, so I guess I must be missing something. I tried both 1500 epochs and 600 epochs (as you suggested in another ticket), but the accuracy still stays below 99%. Is there perhaps another hyperparameter setting I need to adjust?

python train.py --auxiliary --batch_size=32 --init_ch=128 --layer=24 --arch='[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]' --model_ema --model-ema-decay 0.9999 --auto_augment --epochs 600

2021-01-28 03:44:52,831 gpu device = 0
2021-01-28 03:44:52,831 args = Namespace(arch='[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]', auto_augment=True, auxiliary=True, auxiliary_weight=0.4, batch_size=32, cutout=False, cutout_length=16, data='../data', drop_path_prob=0.2, epochs=600, exp_path='exp/cifar10', gpu=0, grad_clip=5, init_ch=128, layers=24, lr=0.025, model_ema=True, model_ema_decay=0.9999, model_ema_force_cpu=False, model_path='saved_models', momentum=0.9, report_freq=50, save='checkpoints/auto_aug-600-[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]-128-True-0.9999-0.2-24-16', seed=0, track_ema=False, wd=0.0003)
2021-01-28 03:44:56,852 param size = 44.591498MB
2021-01-28 03:44:59,182 epoch 0 lr 2.499966e-02
...
2021-02-03 00:52:40,405 valid_acc: 98.410000
2021-02-03 00:52:40,405 best_acc: 98.490000
2021-02-03 00:52:42,316 epoch 599 lr 0.000000e+00

Or, for 1500 epochs respectively:

2021-01-11 10:40:08,600 gpu device = 0
2021-01-11 10:40:08,600 args = Namespace(arch='[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]', auto_augment=True, auxiliary=True, auxiliary_weight=0.4, batch_size=32, cutout=False, cutout_length=16, data='../data', drop_path_prob=0.2, epochs=1500, exp_path='exp/cifar10', gpu=0, grad_clip=5, init_ch=128, layers=24, lr=0.025, model_ema=True, model_ema_decay=0.9999, model_ema_force_cpu=False, model_path='saved_models', momentum=0.9, report_freq=50, save='checkpoints/auto_aug-1500-[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]-128-True-0.9999-0.2-24-16', seed=0, track_ema=False, wd=0.0003)
2021-01-11 10:40:11,371 param size = 44.591498MB
2021-01-11 10:40:12,744 epoch 0 lr 2.499995e-02

2021-01-27 07:02:55,357 valid_acc: 98.250000
2021-01-27 07:02:55,358 best_acc: 98.310000
2021-01-27 07:02:57,400 epoch 1499 lr 5.483112e-08

Thank you! Lukas

linnanwang commented 3 years ago

Hello Lukas,

Thanks for your interest! We are double-checking our code and the public release to see if there is any discrepancy. We will update you shortly; stay tuned. Thank you.

linnanwang commented 3 years ago

Hello Lukas,

Here is an initial result from our investigation. Basically, the network you're running is different from the one reported in the paper.

The current network in the GitHub release is:

Genotype(normal=[('skip_connect', 0), ('skip_connect', 1), ('sep_conv_3x3', 1), ('skip_connect', 1), ('max_pool_3x3', 2), ('skip_connect', 1), ('sep_conv_3x3', 1), ('skip_connect', 0), ('skip_connect', 3), ('sep_conv_5x5', 4), ('skip_connect', 3), ('max_pool_3x3', 0), ('skip_connect', 3), ('sep_conv_3x3', 1)], normal_concat=[5, 6, 7, 8], reduce=[('skip_connect', 0), ('skip_connect', 1), ('sep_conv_3x3', 1), ('skip_connect', 1), ('max_pool_3x3', 2), ('skip_connect', 1), ('sep_conv_3x3', 1), ('skip_connect', 0), ('skip_connect', 3), ('sep_conv_5x5', 4), ('skip_connect', 3), ('max_pool_3x3', 0), ('skip_connect', 3), ('sep_conv_3x3', 1)], reduce_concat=[5, 6, 7, 8])

The actual Model is: Genotype(normal=[('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 2), ('sep_conv_3x3', 0), ('sep_conv_3x3', 2), ('dil_conv_5x5', 0)], normal_concat=[2, 3, 4, 5], reduce=[('sep_conv_3x3', 1), ('max_pool_3x3', 0), ('max_pool_3x3', 1), ('skip_connect', 2), ('skip_connect', 2), ('max_pool_3x3', 0), ('skip_connect', 2), ('skip_connect', 3)], reduce_concat=[2, 3, 4, 5])

The main differences between the two models are: 1) the operation space, as the new one has a "dil_conv_5x5" operation that does not appear in the current network; 2) the current model shares the same architecture in the normal cell and the reduce cell, whereas our new model has different architectures for the normal and reduce cells.

Therefore, you can train our new model with the same script; simply change line 118 in train.py to

genotype = Genotype(normal=[('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 2), ('sep_conv_3x3', 0), ('sep_conv_3x3', 2), ('dil_conv_5x5', 0)], normal_concat=[2, 3, 4, 5], reduce=[('sep_conv_3x3', 1), ('max_pool_3x3', 0), ('max_pool_3x3', 1), ('skip_connect', 2), ('skip_connect', 2), ('max_pool_3x3', 0), ('skip_connect', 2), ('skip_connect', 3)], reduce_concat=[2, 3, 4, 5])

We will change the release soon...
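
For reference, the Genotype object above is the standard DARTS-style container: each entry is an (operation, input-node-index) pair, and each *_concat lists the intermediate nodes concatenated into the cell output. A minimal sketch of its definition, assuming this repo's genotypes module mirrors the original DARTS one:

```python
# Sketch only: the DARTS-style Genotype container assumed to match this repo's definition.
from collections import namedtuple

# normal / reduce hold the edges of the normal and reduction cells;
# normal_concat / reduce_concat name the nodes whose outputs form the cell output.
Genotype = namedtuple('Genotype', 'normal normal_concat reduce reduce_concat')
```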

lukeN86 commented 3 years ago

Thank you, I've started training with the above architecture; I will keep you posted.

save path: checkpoints/auto_aug-600-[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]-128-True-0.9999-0.2-24-16
Experiment dir : checkpoints/auto_aug-600-[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]-128-True-0.9999-0.2-24-16
02/05 09:33:13 AM gpu device = 0
02/05 09:33:13 AM args = Namespace(arch='[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]', auto_augment=True, auxiliary=True, auxiliary_weight=0.4, batch_size=32, cutout=False, cutout_length=16, data='../data', drop_path_prob=0.2, epochs=600, exp_path='exp/cifar10', gpu=0, grad_clip=5, init_ch=128, layers=24, lr=0.025, model_ema=True, model_ema_decay=0.9999, model_ema_force_cpu=False, model_path='saved_models', momentum=0.9, report_freq=50, save='checkpoints/auto_aug-600-[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]-128-True-0.9999-0.2-24-16', seed=0, track_ema=False, wd=0.0003)
[2, 2, 0, 2, 1, 2, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 0, 3, 4, 3, 0, 3, 1]
Genotype(normal=[('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 2), ('sep_conv_3x3', 0), ('sep_conv_3x3', 2), ('dil_conv_5x5', 0)], normal_concat=[2, 3, 4, 5], reduce=[('sep_conv_3x3', 1), ('max_pool_3x3', 0), ('max_pool_3x3', 1), ('skip_connect', 2), ('skip_connect', 2), ('max_pool_3x3', 0), ('skip_connect', 2), ('skip_connect', 3)], reduce_concat=[2, 3, 4, 5])
train from the scratch
model init params values: tensor(100545.7500, device='cuda:0')
02/05 09:33:16 AM param size = 53.277834MB
do AutoAugment!
Files already downloaded and verified
cur_epoch is 0
linnanwang commented 3 years ago

Please note that the updated architecture has a reduce cell. We're also running it on our end and will update shortly.

lukeN86 commented 3 years ago

Hello,

I've run the training with the suggested architecture for 600 epochs, and the accuracy remained virtually the same. Perhaps I am missing something else too?

    net = eval(args.arch)
    print(net)
    code = gen_code_from_list(net, node_num=int((len(net) / 4)))
    # translator call commented out; the genotype is hard-coded to the architecture suggested above instead:
    # genotype = translator([code, code], max_node=int((len(net) / 4)))
    genotype = Genotype(
        normal=[('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 2),
                ('sep_conv_3x3', 0), ('sep_conv_3x3', 2), ('dil_conv_5x5', 0)], normal_concat=[2, 3, 4, 5],
        reduce=[('sep_conv_3x3', 1), ('max_pool_3x3', 0), ('max_pool_3x3', 1), ('skip_connect', 2), ('skip_connect', 2),
                ('max_pool_3x3', 0), ('skip_connect', 2), ('skip_connect', 3)], reduce_concat=[2, 3, 4, 5])
    print(genotype)
Genotype(normal=[('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 2), ('sep_conv_3x3', 0), ('sep_conv_3x3', 2), ('dil_conv_5x5', 0)], normal_concat=[2, 3, 4, 5], reduce=[('sep_conv_3x3', 1), ('max_pool_3x3', 0), ('max_pool_3x3', 1), ('skip_connect', 2), ('skip_connect', 2), ('max_pool_3x3', 0), ('skip_connect', 2), ('skip_connect', 3)], reduce_concat=[2, 3, 4, 5])
train from the scratch
model init params values: tensor(100545.7500, device='cuda:0')
02/05 09:33:16 AM param size = 53.277834MB
do AutoAugment!
Files already downloaded and verified
cur_epoch is 0
02/05 09:33:18 AM epoch 0 lr 2.499966e-02

...

cur_epoch is 599
02/12 02:44:55 PM epoch 599 lr 0.000000e+00
...
02/12 03:02:58 PM valid_acc: 98.140000
current best acc is 98.25
02/12 03:02:58 PM best_acc: 98.250000
saved to: trained.pt

Thank you, Lukas

linnanwang commented 3 years ago

Hang in there, we're getting similar results. The model will need fine-tuning after the 600 epochs. We'll keep you posted soon.

linnanwang commented 3 years ago

Hello Luke,

I'm no longer affiliated with FB, which is why it is taking us a bit longer to reproduce these results, as we only have one V100 shared within the department. Please hang in there; we're working on it and will update you once we make an improvement. Thanks.

aoiang commented 3 years ago

Hello Luke,

We are working on fine-tuning our model. So far, we load the checkpoint and set the drop_path_prob to 0.25. Then we re-train from the checkpoint for 100 epochs (you can simply set epochs to 700). With this, the best accuracy of the model improves by around 0.15. We will update you once we make further progress. Thank you.
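
For clarity, a minimal sketch of that fine-tuning step (not the authors' exact code; the checkpoint format, the model object, and the per-epoch training routine are assumptions based on the logs and the DARTS-style train.py above):

```python
# Hedged sketch of the fine-tuning procedure described above; assumptions are marked.
import torch

def finetune(model, train_one_epoch, ckpt_path='trained.pt',
             start_epoch=600, end_epoch=700, drop_path_prob=0.25):
    # 'trained.pt' is the file name from the log above; assuming it stores a plain state_dict.
    state = torch.load(ckpt_path, map_location='cpu')
    model.load_state_dict(state)
    for epoch in range(start_epoch, end_epoch):
        # DARTS-style scripts expose drop path as an attribute on the model;
        # here it is raised from the 0.2 used in the first 600 epochs to 0.25.
        model.drop_path_prob = drop_path_prob
        train_one_epoch(model, epoch)  # assumed per-epoch routine from train.py
```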

linnanwang commented 3 years ago

@lukeN86 Since we have limited GPUs, it would be great if you could fine-tune the 98.5 top-1 model on your side using the above procedure. Thank you.

linnanwang commented 3 years ago

A quick update: we have fine-tuned the model to 98.67 after following the above steps, and are still working on it.

linnanwang commented 3 years ago

Okay, we have finally fine-tuned the model to 98.82. Please follow the procedure above; this really needs a lot of patience. Thanks.

AwesomeLemon commented 3 years ago

Hi, I'm a bit confused about the final procedure of getting to 98.82: could you please write it here step-by-step?

linnanwang commented 3 years ago

@AwesomeLemon Can you please run the training code to see the result first?

adamas-v commented 3 years ago

Hi lukeN86, did you get the 98.49 result with the ema_model?
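
For context on the ema_model in that question: with --model_ema the script keeps an exponential-moving-average copy of the weights (decay 0.9999 in the command above) and can report accuracy for that copy separately from the raw model. A minimal sketch of the usual pattern, not this repo's exact implementation:

```python
# Hedged sketch of model-EMA tracking, assuming the common timm-style pattern.
import copy
import torch

class ModelEma:
    def __init__(self, model, decay=0.9999):
        self.ema = copy.deepcopy(model).eval()  # shadow copy, never trained directly
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # ema <- decay * ema + (1 - decay) * current weights, called after every optimizer step
        for ema_p, p in zip(self.ema.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```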