Reproducing the Results, and Questions

drcdr commented 5 years ago

I am trying to reproduce the results of PDARTS, which looks like it provides awesome performance, congratulations!

Everything here is CIFAR-10. I didn't make any significant source-code mods; all other arguments are the default based on the repository on Apr 30. (I did hard-code directory names.)

Here are the labels for what I ran (Windows-10, Pytorch-nightly from 4/30/2019, 2xTitanXP):

1) PDARTS: Just train, rerunning the (default) PDARTS genotype in genotypes.py:

python train_cifar.py --cutout
Final validation error: 2.76% (best 2.69%, epoch 533) (3.43M parameters)
Runtime: about 3 days

2) pdarts-BS64: Search and train, but using Batch-Size=64 since TitanXP is memory-limited.

python train_search.py --add_layers 6 --add_layers 12 --dropout_rate 0.1 --dropout_rate 0.4 --dropout_rate 0.7 --batch_size 64
Runtime: about 25 hours
Place the result into genotypes.py. I used the first genotype printed, not the 5 printed after 'Restricting skipconnect...'.
This was: pdarts64 = Genotype(normal=[('skip_connect', 0), ('sep_conv_3x3', 1), ('skip_connect', 0), ('sep_conv_3x3', 1), ('skip_connect', 0), ('sep_conv_3x3', 2), ('skip_connect', 0), ('dil_conv_5x5', 4)], normal_concat=range(2, 6), reduce=[('avg_pool_3x3', 0), ('avg_pool_3x3', 1), ('skip_connect', 1), ('sep_conv_5x5', 2), ('avg_pool_3x3', 0), ('dil_conv_3x3', 2), ('avg_pool_3x3', 0), ('dil_conv_3x3', 3)], reduce_concat=range(2, 6))
python train_cifar.py --arch=pdarts64 --cutout
Final validation error: 3.2% (best 3.15%, epoch 598) (2.8M parameters)
Runtime: about 3 days, 16 hours (final val error: 3.2%; best 3.15%, i.e. 96.85% accuracy, at epoch 598)

Some Questions

1) The difference between my 2.76% and your 2.5% seems significant. Any ideas why this might be? Are you reporting best-val-error, or val-error-epoch-599? Are you reporting the best error over multiple runs, or just one run; or, the mean over N runs (if so, what's N)?
2) What is the idea behind 'Restricting skipconnect'?
3) How do my timings compare with what you (or others) might be getting on TitanXP cards? Are there any optimizations you suggest? I may try train_cifar.py with only one GPU next.
4) Is the 600 epochs and Cosine Annealing absolutely necessary to achieve the advertised CIFAR-10 performance? I see DARTS and derived papers use this. It's fantastic to now have fast NAS, but when the train is 3x the search...
5) Do you expect BS>=128 (if memory was available) would improve PDARTS even more? I actually was able to run BS=96 on the first pass (the 'num_to_keep' loop), but not the second pass. What are your thoughts on having a different BS per pass?
6) I see you say you got search to run in 12 hours on a 1080-Ti, BS=64. That's twice as fast as what I got. What was your command line?
7) Just wondering, have you considered trying the Cosine Power Annealing approach? See https://arxiv.org/abs/1903.09900, equation 2, as well as the discussion there about the benefits.

Well, that's enough questions for now, I appreciate your time and consideration.

For reference, here is a plot of the learning rate and validation error for these two runs. The bold line is the result of filtfilt with a window filter of length 25.

[Figure_1: learning rate and validation error for the two runs]

198808xc commented 5 years ago

Hi @drcdr, thank you for your interest in our work and for so many good questions. I will try to answer a few of them, and Xin will comment later on some technical details.

1/3. These will be answered by Xin. The difference between 2.76% and 2.50% is indeed somewhat significant.

2. Restricting the number of skip-connects makes the searched architecture more stable. This operation is parameter-free, so we expect that too many of them can have a negative effect in the real training stage. BTW, the architecture you searched has 5 skip-connects, which is the main reason for its poor performance.

5/6. BS seems to be an important factor for both speed and stability. We will run more experiments as soon as possible. So far we have only run on our V100 GPUs and estimated the time on a 1080Ti, so that estimate is probably not very accurate. We have also seen some evidence that suggests the importance of BS. We will provide some solutions for 12GB GPUs later.

4/7. For a fair comparison, we did not change this setting. One more piece of information: we only need 1 day (on 1 V100) to train CIFAR10/100 on a searched architecture. Maybe Xin knows more about why you need 3 days.

Further questions and comments are very welcome.
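The skip-connect count mentioned in the reply above can be checked directly from the genotype strings quoted in this thread. The following is a minimal sketch: the `Genotype` namedtuple is redefined locally with the same four fields the quoted genotypes use, and `count_op` is a hypothetical helper name, not something from the repository.

```python
from collections import namedtuple

# Local stand-in with the same fields as the genotypes quoted in this thread.
Genotype = namedtuple("Genotype", "normal normal_concat reduce reduce_concat")

# Genotype found by the BS=64 search reported in the first comment.
pdarts64 = Genotype(
    normal=[('skip_connect', 0), ('sep_conv_3x3', 1), ('skip_connect', 0), ('sep_conv_3x3', 1),
            ('skip_connect', 0), ('sep_conv_3x3', 2), ('skip_connect', 0), ('dil_conv_5x5', 4)],
    normal_concat=range(2, 6),
    reduce=[('avg_pool_3x3', 0), ('avg_pool_3x3', 1), ('skip_connect', 1), ('sep_conv_5x5', 2),
            ('avg_pool_3x3', 0), ('dil_conv_3x3', 2), ('avg_pool_3x3', 0), ('dil_conv_3x3', 3)],
    reduce_concat=range(2, 6))

# Default PDARTS genotype as quoted later in this thread.
PDARTS = Genotype(
    normal=[('skip_connect', 0), ('dil_conv_3x3', 1), ('skip_connect', 0), ('sep_conv_3x3', 1),
            ('sep_conv_3x3', 1), ('sep_conv_3x3', 3), ('sep_conv_3x3', 0), ('dil_conv_5x5', 4)],
    normal_concat=range(2, 6),
    reduce=[('avg_pool_3x3', 0), ('sep_conv_5x5', 1), ('sep_conv_3x3', 0), ('dil_conv_5x5', 2),
            ('max_pool_3x3', 0), ('dil_conv_3x3', 1), ('dil_conv_3x3', 1), ('dil_conv_5x5', 3)],
    reduce_concat=range(2, 6))


def count_op(genotype, op_name="skip_connect"):
    """Count occurrences of op_name in the normal and reduce cells (hypothetical helper)."""
    cells = {"normal": genotype.normal, "reduce": genotype.reduce}
    return {cell: sum(1 for op, _ in edges if op == op_name)
            for cell, edges in cells.items()}


for name, genotype in [("PDARTS", PDARTS), ("pdarts64", pdarts64)]:
    counts = count_op(genotype)
    print(name, counts, "total:", sum(counts.values()))
```

The per-cell breakdown is kept separate because the normal cell is repeated many more times in the final network than the reduce cell, so skip-connects there plausibly matter more.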

chenxin061 commented 5 years ago

Hi @drcdr, thanks for the comments. The following are some technical details of our experiments.

Thanks for noticing the --cutout term. As mentioned in the README file, you should also add the term --auxiliary to enable the auxiliary loss tower. Besides, we use a single GPU for the evaluation. As far as I know, you may get a different test accuracy when training with more than one GPU. Our 2.50% test error is the average of 3 runs, the best of which is 2.42%.
Training takes about 40 hours on a single P100 and 24 hours on a single V100. My suggestion is to do the training on a single GPU.
The 12-hour search time on a 1080-Ti was our estimate based on previous experiments. My colleague told me that he finished the search on a 1080-Ti within 7 hours, and the README file has been updated accordingly. The command line is exactly the one in the README file. I notice that your OS is Windows, which is a significant difference between your environment and ours.
I did not try cosine power annealing, but I tried a 1000-epoch cosine annealing schedule, which further boosts the performance. That may answer the question about the necessity of the 600-epoch schedule.
Hope the above helps.

drcdr commented 5 years ago

Hi @198808xc, @chenxin061 - thanks for the great, detailed responses. Based on this, I'll do some more investigation on my side, it may take a week or two - then, I'll follow up with what I find. Thanks

drcdr commented 5 years ago

I have a related question: how are you calculating the number of trainable parameters in the model? I wrote a quick utility, and I matched your 3.4M number for PDARTS when --auxiliary is False, but I get a higher number (3.91M) when --auxiliary is True:

Arch=   RESNET-50   # Parameters = 25557032
Arch=      PDARTS   Auxiliary=False  #Parameters = 3433798
Arch=      PDARTS   Auxiliary= True  #Parameters = 3910224
Arch=    pdarts64   Auxiliary=False  #Parameters = 2800270
Arch=    pdarts64   Auxiliary= True  #Parameters = 3276696

import argparse

from torchvision import models

import genotypes
from model import NetworkCIFAR as Network

# https://stackoverflow.com/questions/49201236/check-the-total-number-of-parameters-in-a-pytorch-model
def count_parameters(model):
    """Return the total number of parameters in the model."""
    return sum(p.numel() for p in model.parameters())  # all of the params
    # return sum(p.numel() for p in model.parameters() if p.requires_grad)  # the trainable params

# just for reference
a = models.resnet50(pretrained=False)
count = count_parameters(a)
print('Arch=%12s   # Parameters = %d' % ('RESNET-50', count))

# model settings used for this count
args = argparse.Namespace(init_channels=36, layers=20, save='.')
CIFAR_CLASSES = 10

# 'pdarts64' is the name the searched genotype was saved under in genotypes.py
for arch in ['PDARTS', 'pdarts64']:
    for auxiliary in [False, True]:
        genotype = getattr(genotypes, arch)  # look up the genotype by name, without eval
        model = Network(args.init_channels, CIFAR_CLASSES, args.layers, auxiliary, genotype)
        count = count_parameters(model)
        print('Arch=%12s   Auxiliary=%5s  #Parameters = %d' % (arch, auxiliary, count))

chenxin061 commented 5 years ago

@drcdr Yes, if the auxiliary tower is included, the parameter count will be larger. However, the auxiliary tower is used only for network training, not for testing, so we do not count those extra parameters for the testing phase. In fact, you will get the same test accuracy without the --auxiliary term. You need to modify a few lines in the model-loading code to handle the absence of the --auxiliary term.
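Counting only the parameters used at test time can be sketched as below. This is an illustration under assumptions: `count_parameters` here is a hypothetical variant of the utility above that takes `(name, parameter)` pairs, and it assumes the auxiliary tower's parameter names start with `auxiliary_head`. The demo uses stand-in objects so it runs without PyTorch; the parameter names are placeholders, and the two sizes are chosen only so that the totals match the PDARTS numbers reported above (3910224 with the tower, 3433798 without).

```python
def count_parameters(named_params, exclude_prefixes=()):
    """Sum numel() over (name, param) pairs, skipping names that start with an excluded prefix."""
    if isinstance(exclude_prefixes, str):
        exclude_prefixes = (exclude_prefixes,)
    prefixes = tuple(exclude_prefixes)
    return sum(p.numel() for name, p in named_params
               if not (prefixes and name.startswith(prefixes)))


class FakeParam:
    """Stand-in for a tensor: only provides numel(), so the demo needs no PyTorch."""

    def __init__(self, n):
        self._n = n

    def numel(self):
        return self._n


# Placeholder names; sizes picked to reproduce the totals reported in this thread.
demo_params = [
    ("cells_and_classifier", FakeParam(3433798)),
    ("auxiliary_head.placeholder", FakeParam(476426)),
]

print("with auxiliary tower:   ", count_parameters(demo_params))
print("without auxiliary tower:", count_parameters(demo_params, "auxiliary_head"))

# With a real PyTorch model the call would look like:
#   count_parameters(model.named_parameters(), exclude_prefixes=("auxiliary_head",))
```

The same difference of 476426 parameters separates the two pdarts64 counts above (3276696 vs. 2800270), which is consistent with the tower having a fixed size independent of the genotype.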

drcdr commented 5 years ago

Well, for some reason PyTorch crashed at iteration #551, with --auxiliary. I'm trying to figure out whether warm restarts can be easily implemented. It looks like just CosineAnnealingLR() and torch.optim.SGD() would be affected (as well as torch.load'ing the checkpoint, and setting up the model from the state_dict)?

chenxin061 commented 5 years ago

Yes, you can resume the training from the checkpoint saved in the --save path, just as you described above.
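When resuming, the learning rate has to continue from where the schedule left off rather than restart at its initial value. A minimal sketch of the closed-form cosine annealing schedule (the formula documented for PyTorch's CosineAnnealingLR when stepped once per epoch without restarts) is shown below. The base learning rate of 0.025 and the 600-epoch length are assumptions used for illustration, and `cosine_lr` is a hypothetical helper name.

```python
import math


def cosine_lr(epoch, base_lr, total_epochs, eta_min=0.0):
    """Closed-form cosine annealing: learning rate to use during the given epoch."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / total_epochs))


BASE_LR = 0.025       # assumed initial learning rate
TOTAL_EPOCHS = 600    # assumed schedule length
RESUME_EPOCH = 551    # epoch at which the crash was reported above

print("lr at start:      ", cosine_lr(0, BASE_LR, TOTAL_EPOCHS))
print("lr at resume:     ", cosine_lr(RESUME_EPOCH, BASE_LR, TOTAL_EPOCHS))
print("lr if restarted:  ", cosine_lr(0, BASE_LR, TOTAL_EPOCHS))
```

In practice one way to get the same effect is to save and restore the scheduler's state_dict() together with the model and optimizer state, so the scheduler resumes at the stored epoch; the closed form above is mainly useful as a sanity check on the resumed learning rate.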

drcdr commented 5 years ago

@chenxin061 OK, here's an update (thanks for your feedback). Modifications:

train_cifar.py: optionally resume from a checkpoint; added CosineAnnealing support for this too; added support for single-GPU training
test.py: modified to ignore auxiliary_head* weights, if auxiliary == false
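The test.py change described above can be sketched as a filter over the checkpoint's state dict. This is not the actual modification, only an illustration: `drop_auxiliary_weights` is a hypothetical helper, the checkpoint keys below are placeholders, and the `auxiliary_head` prefix is taken from the description above. If the model was saved while wrapped in nn.DataParallel, the keys carry an extra `module.` prefix and the prefix argument would need to be adjusted.

```python
def drop_auxiliary_weights(state_dict, prefix="auxiliary_head"):
    """Return a copy of state_dict without entries whose key starts with prefix."""
    return {key: value for key, value in state_dict.items()
            if not key.startswith(prefix)}


# Placeholder checkpoint: keys and values are made up for the demo.
checkpoint = {
    "stem.0.weight": [0.1, 0.2],
    "classifier.weight": [0.3],
    "auxiliary_head.classifier.weight": [0.4],
    "auxiliary_head.classifier.bias": [0.5],
}

filtered = drop_auxiliary_weights(checkpoint)
print(sorted(filtered))

# With PyTorch, the filtered dict would then be passed to model.load_state_dict(...)
# on a network that was built with auxiliary=False.
```

An alternative is to call load_state_dict with strict=False, but filtering explicitly has the advantage that any other unexpected or missing key still raises an error.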

Results:

Due to --auxiliary, there was not enough memory for BS=128, so I dropped down to BS=96.
I got better results: 2.56% final test error, but not the 2.42% final test error that you got. It did reach 2.45% for a couple of epochs earlier, but crept up at the end. I don't think this is due to the resume-from-checkpoint.
Gray shading means not done by me. White cells are based on results that I got.
I may add to this table later. Problems that I'd still like to figure out: why search seems so much slower on my machine, and whether dropping the skip-connects will help for the architectures I found. I might look into Cosine Power Annealing, too.

[im1: results table]

chenxin061 commented 5 years ago

@drcdr I think the experimental results you got when evaluating on CIFAR10 are acceptable.

For one thing, a different batch size may lead to a different set of optimal hyper-parameters, resulting in slightly different performance.
Besides, it is quite common for the test accuracy to fluctuate across runs on CIFAR10. The average test error we got over 3 runs is 2.50%, and the 2.56% test error you got for a single run is quite close to ours. The checkpoint we released with 2.42% test error is one of our best models.
For search cost:
My colleague updated the search time from 7 hours to 11 hours on a single 1080Ti GPU. The result I gave in my previous reply was obtained with a different configuration.
The evaluation speed on a single 1080Ti GPU is about 300s/epoch, resulting in a total cost of about 2 days. I guess the bottleneck of your system may be memory or CPU, since the 1080Ti GPU and the Titan XP GPU seem to perform similarly according to previous reports.
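The timing figures in this thread can be compared on a per-epoch basis with a little arithmetic. The sketch below assumes the 600-epoch schedule discussed earlier; the helper names are hypothetical.

```python
EPOCHS = 600  # schedule length discussed earlier in this thread


def total_hours(seconds_per_epoch, epochs=EPOCHS):
    """Total training time in hours for a given per-epoch time."""
    return seconds_per_epoch * epochs / 3600.0


def seconds_per_epoch(total_days, epochs=EPOCHS):
    """Average per-epoch time in seconds for a run that took total_days."""
    return total_days * 86400.0 / epochs


# 300 s/epoch on a 1080Ti, as quoted above
print("1080Ti total hours:", total_hours(300))
# The first run in this thread took about 3 days on 2x Titan XP
print("3-day run, s/epoch:", seconds_per_epoch(3))
# The pdarts64 run took about 3 days, 16 hours
print("3d16h run, s/epoch:", seconds_per_epoch(3 + 16 / 24))
```

Expressed this way, 300 s/epoch corresponds to 50 hours in total, which matches the "about 2 days" figure, while the runs reported at the top of the thread were noticeably slower per epoch.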

D-X-Y commented 5 years ago

Hi @chenxin061, I'm reproducing your ImageNet results. I trained your model using the DARTS code; here are my training log and model file: https://drive.google.com/open?id=1br4IPnHCV-zUHJkEGXPwXnsl6288yhFy . The final accuracy is 73.92%. I double-checked our code, and the difference is that you use the cosine-decayed LR scheduler, while I use StepLR following DARTS. I use a batch size of 256, a starting LR of 0.1, and 8 GPUs, while you use a batch size of 1024 and a starting LR of 0.5. Did you try training your model with the StepLR scheduler, and if so, how was the performance?

drcdr commented 5 years ago

@D-X-Y I haven't tried Imagenet training yet. Am I reading/understanding this right; did your 250 epoch Imagenet training take 11 days, using 8 GPUs?! Also, it looked like you used the PDARTS genotype, so I guess you were trying to see how your run compared to the 24.4% top-1 test error number? (Also, I guess your batch-size-per-GPU was only 32?)

chenxin061 commented 5 years ago

@D-X-Y We did not try the StepLR scheduler for the PDARTS genotype. The results reported in our paper were obtained with the linear scheduler, and we also obtained similar test accuracy with the cosine scheduler. We are re-training the DARTS genotype with the linear and cosine schedulers and will later report the test accuracy here and in the next version of our paper.

D-X-Y commented 5 years ago

@drcdr Yes, 8 GPUs, batch-size-per-GPU was 32. I'm trying to get 24.4% top-1 test error.

@chenxin061 Thanks for your reply and also look forward to your results. I will also try DARTS using your training strategy after NIPS ddl :)

chenxin061 commented 5 years ago

@D-X-Y @drcdr Sorry for the late reply. An update of results on ImageNet of DARTS: Cosine scheduler: top1/top5 test error 25.3%/7.8%, Linear scheduler: top1/top5 test error 25.4%/8.0%.

D-X-Y commented 5 years ago

@chenxin061 Thanks for your results! I'm also training DARTS and other NAS models with cosine scheduler.

Margrate commented 5 years ago

I got an accuracy of 95.95% using PDARTS in genotypes.py without changing anything. (GPU: Tesla_V100-SXM2-32G)

chenxin061 commented 5 years ago

@Margrate Judging by the result, maybe you missed the option terms --cutout and/or --auxiliary.

Margrate commented 5 years ago

I ran it again with the option terms --cutout and --auxiliary added, and only got an accuracy of 97.01%.

chenxin061 commented 5 years ago

I think there must be some hidden difference. The expected valid acc is about 97.50 with the correct setting. You can also refer to issue #9, where the retraining valid_acc reported in the issue reached 97.52% at epoch 557.

arash-vahdat commented 5 years ago

It seems the genotype PDARTS in this line is different than the one reported in figure 3(c).

Can you confirm that the released genotype (above) was giving you 97.5%?

drcdr commented 5 years ago

@arash-vahdat For me, see the PDARTSAux96 line in the table above (from May16). My final error there was 2.56%, and the genotype that I was using was the following, which looks the same as what you are referencing:

PDARTS = Genotype(normal=[('skip_connect', 0), ('dil_conv_3x3', 1), ('skip_connect', 0),('sep_conv_3x3', 1), ('sep_conv_3x3', 1), ('sep_conv_3x3', 3), ('sep_conv_3x3',0), ('dil_conv_5x5', 4)], normal_concat=range(2, 6), reduce=[('avg_pool_3x3', 0), ('sep_conv_5x5', 1), ('sep_conv_3x3', 0), ('dil_conv_5x5', 2), ('max_pool_3x3', 0), ('dil_conv_3x3', 1), ('dil_conv_3x3', 1), ('dil_conv_5x5', 3)], reduce_concat=range(2, 6))

chenxin061 commented 5 years ago

@drcdr Thanks for the reproduction. @arash-vahdat The genotype in figure 3(c) is from another run done for the ablation study and is not the same as the genotype in genotypes.py. In our experiments, the one in genotypes.py got an average test acc of 97.50% over 3 runs, with the best run reaching 97.58%.

chenxin061 / pdarts

Reproducing the Results, and Questions #7