chenxin061 / pdarts

Codes for our paper "Progressive Differentiable Architecture Search:Bridging the Depth Gap between Search and Evaluation"

Reproducing the Results, and Questions #7

Open drcdr opened 5 years ago

drcdr commented 5 years ago

I am trying to reproduce the results of PDARTS, which looks like it provides awesome performance, congratulations!

Everything here is CIFAR-10. I didn't make any significant source-code mods; all other arguments are the default based on the repository on Apr 30. (I did hard-code directory names.)

Here are the labels for what I ran (Windows 10, PyTorch nightly from 4/30/2019, 2x Titan Xp):

1) PDARTS: Just train, rerunning the (default) PDARTS genotype in genotypes.py:

2) pdarts-BS64: Search and train, but using Batch-Size=64 since TitanXP is memory-limited.

Some Questions

1) The difference between my 2.76% and your 2.5% seems significant. Any ideas why this might be? Are you reporting best-val-error, or val-error-epoch-599? Are you reporting the best error over multiple runs, or just one run; or, the mean over N runs (if so, what's N)?
2) What is the idea behind 'Restricting skipconnect'?
3) How do my timings compare with what you (or others) might be getting on TitanXP cards? Are there any optimizations you suggest? I may try train_cifar.py with only one GPU next.
4) Is the 600 epochs and Cosine Annealing absolutely necessary to achieve the advertised CIFAR-10 performance? I see DARTS and derived papers use this. It's fantastic to now have fast NAS, but when the train is 3x the search...
5) Do you expect BS>=128 (if memory was available) would improve PDARTS even more? I actually was able to run BS=96 on the first pass (the 'num_to_keep' loop), but not the second pass. What are your thoughts on having a different BS per pass?
6) I see you say you got search to run in 12 hours on a 1080-Ti, BS=64. That's twice as fast as what I got. What was your command line?
7) Just wondering, have you considered trying the Cosine Power Annealing approach? See https://arxiv.org/abs/1903.09900, equation 2, as well as the discussion there about the benefits.

Well, that's enough questions for now, I appreciate your time and consideration.

For reference, here is a plot of the learning rate and validation error for these two runs. The bold line is the result of filtfilt with a window filter of length 25.

[Figure_1: learning rate and validation error for the two runs]

198808xc commented 5 years ago

Hi @drcdr, thank you for your interest in our work and for so many good questions. I will try to answer a few of them, and Xin will later comment on some technical details.

1/3. Will be answered by Xin. The difference between 2.76% and 2.50% is indeed fairly significant.

2. By restricting the number of skip-connects, we can make the searched architecture more stable. This operation is parameter-free, so we expect that too many of them can have negative effects in the real training stage. BTW, the architecture you searched has 5 skip-connects, which is the main reason for its poor performance.

5/6. BS seems to be an important issue for both speed and stability. We will try to run more experiments as soon as possible. Currently we have only run on our V100 GPUs and estimated the time on 1080Ti, which seems less accurate. Also, we have seen some evidence that suggests the importance of BS. We will provide some solutions for 12GB GPUs later.

4/7. For fair comparison, we did not change this setting. Also note that we only need 1 day (1 V100) to train CIFAR10/100 on a searched architecture. Maybe Xin knows more about why you need 3 days.

Further questions and comments are very welcome.
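As a rough illustration of the skip-connect point above, the sketch below counts how often each operation appears in a cell of a DARTS-style genotype. The `Genotype` namedtuple is redefined locally so the snippet runs on its own (an assumption that it mirrors the one in genotypes.py), the helper names `count_ops` and `count_skip_connects` are hypothetical, and the genotype is the default PDARTS one quoted later in this thread, not the searched pdarts-BS64 architecture.

```python
from collections import Counter, namedtuple

# Local stand-in for the Genotype type in genotypes.py (assumed to have these fields).
Genotype = namedtuple('Genotype', 'normal normal_concat reduce reduce_concat')

# Default PDARTS genotype as posted in this thread.
PDARTS = Genotype(
    normal=[('skip_connect', 0), ('dil_conv_3x3', 1), ('skip_connect', 0),
            ('sep_conv_3x3', 1), ('sep_conv_3x3', 1), ('sep_conv_3x3', 3),
            ('sep_conv_3x3', 0), ('dil_conv_5x5', 4)],
    normal_concat=range(2, 6),
    reduce=[('avg_pool_3x3', 0), ('sep_conv_5x5', 1), ('sep_conv_3x3', 0),
            ('dil_conv_5x5', 2), ('max_pool_3x3', 0), ('dil_conv_3x3', 1),
            ('dil_conv_3x3', 1), ('dil_conv_5x5', 3)],
    reduce_concat=range(2, 6),
)


def count_ops(cell):
    """Return a Counter of operation names for a list of (op_name, input_index) pairs."""
    return Counter(op for op, _ in cell)


def count_skip_connects(genotype):
    """Number of skip_connect operations in the normal cell (hypothetical helper)."""
    return count_ops(genotype.normal)['skip_connect']


print('normal cell ops:', dict(count_ops(PDARTS.normal)))
print('skip-connects in normal cell:', count_skip_connects(PDARTS))
```

Running the same count on a freshly searched genotype is a quick way to check whether it ended up with many parameter-free skip-connects, such as the 5 mentioned above.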

chenxin061 commented 5 years ago

Hi @drcdr, thanks for the comments. The following are some technical details of our experiments.

  1. Thanks for noticing the --cutout term. As mentioned in the README file, you should also add the --auxiliary term to enable an auxiliary loss tower. Besides, we use a single GPU for the evaluation. As far as I know, you may get a different test accuracy when training with more than one GPU. Our 2.50% test error is an average of 3 runs, among which the best one is 2.42%.
  3. About 40 hours on a single P100 and 24 hours on a single V100. My suggestion is to do the training on a single GPU.
  6. The 12-hour search time on a 1080-Ti was our estimate based on previous experiments. My colleague told me that he finished the search process on a 1080-Ti within 7 hours, and the README file has been updated accordingly. The command line is exactly the same as the one in the README file. I notice that your OS is Windows, which is a significant difference between your environment and ours.
  7. I did not try cosine power annealing, but I tried a 1000-epoch cosine annealing schedule, which further boosts the performance. This may answer the question about the necessity of the 600-epoch schedule.

Hope the above is helpful.

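To make the 600-epoch versus 1000-epoch comparison concrete, here is a minimal sketch of plain cosine annealing without restarts (not the cosine power annealing variant from the linked paper). The base learning rate of 0.025 and the minimum of 0 are assumptions for illustration and are not stated in this thread.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr=0.025, min_lr=0.0):
    """Cosine annealing: base_lr at epoch 0, decaying smoothly to min_lr at total_epochs.

    base_lr and min_lr are illustrative defaults, not values confirmed by the authors.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for total in (600, 1000):
    # Sample the schedule at the start, at epoch 300, and at the final epoch.
    samples = [cosine_lr(e, total) for e in (0, 300, total)]
    print(total, ['%.5f' % lr for lr in samples])
```

With the longer schedule the learning rate at any given epoch is higher, so the network spends more epochs at every learning-rate level before the rate reaches its minimum.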
drcdr commented 5 years ago

Hi @198808xc, @chenxin061 - thanks for the great, detailed responses. Based on this, I'll do some more investigation on my side, it may take a week or two - then, I'll follow up with what I find. Thanks

drcdr commented 5 years ago

I have a related question: how are you calculating the number of trainable parameters in the model? I wrote a quick utility, and I matched your 3.4M number for PDARTS when --auxiliary is False, but I get a higher number (3.91M) when --auxiliary is True:

Arch=   RESNET-50   # Parameters = 25557032
Arch=      PDARTS   Auxiliary=False  #Parameters = 3433798
Arch=      PDARTS   Auxiliary= True  #Parameters = 3910224
Arch=    pdarts64   Auxiliary=False  #Parameters = 2800270
Arch=    pdarts64   Auxiliary= True  #Parameters = 3276696

import torch
from torch import nn
from torchvision import models
import genotypes
import collections
from model import NetworkCIFAR as Network

# https://stackoverflow.com/questions/49201236/check-the-total-number-of-parameters-in-a-pytorch-model
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())  # all of the params
    #return sum(p.numel() for p in model.parameters() if p.requires_grad)  # the trainable params

# just for reference
a = models.resnet50(pretrained=False)
count = count_parameters(a)
print ('Arch=%12s   # Parameters = %d' % ('RESNET-50', count))

# Minimal stand-in for the command-line arguments that the training script would provide
Args = collections.namedtuple('Args', ['init_channels', 'layers', 'save'])
args = Args(init_channels=36, layers=20, save='.')
CIFAR_CLASSES = 10

for arch in ['PDARTS', 'pdarts_bs64']:
    for auxiliary in [False, True]:
        # Look up the genotype by name (avoids eval on a string)
        genotype = getattr(genotypes, arch)
        model = Network(args.init_channels, CIFAR_CLASSES, args.layers, auxiliary, genotype)
        count = count_parameters(model)
        print('Arch=%12s   Auxiliary=%5s  #Parameters = %d' % (arch, auxiliary, count))

chenxin061 commented 5 years ago

@drcdr Yes, if the auxiliary tower is included, the parameter count will be larger. However, the auxiliary tower is only used during training, not testing, so we do not count those extra parameters for the testing phase. In fact, you will get the same test accuracy without the --auxiliary term, although you need to modify a few lines in the model-loading code to handle its absence.
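One way to report the test-time parameter count from a model built with the auxiliary tower is to skip parameters by name prefix. The sketch below is a hedged illustration: the `exclude_prefixes` argument is hypothetical, and the attribute name `auxiliary_head` in the usage note is an assumption about model.py that should be checked against the actual code. The function only relies on `named_parameters()`, which torch.nn.Module provides.

```python
def count_parameters(model, exclude_prefixes=()):
    """Count parameter elements, skipping names that start with an excluded prefix.

    `model` can be any object exposing named_parameters() the way
    torch.nn.Module does, yielding (name, tensor) pairs.
    """
    if isinstance(exclude_prefixes, str):
        # Accept a single prefix passed as a plain string.
        excluded = (exclude_prefixes,)
    else:
        excluded = tuple(exclude_prefixes)
    return sum(
        p.numel()
        for name, p in model.named_parameters()
        if not name.startswith(excluded)
    )


# Counts reported earlier in this thread, with and without the auxiliary tower.
reported = {
    'PDARTS': {'with_aux': 3910224, 'without_aux': 3433798},
    'pdarts64': {'with_aux': 3276696, 'without_aux': 2800270},
}
aux_sizes = {name: c['with_aux'] - c['without_aux'] for name, c in reported.items()}
print(aux_sizes)
```

For a real network this would be called as `count_parameters(model, exclude_prefixes=('auxiliary_head',))`, assuming that is the attribute name. The reported numbers differ by 476,426 parameters for both genotypes, which is consistent with the extra parameters coming from an auxiliary tower of fixed size.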

drcdr commented 5 years ago

well, for some reason PyTorch crashed at iteration #551, with --auxiliary. Trying to figure out if warm restarts can be easily implemented. Looks like just CosineAnnealingLR() and torch.optim.SGD() would be affected (as well as torch.load'ing the checkpoint, and setting up the model from the state_dict)?

chenxin061 commented 5 years ago

Yes, you can resume the training from the checkpoint saved in the --save path, just as you described above.
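A minimal sketch of such a resume, assuming you save the model, optimizer, and scheduler state dicts together with the epoch number yourself. What the repository's scripts actually write to the --save path is not confirmed here. The toy `nn.Linear` model, the hyperparameters, and the in-memory buffer are stand-ins for the real network, settings, and checkpoint file.

```python
import io

import torch
from torch import nn


def build(epochs=600, lr=0.025):
    """Create a toy model with the optimizer and scheduler types named above."""
    model = nn.Linear(4, 2)  # stand-in for the real network
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=3e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return model, optimizer, scheduler


# Train for a few epochs, then write a checkpoint.
model, optimizer, scheduler = build()
for epoch in range(5):
    loss = model(torch.randn(8, 4)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()

buffer = io.BytesIO()  # in-memory stand-in for a checkpoint file on disk
torch.save({
    'epoch': 5,  # number of completed epochs
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'scheduler': scheduler.state_dict(),
}, buffer)

# Later (e.g. after a crash): rebuild the objects and restore their state.
buffer.seek(0)
checkpoint = torch.load(buffer)
model2, optimizer2, scheduler2 = build()
model2.load_state_dict(checkpoint['model'])
optimizer2.load_state_dict(checkpoint['optimizer'])
scheduler2.load_state_dict(checkpoint['scheduler'])
start_epoch = checkpoint['epoch']  # continue the loop with range(start_epoch, epochs)
print(start_epoch, optimizer2.param_groups[0]['lr'])
```

Restoring the optimizer state keeps the SGD momentum buffers, and restoring the scheduler state keeps the cosine schedule at the right position instead of restarting it from the initial learning rate.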

drcdr commented 5 years ago

@chenxin061 OK, here's an update (thanks for your feedback). Modifications:

Results:

[Image: im1]

chenxin061 commented 5 years ago

@drcdr I think the experimental results you got when evaluating on CIFAR10 are acceptable.

D-X-Y commented 5 years ago

Hi @chenxin061, I'm reproducing your ImageNet results. I trained your model using the DARTS code; here are my training log and model file: https://drive.google.com/open?id=1br4IPnHCV-zUHJkEGXPwXnsl6288yhFy . The final accuracy is 73.92%. I double-checked our code: the difference is that you use the cosine-decayed LR scheduler, while I use StepLR, following DARTS. I use a batch size of 256, a starting LR of 0.1, and 8 GPUs, while you use a batch size of 1024 and a starting LR of 0.5. Did you try training your model with the StepLR scheduler, and if so, how did it perform?

drcdr commented 5 years ago

@D-X-Y I haven't tried Imagenet training yet. Am I reading/understanding this right; did your 250 epoch Imagenet training take 11 days, using 8 GPUs?! Also, it looked like you used the PDARTS genotype, so I guess you were trying to see how your run compared to the 24.4% top-1 test error number? (Also, I guess your batch-size-per-GPU was only 32?)

chenxin061 commented 5 years ago

@D-X-Y We did not try the StepLR scheduler for the PDARTS genotype. The results reported in our paper were obtained with the linear scheduler, and we also obtained similar test accuracy with the cosine scheduler. We are re-training the DARTS genotype with the linear and cosine schedulers and will later report the test accuracy here and in the next version of our paper.
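For readers comparing the schedulers discussed here, the following sketch contrasts a step decay with a linear decay. The starting learning rates (0.1 and 0.5) and the 250-epoch length come from this thread. The per-epoch decay factor of 0.97 and the assumption that the linear schedule decays to zero at the final epoch are illustrative guesses, not settings confirmed by either party.

```python
def step_lr(epoch, base_lr, gamma, step_size=1):
    """Step decay: multiply the learning rate by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def linear_lr(epoch, total_epochs, base_lr):
    """Linear decay from base_lr at epoch 0 to zero at total_epochs (assumed shape)."""
    return base_lr * (1.0 - epoch / total_epochs)


TOTAL_EPOCHS = 250
for epoch in (0, 50, 125, 249):
    print(
        epoch,
        'step: %.5f' % step_lr(epoch, base_lr=0.1, gamma=0.97),  # gamma is an assumption
        'linear: %.5f' % linear_lr(epoch, TOTAL_EPOCHS, base_lr=0.5),
    )
```

The two runs differ in batch size and starting learning rate as well as in schedule shape, so a gap in final accuracy cannot be attributed to the scheduler alone from these numbers.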

D-X-Y commented 5 years ago

@drcdr Yes, 8 GPUs, batch-size-per-GPU was 32. I'm trying to get 24.4% top-1 test error.

@chenxin061 Thanks for your reply and also look forward to your results. I will also try DARTS using your training strategy after NIPS ddl :)

chenxin061 commented 5 years ago

@D-X-Y @drcdr Sorry for the late reply. An update of results on ImageNet of DARTS: Cosine scheduler: top1/top5 test error 25.3%/7.8%, Linear scheduler: top1/top5 test error 25.4%/8.0%.

D-X-Y commented 5 years ago

@chenxin061 Thanks for your results! I'm also training DARTS and other NAS models with cosine scheduler.

Margrate commented 5 years ago

I got 95.95% accuracy using the PDARTS genotype in genotypes.py without changing anything. (GPU: Tesla_V100-SXM2-32G)

chenxin061 commented 5 years ago

@Margrate Maybe you missed option terms --cutout and/or --auxiliary according to the result.

Margrate commented 5 years ago

> @Margrate Maybe you missed option terms --cutout and/or --auxiliary according to the result.

I ran it again with the --cutout and --auxiliary options and got 97.01% accuracy.

chenxin061 commented 5 years ago

> > @Margrate Maybe you missed option terms --cutout and/or --auxiliary according to the result.
>
> I ran it again with the --cutout and --auxiliary options and got 97.01% accuracy.

I think there must be some hidden difference. The expected valid acc is about 97.50% with the correct setting. You can also refer to issue #9, where the retraining valid_acc reported in the issue reached 97.52% at epoch 557.

arash-vahdat commented 4 years ago

It seems the genotype PDARTS in this line is different than the one reported in figure 3(c).

Can you confirm that the released genotype (above) was giving you 97.5%?

drcdr commented 4 years ago

@arash-vahdat For me, see the PDARTSAux96 line in the table above (from May16). My final error there was 2.56%, and the genotype that I was using was the following, which looks the same as what you are referencing:

PDARTS = Genotype(
    normal=[('skip_connect', 0), ('dil_conv_3x3', 1), ('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 1), ('sep_conv_3x3', 3), ('sep_conv_3x3', 0), ('dil_conv_5x5', 4)],
    normal_concat=range(2, 6),
    reduce=[('avg_pool_3x3', 0), ('sep_conv_5x5', 1), ('sep_conv_3x3', 0), ('dil_conv_5x5', 2), ('max_pool_3x3', 0), ('dil_conv_3x3', 1), ('dil_conv_3x3', 1), ('dil_conv_5x5', 3)],
    reduce_concat=range(2, 6),
)

chenxin061 commented 4 years ago

@drcdr Thanks for the reproduction. @arash-vahdat The genotype in figure 3(c) is from another run for the ablation study and is not the same as the genotype in genotypes.py. In our experiments, the one in genotypes.py got an average test acc of 97.50% over 3 runs, with the best run reaching 97.58%.