Hi @drcdr, thank you for your interest in our work and for so many good questions. I will try to answer a few of them, and Xin will comment on some technical details later.
1/3. These will be answered by Xin. The difference between 2.76% and 2.50% is indeed somewhat significant.
5/6. Batch size (BS) seems to be an important issue for both speed and stability. We will run more experiments as soon as possible. Currently we have only run on our V100 GPUs and estimated the time on a 1080Ti, which seems less accurate. Also, we have seen some evidence suggesting the importance of BS. We will provide solutions for 12GB GPUs later.
4/7. For a fair comparison, we did not change this setting. Another point: we only need 1 day (on 1 V100) to train CIFAR-10/100 on a searched architecture. Maybe Xin knows more about why you need 3 days.
You are very welcome to post further questions and comments.
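One possible stopgap for 12GB cards, though not necessarily the solution the authors have in mind, is gradient accumulation in the evaluation script: run several small batches per optimizer step so the effective batch size matches the larger one. (The bilevel update in the search stage makes this less straightforward there.) A minimal sketch with placeholder arguments; note the repo's NetworkCIFAR returns (logits, logits_aux), so the forward and loss lines would need the auxiliary term added:

from torch import nn

def train_epoch_with_accumulation(model, train_loader, criterion, optimizer,
                                  accum_steps=2, grad_clip=5.0, device='cuda'):
    # Emulate a batch size of accum_steps * loader_batch_size by accumulating
    # gradients over accum_steps small batches before each optimizer step.
    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(train_loader):
        logits = model(x.to(device))                           # generic model returning plain logits
        loss = criterion(logits, y.to(device)) / accum_steps   # scale so the accumulated gradient is an average
        loss.backward()
        if (step + 1) % accum_steps == 0:
            nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
            optimizer.step()
            optimizer.zero_grad()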
Hi @drcdr, thanks for the comments. The following are some technical details of our experiments.
In addition to the --cutout term, as mentioned in the README file, you should add the --auxiliary term to enable an auxiliary loss tower. Besides, we use a single GPU for evaluation; as far as I know, you may get a different test accuracy when training with more than one GPU. Our 2.50% test error is the average of 3 runs, of which the best is 2.42%.

Hi @198808xc, @chenxin061 - thanks for the great, detailed responses. Based on this, I'll do some more investigation on my side; it may take a week or two, and then I'll follow up with what I find. Thanks!
I have a related question: how are you calculating the number of trainable parameters in the model? I wrote a quick utility, and I matched your 3.4M number for PDARTS when --auxiliary is False, but I get a higher number (3.91M) when --auxiliary is True:
Arch= RESNET-50 # Parameters = 25557032
Arch= PDARTS Auxiliary=False #Parameters = 3433798
Arch= PDARTS Auxiliary= True #Parameters = 3910224
Arch= pdarts64 Auxiliary=False #Parameters = 2800270
Arch= pdarts64 Auxiliary= True #Parameters = 3276696
import torch
from torch import nn
from torchvision import models
import genotypes
import collections
from model import NetworkCIFAR as Network

# https://stackoverflow.com/questions/49201236/check-the-total-number-of-parameters-in-a-pytorch-model
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())  # all of the params
    # return sum(p.numel() for p in model.parameters() if p.requires_grad)  # only the trainable params

# just for reference
a = models.resnet50(pretrained=False)
count = count_parameters(a)
print('Arch=%12s # Parameters = %d' % ('RESNET-50', count))

# namedtuple class used as a simple attribute holder for the args NetworkCIFAR expects
args = collections.namedtuple('Args', ['init_channels', 'layers', 'arch', 'auxiliary', 'save'])
args.init_channels = 36
args.layers = 20
args.save = '.'
CIFAR_CLASSES = 10

for arch in ['PDARTS', 'pdarts64']:
    for auxiliary in [False, True]:
        genotype = eval("genotypes.%s" % arch)
        model = Network(args.init_channels, CIFAR_CLASSES, args.layers, auxiliary, genotype)
        count = count_parameters(model)
        print('Arch=%12s Auxiliary=%5s #Parameters = %d' % (arch, auxiliary, count))
@drcdr Yes, if the auxiliary tower is included, the parameter count will be larger. However, the auxiliary tower is only used during network training, not testing, so we do not count those extra parameters for the testing phase. In fact, you will get the same test accuracy without the --auxiliary term; you just need to modify a few lines in the model-loading part to account for its absence.
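For anyone who wants the deployed parameter count from a model that was built with the auxiliary tower, one option is simply to exclude those parameters when counting. A minimal sketch, assuming the model stores the tower as auxiliary_head (the attribute name used in the DARTS-style model.py):

def count_parameters_no_aux(model):
    # Collect the ids of the auxiliary tower's parameters, if the tower exists,
    # and count everything else.
    aux_ids = set()
    if hasattr(model, 'auxiliary_head'):
        aux_ids = {id(p) for p in model.auxiliary_head.parameters()}
    return sum(p.numel() for p in model.parameters() if id(p) not in aux_ids)

With the counts above, this should bring the Auxiliary=True rows back to the corresponding Auxiliary=False figures.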
Well, for some reason PyTorch crashed at iteration #551 with --auxiliary. I'm trying to figure out whether warm restarts can be easily implemented. It looks like just CosineAnnealingLR() and torch.optim.SGD() would be affected (as well as torch.load'ing the checkpoint and setting up the model from the state_dict)?
Yes, you can resume training from the checkpoint saved in the --save path, just as you described above.
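For reference, a generic resume sketch for SGD + CosineAnnealingLR; the filenames and checkpoint layout here are illustrative choices, not necessarily what the repo's scripts write. Save the epoch, model, optimizer, and scheduler state together, then restore all three so the scheduler resumes at the right point on the cosine curve.

import torch

def save_checkpoint(path, epoch, model, optimizer, scheduler):
    # Store everything needed to resume: weights, optimizer momentum buffers,
    # and the scheduler's position in the schedule.
    torch.save({'epoch': epoch,
                'model': model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'scheduler': scheduler.state_dict()}, path)

def load_checkpoint(path, model, optimizer, scheduler, device='cuda'):
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt['model'])
    optimizer.load_state_dict(ckpt['optimizer'])
    scheduler.load_state_dict(ckpt['scheduler'])
    return ckpt['epoch'] + 1  # first epoch to run after resuming

If only a weights file is available, an alternative is to rebuild the optimizer and scheduler from scratch and step the scheduler once per already-completed epoch before continuing.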
@chenxin061 OK, here's an update (thanks for your feedback). Modifications: I enabled --auxiliary (the earlier run used auxiliary == false); since there was not enough memory for BS=128 with --auxiliary, I dropped down to BS=96. Results: the PDARTSAux96 run (with --auxiliary, BS=96) reached a final test error of 2.56%.

@drcdr I think the experimental results you got on evaluating CIFAR-10 are acceptable.
Hi @chenxin061, I'm reproducing your ImageNet results. I trained your model based on the DARTS code; here are my training log and model file: https://drive.google.com/open?id=1br4IPnHCV-zUHJkEGXPwXnsl6288yhFy . The final accuracy is 73.92%. I double-checked our code; the difference is that you use a cosine-decayed LR scheduler, while I use StepLR following DARTS. I use a batch size of 256, a starting LR of 0.1, and 8 GPUs, while you use a batch size of 1024 and a starting LR of 0.5. Did you try training your model with the StepLR scheduler, and how is the performance?
@D-X-Y I haven't tried ImageNet training yet. Am I reading this right: did your 250-epoch ImageNet training take 11 days using 8 GPUs?! Also, it looks like you used the PDARTS genotype, so I guess you were trying to see how your run compared to the 24.4% top-1 test error number? (Also, I guess your batch size per GPU was only 32?)
@D-X-Y We did not try the StepLR scheduler with the PDARTS genotype. The results reported in our paper were obtained with the linear scheduler, and we also obtained similar test accuracy with the cosine scheduler. We are re-training the DARTS genotype with the linear and cosine schedulers and will report the test accuracy here and in the next version of our paper.
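For anyone comparing the schedules, the linear decay can be expressed with LambdaLR if your PyTorch version has no dedicated linear scheduler. Below is a sketch of the three schedules being discussed, each attached to its own toy optimizer so the example is self-contained; the hyper-parameters are placeholders, not the exact ImageNet settings from either paper.

from torch import nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR, LambdaLR

epochs = 250

def toy_optimizer(lr):
    return SGD(nn.Linear(8, 8).parameters(), lr=lr, momentum=0.9)

opt_step, opt_cos, opt_lin = toy_optimizer(0.1), toy_optimizer(0.5), toy_optimizer(0.5)
sched_step = StepLR(opt_step, step_size=30, gamma=0.1)       # DARTS-style step decay
sched_cos = CosineAnnealingLR(opt_cos, T_max=epochs)         # cosine decay toward 0
sched_lin = LambdaLR(opt_lin, lambda e: 1.0 - e / epochs)    # linear decay toward 0

for epoch in range(epochs):
    # ... one training epoch per schedule would go here ...
    for sched in (sched_step, sched_cos, sched_lin):
        sched.step()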
@drcdr Yes, 8 GPUs, batch-size-per-GPU was 32. I'm trying to get 24.4% top-1 test error.
@chenxin061 Thanks for your reply; I look forward to your results. I will also try DARTS with your training strategy after the NIPS deadline :)
@D-X-Y @drcdr Sorry for the late reply. An update on the ImageNet results for DARTS: cosine scheduler: top-1/top-5 test error 25.3%/7.8%; linear scheduler: top-1/top-5 test error 25.4%/8.0%.
@chenxin061 Thanks for your results! I'm also training DARTS and other NAS models with the cosine scheduler.
I got 95.95% accuracy using the PDARTS genotype in genotypes.py without changing anything. (GPU: Tesla V100-SXM2-32G)
@Margrate Maybe you missed the option terms --cutout and/or --auxiliary, judging from the result.
I ran it again with the --cutout and --auxiliary option terms and got 97.01% accuracy.
I think there must be some hidden difference. The expected valid accuracy is about 97.50% with the correct settings. You can also refer to issue #9, where the reported retraining valid_acc reached 97.52% at epoch 557.
It seems the PDARTS genotype in this line is different from the one reported in Figure 3(c).
Can you confirm that the released genotype (above) was giving you 97.5%?
@arash-vahdat For me, see the PDARTSAux96 line in the table above (from May 16). My final error there was 2.56%, and the genotype I was using was the following, which looks the same as the one you are referencing:
PDARTS = Genotype(normal=[('skip_connect', 0), ('dil_conv_3x3', 1), ('skip_connect', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 1), ('sep_conv_3x3', 3), ('sep_conv_3x3', 0), ('dil_conv_5x5', 4)], normal_concat=range(2, 6), reduce=[('avg_pool_3x3', 0), ('sep_conv_5x5', 1), ('sep_conv_3x3', 0), ('dil_conv_5x5', 2), ('max_pool_3x3', 0), ('dil_conv_3x3', 1), ('dil_conv_3x3', 1), ('dil_conv_5x5', 3)], reduce_concat=range(2, 6))
@drcdr Thanks for the reproduction.
@arash-vahdat The genotype in Figure 3(c) is from another run for an ablation study and is not the same as the genotype in genotypes.py. In our experiments, the one in genotypes.py got an average test accuracy of 97.50% over 3 runs, with the best run at 97.58%.
I am trying to reproduce the results of PDARTS, which looks like it provides awesome performance, congratulations!
Everything here is CIFAR-10. I didn't make any significant source-code mods; all other arguments are the default based on the repository on Apr 30. (I did hard-code directory names.)
Here are the labels for what I ran (Windows 10, PyTorch nightly from 4/30/2019, 2x Titan XP): 1) PDARTS: just train, rerunning the (default) PDARTS genotype in genotypes.py:
python train_cifar.py --cutout
2) pdarts-BS64: search and train, but using batch size 64 since the Titan XP is memory-limited.
python train_search.py --add_layers 6 --add_layers 12 --dropout_rate 0.1 --dropout_rate 0.4 --dropout_rate 0.7 --batch_size 64
pdarts64 = Genotype(normal=[('skip_connect', 0), ('sep_conv_3x3', 1), ('skip_connect', 0), ('sep_conv_3x3', 1), ('skip_connect', 0), ('sep_conv_3x3', 2), ('skip_connect', 0), ('dil_conv_5x5', 4)], normal_concat=range(2, 6), reduce=[('avg_pool_3x3', 0), ('avg_pool_3x3', 1), ('skip_connect', 1), ('sep_conv_5x5', 2), ('avg_pool_3x3', 0), ('dil_conv_3x3', 2), ('avg_pool_3x3', 0), ('dil_conv_3x3', 3)], reduce_concat=range(2, 6))
python train_cifar.py --arch=pdarts64 --cutout
Some questions:
1) The difference between my 2.76% and your 2.5% seems significant. Any ideas why this might be? Are you reporting best-val-error or val-error at epoch 599? Are you reporting the best error over multiple runs, just one run, or the mean over N runs (if so, what is N)?
2) What is the idea behind 'restricting skip-connect'?
3) How do my timings compare with what you (or others) might be getting on Titan XP cards? Are there any optimizations you suggest? I may try train_cifar.py with only one GPU next.
4) Are the 600 epochs and cosine annealing absolutely necessary to achieve the advertised CIFAR-10 performance? I see DARTS and derived papers use this. It's fantastic to now have fast NAS, but when the training is 3x the search...
5) Do you expect BS>=128 (if memory were available) would improve PDARTS even more? I actually was able to run BS=96 on the first pass (the 'num_to_keep' loop), but not the second pass. What are your thoughts on having a different BS per pass?
6) I see you say you got search to run in 12 hours on a 1080-Ti with BS=64. That's twice as fast as what I got. What was your command line?
7) Just wondering, have you considered trying the cosine power annealing approach? See https://arxiv.org/abs/1903.09900, equation 2, as well as the discussion there about its benefits. (A rough sketch of the idea follows at the end of this post.)
Well, that's enough questions for now. I appreciate your time and consideration.
For reference, here is a plot of the learning rate and validation error for these two runs. The bold line is the result of filtfilt with a window filter of length 25.
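Regarding question 7 above: here is a rough sketch of the power-skewed cosine idea, written from my reading of the general approach rather than the paper's exact Eq. 2 (see https://arxiv.org/abs/1903.09900 for the precise form); the learning rate and epoch count below are placeholders.

import math
from torch import nn
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

def cosine_power_factor(epoch, total_epochs, p=10.0):
    # Multiplicative LR factor: a cosine curve pushed through an exponential of base p.
    cos = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))  # plain cosine factor in [0, 1]
    if p == 1.0:
        return cos                          # p = 1 reduces to plain cosine annealing
    return (p ** cos - 1.0) / (p - 1.0)     # p > 1 spends more of the run at lower LR

# Toy usage with a placeholder model, LR, and epoch count:
optimizer = SGD(nn.Linear(8, 8).parameters(), lr=0.025, momentum=0.9)
scheduler = LambdaLR(optimizer, lambda e: cosine_power_factor(e, total_epochs=600))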