@ildoonet
- How can I reproduce your result? Specifically, with your provided code and sample commands, I should be able to reproduce the 14.23% top-1 error with PyramidNet + CutMix. It would be great if you could provide the exact environment and command to reproduce the result, or perhaps this will help you find a problem in this repo.
We use PyTorch 1.0.0 and Tesla P40 GPUs. The paper's experiments were conducted on our cloud system (NSML), so I recently re-ran the code in this repo on a local machine for CIFAR100 and ImageNet. I got slightly lower performance on CIFAR100 (top-1 error 14.5~14.6, similar to your report) but better performance on ImageNet (top-1 error 21.4). One possible reason is the difference between the cloud system and the local machines. Note that the result (top-1 error 14.5 on CIFAR100) is still much better than the important baselines (Cutout, Mixup, etc.). In the camera-ready version of our paper, we may update the numbers to 14.5 on CIFAR100 and 21.4 on ImageNet for better reproducibility on local machines.
- Did you use the 'last validation accuracy' after training or the 'best (peak) validation accuracy' during training? I saw some code tracking the best validation accuracy during training and printing that value before terminating, so I assume you used the 'best (peak) validation accuracy'.
As you can see in the code, we choose the best validation accuracy.
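Concretely, the selection logic looks roughly like this (a minimal sketch, not the repo's exact `train.py`; `train_one_epoch` and `validate` are placeholder functions):

```python
import torch

def fit_and_report_best(model, train_one_epoch, validate, num_epochs):
    """Track the best top-1 validation accuracy observed during training."""
    best_acc1 = 0.0
    for epoch in range(num_epochs):
        train_one_epoch(model, epoch)          # one pass over the training set
        acc1 = validate(model)                 # top-1 accuracy on the validation set
        if acc1 > best_acc1:
            best_acc1 = acc1                   # remember the peak value seen so far
            torch.save(model.state_dict(), 'model_best.pth.tar')
    print('Best top-1 accuracy: {:.2f}'.format(best_acc1))  # the value reported at the end
    return best_acc1
```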
Thanks!
@hellbell Thanks. Then I guess this reproducibility issue does not come from the environment.
I wonder whether it is right to use the best validation accuracy. As you can see, the converged model's accuracy is slightly lower than the best one, and it is hard to be sure that the best accuracy represents the model's true performance. When I worked on Fast AutoAugment, I used the converged value instead of the momentarily peaked value, and as far as I know, AutoAugment measures performance in the same way.
Anyway, thanks for the clarification.
@ildoonet I agree with your reply on some points, and it is worth looking at final (i.e., converged) performance comparisons. But our paper also reports the best performance of the other algorithms, which we re-implemented, for a fair comparison. Only the few methods we could not reproduce are reported with their original papers' scores. Anyway, thank you for the constructive comments!
@ildoonet For clarification and further discussion, I am re-opening this issue.
The baseline gives a top-1 error similar to what your paper reports (16.45), but with CutMix (p=0.5) the result is somewhat worse than the reported value (14.23).
I ran our code on CIFAR100 three more times, and we got:

| run | top-1 error at epoch 300 | best top-1 error |
|---|---|---|
| try 1 | 14.78 | 14.23 |
| try 2 | 15.44 | 14.50 |
| try 3 | 15.00 | 14.68 |
| average | 15.07 | 14.47 |
Also, for ImageNet-1K, we got:

| run | top-1 error at epoch 300 | best top-1 error |
|---|---|---|
| try 1 | 21.20 | 21.19 |
| try 2 | 21.61 | 21.61 |
| try 3 | 21.40 | 21.40 |
| average | 21.403 | 21.400 |
Interestingly, we got the best performance near the last epoch of training.
I wonder whether it is right to use the best validation accuracy. As you can see, the converged model's accuracy is slightly lower than the best one, and it is hard to be sure that the best accuracy represents the model's true performance.
For the ImageNet-1K task, many methods report their best validation accuracy during training because they cannot access the test set. Of course, for clarification we will add a statement to our final paper that we report the best performance observed during training. We evaluated on the CIFAR datasets using the same evaluation strategy, and we tried our best to reproduce the baselines (Mixup, Cutout, and so on) and report their best performance for a fair comparison.
But I have a question about the 'true performance' you mentioned. I'm not sure that reporting the last epoch's performance is the only way to represent a model's true performance, because the model can fluctuate at the end of training and we cannot guarantee that it has converged by the last epoch. Therefore, researchers usually train models and pick the best one by validating on the validation set. In short, we choose the best model to represent the method's performance, and I think both approaches, selecting the best model or the last model, make sense for evaluating trained models. Still, your comments about the best and last models are well worth considering for future work.
When I worked on Fast AutoAugment, I used the converged value instead of the momentarily peaked value, and as far as I know, AutoAugment measures performance in the same way.
First of all, Fast AutoAugment is nice work!
My guess is that Fast AutoAugment and AutoAugment use cosine learning rate decay, so they fluctuate less at the end of training and the best and the last performance end up similar.
I recently found that CutMix + a cosine learning rate schedule works well on the CIFAR dataset, so we will report both the best and the last performance when using the cosine schedule. I hope the gap between the best and the last models will be smaller than with the current training scheme.
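For concreteness, the two schedules being compared can be set up like this (an illustrative sketch with assumed hyper-parameters such as lr=0.25 and milestones [150, 225], not the exact settings from the paper):

```python
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the real network (e.g. PyramidNet)
optimizer = torch.optim.SGD(model.parameters(), lr=0.25, momentum=0.9, weight_decay=1e-4)

# Step decay: the learning rate drops sharply at fixed epochs, so the validation
# accuracy can still move noticeably near the end of training.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 225], gamma=0.1)

# Cosine decay (use instead of the step scheduler): the learning rate shrinks
# smoothly toward zero by the final epoch, so the best and the last checkpoints
# tend to be much closer.
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... train for one epoch, validate ...
    scheduler.step()
```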
I guess that if you report both the best and the converged accuracy, it will be okay. Reporting only a momentarily peaked value can be seen as cheating or over-fitting to the validation set.
But as you say, the 'true performance' of a model is hard to measure even if we have a held-out set used only for testing. Also, I trained many models with a cosine learning rate and there are similar gaps there as well.
Anyway, thanks for your consideration and the detailed explanations. This gives me a lot to think about.
@ildoonet Thank you for your reply. I understand your concerns, but I don't agree that reporting the best performance is cheating. As I said, the best model can certainly be treated as representing the performance of the method. The difference between the best and the last model comes from the step-decay learning rate; in our case, using a cosine learning rate on CIFAR100, the best and the last models are almost the same (within ±0.1% accuracy). All the experiments we re-implemented were conducted in the same setting and the best model is selected for every other method, so there is no cheating or fair-comparison issue. Our best model's performance is not a momentarily peaked value, because we ran the experiments several times and report the mean of the best performances.
If choosing the best performance is cheating, then many people are cheating, so I don't agree with your point. Rather, @hellbell is fairly correct. Thanks for the interesting work.
'Cheating' is a rather harsh word. However, comparing the peak value indeed benefits oscillating and risky methods.
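To illustrate what I mean, here is a tiny simulation (made-up numbers, not results from any real run): two methods plateau at the same accuracy, but the noisier one gains more from reporting its per-run peak.

```python
import numpy as np

rng = np.random.default_rng(0)

def peak_vs_last(noise, runs=1000, epochs=50, plateau=85.0):
    """Validation accuracy that has plateaued, with per-epoch noise of size `noise`."""
    acc = plateau + rng.normal(0.0, noise, size=(runs, epochs))
    return acc.max(axis=1).mean(), acc[:, -1].mean()

for noise in (0.1, 0.5):
    peak, last = peak_vs_last(noise)
    print('noise {:.1f}: mean peak {:.2f}, mean last {:.2f}'.format(noise, peak, last))
# The last-epoch mean stays near 85.0 in both cases, while the mean peak is biased
# upward, and the bias grows with the oscillation; this is why peak reporting
# favours noisier methods.
```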
I deeply apologize for the misunderstanding caused by my poor English and poor word choice. CutMix has inspired me a lot and helped my research a lot.
I have run your code 5 times in the environment below.
The command I used is this:
For the baseline, I set cutmix_prob=0.0 so that CutMix is not used.
The baseline gives a top-1 error similar to what your paper reports (16.45), but with CutMix (p=0.5) the result is somewhat worse than the reported value (14.23).
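For anyone following along, the cutmix_prob gate works roughly like this inside the training loop (a simplified sketch following the paper's description, not a copy of the repo's train.py; `model` and `criterion` are whatever network and loss you are using):

```python
import numpy as np
import torch

def rand_bbox(size, lam):
    """Sample a box covering roughly (1 - lam) of the image area, as described in the paper."""
    H, W = size[2], size[3]                      # input is NCHW
    cut_rat = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(H * cut_rat), int(W * cut_rat)
    cy, cx = np.random.randint(H), np.random.randint(W)
    bby1, bby2 = np.clip(cy - cut_h // 2, 0, H), np.clip(cy + cut_h // 2, 0, H)
    bbx1, bbx2 = np.clip(cx - cut_w // 2, 0, W), np.clip(cx + cut_w // 2, 0, W)
    return bby1, bby2, bbx1, bbx2

def cutmix_loss(model, criterion, input, target, beta=1.0, cutmix_prob=0.5):
    """Loss for one training step; CutMix is applied only with probability `cutmix_prob`."""
    if beta > 0 and np.random.rand() < cutmix_prob:   # cutmix_prob=0.0 -> plain baseline
        lam = np.random.beta(beta, beta)
        perm = torch.randperm(input.size(0))
        target_a, target_b = target, target[perm]
        bby1, bby2, bbx1, bbx2 = rand_bbox(input.size(), lam)
        input[:, :, bby1:bby2, bbx1:bbx2] = input[perm, :, bby1:bby2, bbx1:bbx2]
        # adjust lambda to the exact area of the pasted patch
        lam = 1.0 - (bby2 - bby1) * (bbx2 - bbx1) / float(input.size(2) * input.size(3))
        output = model(input)
        return lam * criterion(output, target_a) + (1.0 - lam) * criterion(output, target_b)
    output = model(input)                             # no mixing
    return criterion(output, target)
```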
Also, I conducted an experiment with ShakeDrop (with the ShakeDrop regularization code taken from https://github.com/owruby/shake-drop_pytorch).
As you can see here, the top-1 result claimed in the paper can be achieved only by using the 'maximum top-1 validation accuracy' during training, not the 'converged top-1 validation accuracy' after training.
So, here are my questions.
- How can I reproduce your result? Specifically, with your provided code and sample commands, I should be able to reproduce the 14.23% top-1 error with PyramidNet + CutMix. It would be great if you could provide the exact environment and command to reproduce the result, or perhaps this will help you find a problem in this repo.
- Did you use the 'last validation accuracy' after training or the 'best (peak) validation accuracy' during training? I saw some code tracking the best validation accuracy during training and printing that value before terminating, so I assume you used the 'best (peak) validation accuracy'.
Thanks. I look forward to hearing from you.