clovaai / CutMix-PyTorch

Official Pytorch implementation of CutMix regularizer

Reproducibility Issue #7

Closed ildoonet closed 5 years ago

ildoonet commented 5 years ago

I have run your code 5 times in the environment below.

Two V100 GPUs
Python 3.6.7
PyTorch 1.0.0
Cuda 9.0

The command I used is:

python train.py \
--net_type pyramidnet \
--dataset cifar100 \
--depth 200 \
--alpha 240 \
--batch_size 64 \
--lr 0.25 \
--expname PyraNet200 \
--epochs 300 \
--beta 1.0 \
--cutmix_prob 0.5 \
--no-verbose

For the baseline, I set cutmix_prob=0.0 so that CutMix is not applied.

| Setting | Model & Augmentations | try1 | try2 | try3 | try4 | try5 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| cutmix p=0.0 | Pyramid200 (Converged) | 17.14 | 16.32 | 16.15 | 16.29 | 16.61 | 16.502 |
| cutmix p=0.0 | Pyramid200 (Best) | 17.01 | 16.02 | 16.01 | 16.17 | 16.35 | 16.312 |
| cutmix p=0.5 | CutMix (Converged) | 16.27 | 15.55 | 16.18 | 16.19 | 15.38 | 15.914 |
| cutmix p=0.5 | CutMix (Best) | 15.29 | 14.66 | 15.28 | 15.04 | 14.52 | 14.958 |

The baseline's top-1 error is similar to the value reported in your paper (16.45), but with CutMix (p=0.5) the result is somewhat worse than the reported value (14.23).
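For context, the gating that --beta and --cutmix_prob control looks roughly like the sketch below (variable and function names are illustrative, not necessarily those used in train.py); with cutmix_prob=0.0 the CutMix branch is never taken, which is the baseline setting:

```python
import numpy as np
import torch

def rand_bbox(size, lam):
    # Sample a box covering roughly (1 - lam) of the image area at a random center.
    W, H = size[3], size[2]
    cut_rat = np.sqrt(1.0 - lam)
    cut_w, cut_h = int(W * cut_rat), int(H * cut_rat)
    cx, cy = np.random.randint(W), np.random.randint(H)
    bbx1 = np.clip(cx - cut_w // 2, 0, W)
    bby1 = np.clip(cy - cut_h // 2, 0, H)
    bbx2 = np.clip(cx + cut_w // 2, 0, W)
    bby2 = np.clip(cy + cut_h // 2, 0, H)
    return bbx1, bby1, bbx2, bby2

def cutmix_step(model, criterion, images, targets, beta=1.0, cutmix_prob=0.5):
    # Apply CutMix with probability cutmix_prob; otherwise train on the clean batch.
    if beta > 0 and np.random.rand() < cutmix_prob:
        lam = np.random.beta(beta, beta)
        perm = torch.randperm(images.size(0), device=images.device)
        bbx1, bby1, bbx2, bby2 = rand_bbox(images.size(), lam)
        images[:, :, bby1:bby2, bbx1:bbx2] = images[perm, :, bby1:bby2, bbx1:bbx2]
        # Recompute lambda as the exact area ratio of the region that was kept.
        lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (images.size(-1) * images.size(-2)))
        output = model(images)
        loss = lam * criterion(output, targets) + (1 - lam) * criterion(output, targets[perm])
    else:
        loss = criterion(model(images), targets)
    return loss
```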

Also, I conducted an experiment with ShakeDrop (using the ShakeDrop regularization code from https://github.com/owruby/shake-drop_pytorch).

| Setting | Model & Augmentations | try1 | try2 | try3 | try4 | try5 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| cutmix p=0.5 | ShakeDrop+CutMix (Converged) | 14.06 | 14.00 | 14.16 | 13.86 | 14.00 | 14.016 |
| cutmix p=0.5 | ShakeDrop+CutMix (Best) | 13.67 | 13.81 | 13.80 | 13.69 | 13.62 | 13.718 |

As you can see, the top-1 result claimed in the paper can be achieved only by taking the best (peak) validation result during training, not the converged validation result after training.

So, here are my questions.

  1. How can I reproduce your result? In particular, with the provided code and sample commands, I should be able to reproduce the 14.23% top-1 error with PyramidNet+CutMix. It would be great if you could share the specific environment and command needed to reproduce the result, or perhaps this report helps you find a problem in this repo.

  2. Did you use the 'last validation accuracy' after training or the 'best (peak) validation accuracy' during training? I saw code that tracks the best validation accuracy during training and prints it before terminating, so I assume you used the best (peak) validation accuracy.

Thanks. I look forward to hearing from you.

hellbell commented 5 years ago

@ildoonet

  1. How can I reproduce your result? In particular, with the provided code and sample commands, I should be able to reproduce the 14.23% top-1 error with PyramidNet+CutMix. It would be great if you could share the specific environment and command needed to reproduce the result, or perhaps this report helps you find a problem in this repo.

We used PyTorch 1.0.0 and Tesla P40 GPUs. The paper's experiments were conducted on our cloud system (NSML). I recently re-tested our code on a local machine for CIFAR100 and ImageNet using this repo, and I got slightly lower performance on CIFAR100 (top-1 error 14.5~14.6, similar to your report) but better performance on ImageNet (top-1 error 21.4). One possible reason is the difference between the cloud system and the local machines. Note that the result (top-1 error 14.5 on CIFAR100) is still much better than the important baselines (Cutout, Mixup, etc.). In the camera-ready version of our paper, we may update the numbers to 14.5 on CIFAR100 and 21.4 on ImageNet for better reproducibility on local machines.

  2. Did you use the 'last validation accuracy' after training or the 'best (peak) validation accuracy' during training? I saw code that tracks the best validation accuracy during training and prints it before terminating, so I assume you used the best (peak) validation accuracy.

As you can see in the code, we choose the best validation accuracy.
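To make that concrete, the pattern is roughly the following (a minimal sketch with made-up helper names, not the exact code in this repo):

```python
def train_with_best_tracking(model, train_one_epoch, validate, num_epochs=300):
    # `train_one_epoch(model)` and `validate(model) -> float` (top-1 error on the
    # validation set) are hypothetical callables standing in for the real code.
    best_err1 = float("inf")
    last_err1 = float("inf")
    for epoch in range(num_epochs):
        train_one_epoch(model)
        last_err1 = validate(model)
        best_err1 = min(best_err1, last_err1)  # keep the best (peak) result seen so far
    # Both numbers are available at the end; we report the best one.
    print(f"best top-1 error: {best_err1:.2f}, last-epoch top-1 error: {last_err1:.2f}")
    return best_err1, last_err1
```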

Thanks!

ildoonet commented 5 years ago

@hellbell Thanks. Then I guess this reproducibility issue does not come from the environment.

I wonder whether it is right to use the best validation accuracy. As you can see, the converged model's accuracy is slightly lower than the best one, and it is hard to be sure that the best accuracy represents the model's true performance. When I worked on Fast AutoAugment, I used the converged value instead of the momentarily peaked value, and as far as I know, AutoAugment measures performance in the same way.

Anyway, thanks for the clarification.

hellbell commented 5 years ago

[Updated reply]

@ildoonet I agree with your reply on some points, and it is worth looking at final-performance (i.e., converged-performance) comparisons. But note that our paper also reports the best performance of the other algorithms, which we re-implemented, for a fair comparison. Only the few methods that we could not reproduce were reported with the scores from their original papers. Anyway, thank you for the constructive comments!

hellbell commented 5 years ago

@ildoonet For clarification and further discussion, I am re-opening this issue.

The baseline's top-1 error is similar to the value reported in your paper (16.45), but with CutMix (p=0.5) the result is somewhat worse than the reported value (14.23).

I ran our code on CIFAR100 three more times, and we got

| | at 300 epochs | best |
| --- | --- | --- |
| try1 | 14.78 | 14.23 |
| try2 | 15.44 | 14.50 |
| try3 | 15.00 | 14.68 |
| average | 15.07 | 14.47 |

Also, for ImageNet-1K, we got

| | at 300 epochs | best |
| --- | --- | --- |
| try1 | 21.20 | 21.19 |
| try2 | 21.61 | 21.61 |
| try3 | 21.40 | 21.40 |
| average | 21.403 | 21.400 |

Interestingly, we got the best performance near the last epoch of the training.

I wonder whether it is right to use the best validation accuracy. As you can see, the converged model's accuracy is slightly lower than the best one, and it is hard to be sure that the best accuracy represents the model's true performance.

For the ImageNet-1K task, many methods report their best validation accuracy during training because they cannot access the test set. Of course, for clarification we will add a statement to our final paper: 'We report the best performance during training.' We evaluated on the CIFAR datasets using the same evaluation strategy, and we did our best to reproduce the baselines (Mixup, Cutout, and so on) and report their best performance for a fair comparison.
But I have a question about the 'true performance' you mention. I am not sure that reporting the last epoch's performance is the only way to represent a model's true performance, because the model can fluctuate at the end of training and we cannot guarantee that it has converged by the last epoch. For this reason, researchers usually train models and pick the best one by validating on the validation set. In short, we choose the best model to represent the method's performance, and I think both approaches, selecting the best model or the last model, make sense for evaluating trained models. But your comments about the best vs. last model are well worth considering for future work.

When I worked on Fast AutoAugment, I used the converged value instead of the momentarily peaked value, and as far as I know, AutoAugment measures performance in the same way.

First, nice work on Fast AutoAugment! My guess is that Fast AutoAugment and AutoAugment use cosine learning rate decay, so they fluctuate less at the end of training and the best and last performance end up similar. I recently found that CutMix + a cosine learning rate works well on the CIFAR datasets, so we will report both the best and the last performance when using the cosine learning rate. I hope the gap between the best and last models will be smaller than with the current training scheme.
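What I mean by switching to cosine learning rate decay is roughly the following sketch in PyTorch (the model and optimizer settings here are illustrative, not necessarily those of this repo):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 2)  # stand-in for PyramidNet-200
optimizer = torch.optim.SGD(model.parameters(), lr=0.25,
                            momentum=0.9, weight_decay=1e-4, nesterov=True)
epochs = 300
# Cosine annealing decays the LR smoothly from 0.25 towards 0 over all epochs,
# instead of the abrupt drops of a step schedule.
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one epoch of (CutMix) training and validation goes here ...
    scheduler.step()
```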

ildoonet commented 5 years ago

I guess that if you mention both the best and the converged accuracy, it will be okay. Mentioning only a momentarily peaked value can be seen as cheating, or as overfitting to the validation set.

But as you say, the true performance of a model is hard to measure even if we have a held-out set used only for testing.

Also, I have trained many models with a cosine learning rate, but similar gaps appear there as well.

Anyway, thanks for your consideration and the detailed explanations. They give me a lot to think about.

hellbell commented 5 years ago

@ildoonet Thank you for your reply. I do understand your concerns, but I don't agree that reporting the best performance is cheating. As I said, the best model can surely be taken to represent the performance of the method. The difference between the best and the last model comes from the step-decay learning rate. In our case, using a cosine learning rate on CIFAR100, the best and last models are almost the same (within +-0.1% accuracy). All the experiments we re-implemented were conducted in the same setting, with the best model selected for every other method, so there is no cheating or fairness issue. Our best model's performance is not a momentary peak, because we ran the experiments several times and report the mean of the best performances.

GuoleiSun commented 5 years ago

I guess that if you mention both the best and the converged accuracy, it will be okay. Mentioning only a momentarily peaked value can be seen as cheating, or as overfitting to the validation set.

But as you say, the true performance of a model is hard to measure even if we have a held-out set used only for testing.

Also, I have trained many models with a cosine learning rate, but similar gaps appear there as well.

Anyway, thanks for your consideration and the detailed explanations. They give me a lot to think about.

If choosing the best performance is cheating, then many people are cheating, so I don't agree with your point. Rather, I think @hellbell is correct. Thanks for the interesting work!

JiyueWang commented 4 years ago

'Cheating' is a rather harsh word. However, comparing the peak value indeed benefits oscillating and risky methods.

ildoonet commented 4 years ago

I deeply apologize for the misrepresentation caused by my poor English and poor word choice. CutMix inspired me a lot and has helped me a lot in my research.