clovaai / CutMix-PyTorch

Official Pytorch implementation of CutMix regularizer

About the hyper-parameter alpha of mixup #38

Closed. rederyang closed this issue 3 years ago.

rederyang commented 3 years ago

In the paper mixup: Beyond Empirical Risk Minimization, Mixup appears to perform best on ImageNet when the hyper-parameter alpha is between 0.2 and 0.4. For ResNet-50, they report 77.9 accuracy on ImageNet with alpha = 0.2 and 200 epochs of training. But in this work, the Mixup result is reported with alpha = 1.0, which yields 77.42 accuracy. This might not be a fair comparison. In fact, after running some experiments, we find that the performance of Mixup and CutMix on ImageNet can be close with their respective preferred alpha settings (0.2 and 1.0). Have you tried any related experiments, and what do you think about it? I hope I have expressed my opinion clearly. Looking forward to your reply!
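For context on why alpha matters here, below is a minimal sketch of batch-level Mixup as described in the mixup paper (function and variable names are illustrative, not from this repository): alpha parameterizes the Beta(alpha, alpha) distribution from which the mixing ratio lambda is drawn, so alpha = 0.2 concentrates lambda near 0 or 1 (light mixing) while alpha = 1.0 samples lambda uniformly.

```python
# Illustrative sketch of batch-level Mixup (names are not from this repo).
import numpy as np
import torch

def mixup_batch(x, y, alpha=0.2):
    """Mix a batch with a shuffled copy of itself.

    lam ~ Beta(alpha, alpha): alpha=0.2 keeps lam close to 0 or 1 (weak mixing),
    alpha=1.0 draws lam uniformly from [0, 1] (stronger mixing on average).
    """
    lam = np.random.beta(alpha, alpha) if alpha > 0 else 1.0
    index = torch.randperm(x.size(0), device=x.device)
    mixed_x = lam * x + (1.0 - lam) * x[index]
    y_a, y_b = y, y[index]
    # The loss is then the corresponding mixture:
    # loss = lam * criterion(output, y_a) + (1 - lam) * criterion(output, y_b)
    return mixed_x, y_a, y_b, lam
```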

hellbell commented 3 years ago

@rederxz Thank you for your constructive opinion. As pointed out in our paper, we tried two alphas (0.5 and 1.0) for Mixup training and chose alpha=1.0 because it performed better than 0.5. So we didn't try alphas below 0.5, but as you said, it would be worth finding the optimal alpha for Mixup.

> In fact, after doing some experiments, we find that the performance of Mixup and Cutmix could be close on ImageNet with preferred alpha settings respectively (0.2 and 1.0).

Could you give more detail about this, such as the accuracy, training settings, and so on?

> Have you tried some related experiments and what do you think about it?

As I remember, in our training settings, CutMix was always better than Mixup for ResNet variants regardless of the alpha value. However, for lightweight architectures like EfficientNet variants, Mixup and CutMix show similar performance gains.
So I think there should be a better strategy. Some recent works (e.g., https://arxiv.org/pdf/2012.12877.pdf) use Mixup and CutMix at the same time for a performance boost; a rough sketch of that idea is below.
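As an illustration of that "use both" strategy (the function names and the 50/50 switch probability here are my own assumptions, not the DeiT code or this repository's), one could pick either operation per batch:

```python
# Hypothetical sketch: randomly apply either CutMix or Mixup to each batch,
# similar in spirit to DeiT's recipe (details are assumptions, not their code).
import numpy as np
import torch

def rand_bbox(size, lam):
    """Sample a CutMix box whose area ratio is roughly (1 - lam)."""
    _, _, h, w = size
    cut_ratio = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    return y1, y2, x1, x2

def mix_batch(x, y, mixup_alpha=0.2, cutmix_alpha=1.0, switch_prob=0.5):
    """Apply CutMix with probability switch_prob, otherwise Mixup."""
    index = torch.randperm(x.size(0), device=x.device)
    if np.random.rand() < switch_prob:
        # CutMix: paste a patch from the shuffled batch.
        lam = np.random.beta(cutmix_alpha, cutmix_alpha)
        y1, y2, x1, x2 = rand_bbox(x.size(), lam)
        x[:, :, y1:y2, x1:x2] = x[index, :, y1:y2, x1:x2]
        # Recompute lam to match the actual pasted area.
        lam = 1.0 - (y2 - y1) * (x2 - x1) / (x.size(2) * x.size(3))
    else:
        # Mixup: blend the whole image.
        lam = np.random.beta(mixup_alpha, mixup_alpha)
        x = lam * x + (1.0 - lam) * x[index]
    return x, y, y[index], lam
```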

rederyang commented 3 years ago

Thanks for your reply! Some of our experiments are still in progress. I will upload detailed results in a few days.

rederyang commented 3 years ago

Here are the results.

Our experiments:

| model (Resolution) | augmentation | regularization | batch size | optimizer | lr | epochs | lr_schedule | wd | acc | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet_vd-50 (160) | ResizedCrop | label smooth 0.1, mixup_batch alpha=0.2 | 256 * 4 | SGD | 0.1 * 4 | 200 | Cosine | 0.0001 | 78.58 | / |
| ResNet_vd-50 (160) | ResizedCrop | label smooth 0.1, mixup_batch alpha=1.0 | 256 * 4 | SGD | 0.1 * 4 | 200 | Cosine | 0.0001 | 77.55 | / |
| ResNet_vd-50 (160) | ResizedCrop | label smooth 0.1, cutmix_batch alpha=1.0 | 256 * 4 | SGD | 0.1 * 4 | 200 | Cosine | 0.0001 | 78.43 | / |

| model (Resolution) | augmentation | regularization | batch size | optimizer | lr | epochs | lr_schedule | wd | acc | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet_vd-50 avd (160) | ResizedCrop | label smooth 0.1, cutmix_batch alpha=0.2 | 256 * 4 | SGD | 0.1 * 4 | 200 | Cosine | 0.0001 | 79.13 | / |
| ResNet_vd-50 avd (160) | ResizedCrop | label smooth 0.1, cutmix_batch alpha=1.0 | 256 * 4 | SGD | 0.1 * 4 | 200 | Cosine | 0.0001 | 78.68 | / |

| model (Resolution) | augmentation | regularization | batch size | optimizer | lr | epochs | lr_schedule | wd | acc | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet_vd-50 (224) | ResizedCrop | mixup_batch alpha=0.2 | 256 * 4 | SGD | 0.1 * 4 | 300 | Cosine | 0.0001 | 79.00 | / |
| ResNet_vd-50 (224) | ResizedCrop | mixup_batch alpha=1.0 | 256 * 4 | SGD | 0.1 * 4 | 300 | Cosine | 0.0001 | 78.44 | / |
| ResNet_vd-50 (224) | ResizedCrop | cutmix_batch alpha=0.2 | 256 * 4 | SGD | 0.1 * 4 | 300 | Cosine | 0.0001 | 79.15 | / |
| ResNet_vd-50 (224) | ResizedCrop | cutmix_batch alpha=1.0 | 256 * 4 | SGD | 0.1 * 4 | 300 | Cosine | 0.0001 | 79.17 | / |

Results from PaddleClas:

| model (Resolution) | augmentation | regularization | batch size | optimizer | lr | epochs | lr_schedule | wd | acc | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 (224) | ResizedCrop | mixup_batch alpha=0.2 | 256 | SGD | 0.1 | 300 | Cosine | 0.0001 | 0.7828 | page |
| ResNet-50 (224) | ResizedCrop | cutmix_batch alpha=0.2 | 256 | SGD | 0.1 | 300 | Cosine | 0.0001 | 0.7839 | page |

Experiments from the CutMix paper:

| model (Resolution) | augmentation | regularization | batch size | optimizer | lr | epochs | lr_schedule | wd | acc | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 (224) | ResizedCrop | mixup_batch alpha=1.0 | 256 | SGD | 0.1 | 300 | Step | 0.0001 | 0.7742 | cutmix paper |
| ResNet-50 (224) | ResizedCrop | cutmix_batch alpha=1.0 | 256 | SGD | 0.1 | 300 | Step | 0.0001 | 0.7860 | cutmix paper |

We can see that Mixup's performance is better with alpha = 0.2 than with alpha = 1.0. Also, the gap between Mixup and CutMix becomes smaller when alpha equals 0.2, which is also confirmed by the results from PaddleClas.

hellbell commented 3 years ago

Thank you for sharing the results! These are great experiments. Given your results, I agree that alpha should be 0.2 for Mixup in ImageNet experiments. If there is a chance to revise or extend our paper, this information would be very useful :) At the same time, I'm curious about the CutMix result with alpha=1.0 in the PaddleClas table; I guess its performance would be better than with alpha=0.2. Thanks!

rederyang commented 3 years ago

I agree that alpha influences the performance of both CutMix and Mixup, and that it may have a greater impact on Mixup on ImageNet. Thanks for your reply. :smiley: