@rederxz Thank you for your constructive opinion. As pointed out in our paper, we tried two alphas (0.5 and 1.0) for mixup training and chose alpha=1.0 because it showed better performance than 0.5. So we didn't try alphas below 0.5, but as you said, it would be worth finding the optimal alpha for mixup.
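For anyone skimming the thread, here is a minimal sketch of batch-level mixup showing where alpha enters. This is only an illustration assuming PyTorch-style tensors, not the exact code in this repo or in PaddleClas; the name `mixup_batch` simply mirrors the label used in the tables further down.

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(images, targets, alpha=1.0):
    # lambda ~ Beta(alpha, alpha): alpha=1.0 gives a uniform lambda,
    # while a smaller alpha (e.g. 0.2) keeps lambda close to 0 or 1 (milder mixing).
    lam = np.random.beta(alpha, alpha)
    index = torch.randperm(images.size(0), device=images.device)
    mixed = lam * images + (1.0 - lam) * images[index]
    return mixed, targets, targets[index], lam

def mixup_loss(logits, targets_a, targets_b, lam):
    # The loss is mixed with the same lambda as the inputs.
    return lam * F.cross_entropy(logits, targets_a) + (1.0 - lam) * F.cross_entropy(logits, targets_b)
```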
> In fact, after doing some experiments, we find that the performance of mixup and CutMix can be close on ImageNet when each uses its preferred alpha (0.2 and 1.0, respectively).
Could you give more detail about this, such as the accuracy, training settings, and so on?
> Have you tried any related experiments, and what do you think about this?
As far as I remember, in our training settings CutMix was always better than mixup for ResNet variants, regardless of the alpha value.
However, for lightweight architectures like EfficientNet variants, mixup and CutMix show similar performance gains.
So I think there should be a better strategy. Some recent works (e.g., https://arxiv.org/pdf/2012.12877.pdf) apply mixup and CutMix at the same time to boost performance.
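To make that combined strategy concrete, here is a hedged sketch of per-batch switching between the `mixup_batch` sketch above and a CutMix-style variant, roughly in the spirit of the DeiT recipe. The helper names and the `switch_prob` argument are hypothetical, not taken from any of the repos discussed here, and it reuses the imports from the earlier snippet.

```python
def rand_bbox(h, w, lam):
    # Sample a box whose area is roughly (1 - lam) of the image, as in the CutMix paper.
    cut_ratio = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    return y1, y2, x1, x2

def cutmix_batch(images, targets, alpha=1.0):
    lam = np.random.beta(alpha, alpha)
    index = torch.randperm(images.size(0), device=images.device)
    y1, y2, x1, x2 = rand_bbox(images.size(2), images.size(3), lam)
    images = images.clone()
    images[:, :, y1:y2, x1:x2] = images[index, :, y1:y2, x1:x2]
    # Correct lambda to the area actually pasted (the box may be clipped at the border).
    lam = 1.0 - (y2 - y1) * (x2 - x1) / float(images.size(2) * images.size(3))
    return images, targets, targets[index], lam

def mixup_or_cutmix(images, targets, mixup_alpha=0.2, cutmix_alpha=1.0, switch_prob=0.5):
    # Each batch gets exactly one of the two augmentations, each with its own alpha.
    if np.random.rand() < switch_prob:
        return cutmix_batch(images, targets, cutmix_alpha)
    return mixup_batch(images, targets, mixup_alpha)
```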
Thanks for your reply! Some of our experiments are still in progress. I will upload detailed results in a few days.
Here are the results.
model (Resolution) | augmentation | regularization | batch size | optimizer | lr | epochs | lr_schedule | wd | acc | Reference |
---|---|---|---|---|---|---|---|---|---|---|
ResNet_vd-50 160 | ResizedCrop | label smooth 0.1 mixup_batch alpha=0.2 | 256 * 4 | SGD | 0.1 * 4 | 200 | Cosine | 0.0001 | 78.58 | / |
ResNet_vd-50 160 | ResizedCrop | label smooth 0.1 mixup_batch alpha=1.0 | 256 * 4 | SGD | 0.1 * 4 | 200 | Cosine | 0.0001 | 77.55 | / |
ResNet_vd-50 160 | ResizedCrop | label smooth 0.1 cutmix_batch alpha=1.0 | 256 * 4 | SGD | 0.1 * 4 | 200 | Cosine | 0.0001 | 78.43 | / |
model (Resolution) | augmentation | regularization | batch size | optimizer | lr | epochs | lr_schedule | wd | acc | Reference |
---|---|---|---|---|---|---|---|---|---|---|
ResNet_vd-50 avd 160 | ResizedCrop | label smooth 0.1 cutmix_batch alpha=0.2 | 256 * 4 | SGD | 0.1 * 4 | 200 | Cosine | 0.0001 | 79.13 | / |
ResNet_vd-50 avd 160 | ResizedCrop | label smooth 0.1 cutmix_batch alpha=1.0 | 256 * 4 | SGD | 0.1 * 4 | 200 | Cosine | 0.0001 | 78.68 | / |
model (Resolution) | augmentation | regularization | batch size | optimizer | lr | epochs | lr_schedule | wd | acc | Reference |
---|---|---|---|---|---|---|---|---|---|---|
ResNet_vd-50 224 | ResizedCrop | mixup_batch alpha=0.2 | 256 * 4 | SGD | 0.1 * 4 | 300 | Cosine | 0.0001 | 79.00 | / |
ResNet_vd-50 224 | ResizedCrop | mixup_batch alpha=1.0 | 256 * 4 | SGD | 0.1 * 4 | 300 | Cosine | 0.0001 | 78.44 | / |
ResNet_vd-50 224 | ResizedCrop | cutmix_batch alpha=0.2 | 256 * 4 | SGD | 0.1 * 4 | 300 | Cosine | 0.0001 | 79.15 | / |
ResNet_vd-50 224 | ResizedCrop | cutmix_batch alpha=1.0 | 256 * 4 | SGD | 0.1 * 4 | 300 | Cosine | 0.0001 | 79.17 | / |
model (Resolution) | augmentation | regularization | batch size | optimizer | lr | epochs | lr_schedule | wd | acc | Reference |
---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 224 | ResizedCrop | mixup_batch alpha=0.2 | 256 | SGD | 0.1 | 300 | Cosine | 0.0001 | 78.28 | page |
ResNet-50 224 | ResizedCrop | cutmix_batch alpha=0.2 | 256 | SGD | 0.1 | 300 | Cosine | 0.0001 | 78.39 | page |
model (Resolution) | augmentation | regularization | batch size | optimizer | lr | epochs | lr_schedule | wd | acc | Reference |
---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 224 | ResizedCrop | mixup_batch alpha=1.0 | 256 | SGD | 0.1 | 300 | Step | 0.0001 | 77.42 | cutmix paper |
ResNet-50 224 | ResizedCrop | cutmix_batch alpha=1.0 | 256 | SGD | 0.1 | 300 | Step | 0.0001 | 78.60 | cutmix paper |
We can see that mixup's performance is better when alpha equals 0.2 than when alpha equals 1.0. Also, the gap between mixup and CutMix becomes smaller when alpha equals 0.2, which is also confirmed by the results from PaddleClas.
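For intuition on why alpha matters so much for mixup: with lambda ~ Beta(alpha, alpha), alpha=1.0 makes lambda uniform on [0, 1], while alpha=0.2 pushes lambda toward 0 or 1, so most mixed images stay dominated by one of the two originals. A quick check (just an illustrative NumPy snippet, not from either codebase):

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha in (0.2, 1.0):
    lam = rng.beta(alpha, alpha, size=100_000)
    dominant = np.maximum(lam, 1.0 - lam)  # weight of the dominant image in each mix
    print(f"alpha={alpha}: mean dominant weight {dominant.mean():.2f}, "
          f"dominant weight >= 0.9 in {np.mean(dominant >= 0.9):.0%} of draws")
```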
Thank you for sharing the results! These are great experiments.
Given your results, I agree that alpha should be 0.2 for mixup in ImageNet experiments. If there's a chance to revise or extend our paper, this information would be very useful :)
At the same time, I'm curious about the CutMix result with alpha=1.0 in the PaddleClas table. I guess its performance would be better than with alpha=0.2.
Thanks!
I agree that alpha influences the performance of both CutMix and mixup, and that it may have a greater impact on mixup on ImageNet. Thanks for your reply! :smiley:
In the paper mixup: Beyond Empirical Risk Minimization, it seems that mixup performs best on ImageNet when the hyper-parameter alpha is between 0.2 and 0.4. For ResNet-50, they report 77.9 accuracy on ImageNet with alpha=0.2, trained for 200 epochs. But in this work, the mixup result is reported with alpha=1.0, which gives 77.42 accuracy. This might not be a fair comparison.

In fact, after doing some experiments, we find that the performance of mixup and CutMix can be close on ImageNet when each uses its preferred alpha (0.2 and 1.0, respectively). Have you tried any related experiments, and what do you think about this? I hope I have expressed my opinion clearly. Looking forward to your reply!