clovaai / CutMix-PyTorch

Official Pytorch implementation of CutMix regularizer

About the hyper-parameter alpha of mixup #38

Closed. rederyang closed this issue 3 years ago.

rederyang commented 3 years ago

In the paper mixup: Beyond Empirical Risk Minimization, Mixup appears to perform best on ImageNet when the hyper-parameter alpha is between 0.2 and 0.4. For ResNet-50, they report 77.9 accuracy on ImageNet with alpha = 0.2 and 200 epochs of training. But in this work, the Mixup result is reported with alpha = 1.0, which yields 77.42 accuracy. This might not be a fair comparison. In fact, after running some experiments, we find that the performance of Mixup and CutMix on ImageNet can be close with their respective preferred alpha settings (0.2 and 1.0). Have you tried any related experiments, and what do you think about it? I hope I have expressed my opinion clearly. Looking forward to your reply!
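For context on why alpha matters here, below is a minimal sketch of batch-level Mixup as described in the mixup paper (function and variable names are illustrative, not from this repository): alpha parameterizes the Beta(alpha, alpha) distribution from which the mixing ratio lambda is drawn, so alpha = 0.2 concentrates lambda near 0 or 1 (light mixing) while alpha = 1.0 samples lambda uniformly.

```python
# Illustrative sketch of batch-level Mixup (names are not from this repo).
import numpy as np
import torch

def mixup_batch(x, y, alpha=0.2):
    """Mix a batch with a shuffled copy of itself.

    lam ~ Beta(alpha, alpha): alpha=0.2 keeps lam close to 0 or 1 (weak mixing),
    alpha=1.0 draws lam uniformly from [0, 1] (stronger mixing on average).
    """
    lam = np.random.beta(alpha, alpha) if alpha > 0 else 1.0
    index = torch.randperm(x.size(0), device=x.device)
    mixed_x = lam * x + (1.0 - lam) * x[index]
    y_a, y_b = y, y[index]
    # The loss is then the corresponding mixture:
    # loss = lam * criterion(output, y_a) + (1 - lam) * criterion(output, y_b)
    return mixed_x, y_a, y_b, lam
```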

hellbell commented 3 years ago

@rederxz Thank you for your constructive opinion. As pointed out in our paper, we tried two alphas (0.5 and 1.0) for Mixup training and chose alpha=1.0 because it performed better than 0.5. So we didn't try alphas below 0.5, but as you said, it would be worth finding the optimal alpha for Mixup.

> In fact, after doing some experiments, we find that the performance of Mixup and Cutmix could be close on ImageNet with preferred alpha settings respectively (0.2 and 1.0).

Could you give more detail about this, such as the accuracy, training settings, and so on?

> Have you tried some related experiments and what do you think about it?

As I remember, in our training settings, CutMix was always better than Mixup for ResNet variants regardless of the alpha value. However, for lightweight architectures like EfficientNet variants, Mixup and CutMix show similar performance gains.
So I think there should be a better strategy. Some recent works (e.g., https://arxiv.org/pdf/2012.12877.pdf) use Mixup and CutMix at the same time for a performance boost; a rough sketch of that idea is below.
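As an illustration of that "use both" strategy (the function names and the 50/50 switch probability here are my own assumptions, not the DeiT code or this repository's), one could pick either operation per batch:

```python
# Hypothetical sketch: randomly apply either CutMix or Mixup to each batch,
# similar in spirit to DeiT's recipe (details are assumptions, not their code).
import numpy as np
import torch

def rand_bbox(size, lam):
    """Sample a CutMix box whose area ratio is roughly (1 - lam)."""
    _, _, h, w = size
    cut_ratio = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    return y1, y2, x1, x2

def mix_batch(x, y, mixup_alpha=0.2, cutmix_alpha=1.0, switch_prob=0.5):
    """Apply CutMix with probability switch_prob, otherwise Mixup."""
    index = torch.randperm(x.size(0), device=x.device)
    if np.random.rand() < switch_prob:
        # CutMix: paste a patch from the shuffled batch.
        lam = np.random.beta(cutmix_alpha, cutmix_alpha)
        y1, y2, x1, x2 = rand_bbox(x.size(), lam)
        x[:, :, y1:y2, x1:x2] = x[index, :, y1:y2, x1:x2]
        # Recompute lam to match the actual pasted area.
        lam = 1.0 - (y2 - y1) * (x2 - x1) / (x.size(2) * x.size(3))
    else:
        # Mixup: blend the whole image.
        lam = np.random.beta(mixup_alpha, mixup_alpha)
        x = lam * x + (1.0 - lam) * x[index]
    return x, y, y[index], lam
```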

rederyang commented 3 years ago

Thanks for your reply! Some of our experiments are still in progress. I will upload detailed results in a few days.

rederyang commented 3 years ago

Here are the results.

Our experiments:

| model (Resolution) | augmentation | regularization | batch size | optimizer | lr | epochs | lr_schedule | wd | acc | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet_vd-50 (160) | ResizedCrop | label smooth 0.1, mixup_batch alpha=0.2 | 256 * 4 | SGD | 0.1 * 4 | 200 | Cosine | 0.0001 | 78.58 | / |
| ResNet_vd-50 (160) | ResizedCrop | label smooth 0.1, mixup_batch alpha=1.0 | 256 * 4 | SGD | 0.1 * 4 | 200 | Cosine | 0.0001 | 77.55 | / |
| ResNet_vd-50 (160) | ResizedCrop | label smooth 0.1, cutmix_batch alpha=1.0 | 256 * 4 | SGD | 0.1 * 4 | 200 | Cosine | 0.0001 | 78.43 | / |

| model (Resolution) | augmentation | regularization | batch size | optimizer | lr | epochs | lr_schedule | wd | acc | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet_vd-50 avd (160) | ResizedCrop | label smooth 0.1, cutmix_batch alpha=0.2 | 256 * 4 | SGD | 0.1 * 4 | 200 | Cosine | 0.0001 | 79.13 | / |
| ResNet_vd-50 avd (160) | ResizedCrop | label smooth 0.1, cutmix_batch alpha=1.0 | 256 * 4 | SGD | 0.1 * 4 | 200 | Cosine | 0.0001 | 78.68 | / |

| model (Resolution) | augmentation | regularization | batch size | optimizer | lr | epochs | lr_schedule | wd | acc | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet_vd-50 (224) | ResizedCrop | mixup_batch alpha=0.2 | 256 * 4 | SGD | 0.1 * 4 | 300 | Cosine | 0.0001 | 79.00 | / |
| ResNet_vd-50 (224) | ResizedCrop | mixup_batch alpha=1.0 | 256 * 4 | SGD | 0.1 * 4 | 300 | Cosine | 0.0001 | 78.44 | / |
| ResNet_vd-50 (224) | ResizedCrop | cutmix_batch alpha=0.2 | 256 * 4 | SGD | 0.1 * 4 | 300 | Cosine | 0.0001 | 79.15 | / |
| ResNet_vd-50 (224) | ResizedCrop | cutmix_batch alpha=1.0 | 256 * 4 | SGD | 0.1 * 4 | 300 | Cosine | 0.0001 | 79.17 | / |

Results from PaddleClas:

| model (Resolution) | augmentation | regularization | batch size | optimizer | lr | epochs | lr_schedule | wd | acc | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 (224) | ResizedCrop | mixup_batch alpha=0.2 | 256 | SGD | 0.1 | 300 | Cosine | 0.0001 | 0.7828 | page |
| ResNet-50 (224) | ResizedCrop | cutmix_batch alpha=0.2 | 256 | SGD | 0.1 | 300 | Cosine | 0.0001 | 0.7839 | page |

Experiments from the CutMix paper:

| model (Resolution) | augmentation | regularization | batch size | optimizer | lr | epochs | lr_schedule | wd | acc | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 (224) | ResizedCrop | mixup_batch alpha=1.0 | 256 | SGD | 0.1 | 300 | Step | 0.0001 | 0.7742 | cutmix paper |
| ResNet-50 (224) | ResizedCrop | cutmix_batch alpha=1.0 | 256 | SGD | 0.1 | 300 | Step | 0.0001 | 0.7860 | cutmix paper |

We can see that Mixup's performance is better with alpha = 0.2 than with alpha = 1.0. Also, the gap between Mixup and CutMix becomes smaller when alpha equals 0.2, which is also confirmed by the results from PaddleClas.

hellbell commented 3 years ago

Thank you for sharing the results! These are great experiments. Given your results, I agree that alpha should be 0.2 for Mixup in ImageNet experiments. If there is a chance to revise or extend our paper, this information would be very useful :) At the same time, I'm curious about the CutMix result with alpha=1.0 in the PaddleClas table; I guess its performance would be better than with alpha=0.2. Thanks!

rederyang commented 3 years ago

I agree that alpha influences the performance of both CutMix and Mixup, and that it may have a greater impact on Mixup on ImageNet. Thanks for your reply. :smiley: