facebookresearch / mixup-cifar10

mixup: Beyond Empirical Risk Minimization

ResNet architectures on Cifar10 do not follow Kaiming He #3

Closed kudkudak closed 6 years ago

kudkudak commented 6 years ago

I think the ResNet models (ResNet-18, ResNet-110, etc.) do not match https://arxiv.org/abs/1603.05027. The reason is that Kaiming He used a different architecture (under the same name) for CIFAR-10 than for ImageNet. In particular, ResNet-18 should have ~0.3M params, while the one used in this repository seems to have ~10M. See Tab. 6 and the bottom of page 7 of https://arxiv.org/abs/1512.03385:

[screenshot of Table 6 from https://arxiv.org/abs/1512.03385, listing the CIFAR-10 ResNet architectures]

This is not a bug of course, but it might be a bit misleading.
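A quick way to see the gap is to count parameters directly. A minimal sketch (it assumes the repo's `models` package exposes `ResNet18()` the same way kuangliu/pytorch-cifar does):

```python
# Minimal sketch: count parameters of the ResNet-18 used here.
# Assumption: the repo's models package exposes ResNet18() like kuangliu/pytorch-cifar.
from models import ResNet18

net = ResNet18()
n_params = sum(p.numel() for p in net.parameters())
print(f"ResNet-18 as implemented here: {n_params / 1e6:.1f}M parameters")

# For comparison, the CIFAR-10 ResNets of Table 6 (ResNet-20 ... ResNet-110)
# use 16/32/64 channels and range from roughly 0.27M to 1.7M parameters.
```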

kudkudak commented 6 years ago

I am also not sure the ResNet follows the ImageNet version. Before the softmax there should be a final BN -> ReLU: https://github.com/KaimingHe/resnet-1k-layers/blob/master/resnet-pre-act.lua#L118, which, if I am reading the code correctly, is absent here: https://github.com/facebookresearch/mixup-cifar10/blob/master/models/resnet.py#L166.
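For reference, a minimal sketch (not the repo's code; names are illustrative) of how the pre-activation variant ends, i.e. the final BN -> ReLU applied before pooling and the classifier:

```python
import torch.nn as nn
import torch.nn.functional as F

class PreActHead(nn.Module):
    """Illustrative final stage of a pre-activation ResNet:
    BN -> ReLU -> global average pool -> linear classifier."""
    def __init__(self, in_channels: int, num_classes: int = 10):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, x):
        out = F.relu(self.bn(x))  # the BN + ReLU the Lua reference applies before pooling
        out = F.adaptive_avg_pool2d(out, 1).flatten(1)
        return self.fc(out)
```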

PabloRR100 commented 6 years ago

Actually, for CIFAR-10 the authors use only 3 stages of residual blocks, not 4. Also, there are no ResNet-18, ResNet-34, and so on for CIFAR-10. Instead, there are ResNet-20, ResNet-32, ResNet-44, ResNet-56, and ResNet-110.
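As a quick sanity check on the naming: the CIFAR-10 depths come from 3 stages of n basic blocks each (2 convs per block), plus the stem conv and the final fc, i.e. depth = 6n + 2. A minimal sketch of the arithmetic:

```python
def cifar_resnet_depth(n: int) -> int:
    """n residual blocks per stage, 3 stages, 2 convs per block, plus stem conv and fc."""
    return 6 * n + 2

for n in (3, 5, 7, 9, 18):
    print(f"n = {n:2d} -> ResNet-{cifar_resnet_depth(n)}")
# n =  3 -> ResNet-20
# n =  5 -> ResNet-32
# n =  7 -> ResNet-44
# n =  9 -> ResNet-56
# n = 18 -> ResNet-110
```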

I am working with ensembles of ResNets on CIFAR-10, so I have a PyTorch implementation of ResNets for CIFAR-10 here:

https://github.com/PabloRR100/ResNets_Ensembles/blob/master/resnetsCIFAR10.py

And here is some documentation I prepared for myself to understand the architecture: http://pabloruizruiz10.com/resources/CNNs/ResNet-on-CIFAR10.pdf

@kudkudak I would also like to know your thoughts in case you find any mistakes in my implementation.

hongyi-zhang commented 6 years ago

One of the authors here :) Thank you both for raising the issue!

We are actually aware of the discrepancy with the architectures in the original ResNet paper. The one we use follows https://github.com/kuangliu/pytorch-cifar, which is wider and built on pre-activation residual blocks (https://arxiv.org/pdf/1603.05027.pdf), hence it is called PreAct-ResNet-18 in our paper.

The main reason we chose the wider pre-activation version is its stronger performance. I am sorry if this has caused confusion for some readers. It would be good to have code and results (ERM and mixup) for the original architecture as well, but preferably they should not replace the current code, as it is essential for reproducing the results reported in the mixup paper.
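For clarity, here is a minimal sketch of what a pre-activation basic block looks like (illustrative only, not necessarily the exact code under models/ in this repo):

```python
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    """Pre-activation basic block: BN and ReLU come before each convolution."""
    def __init__(self, in_planes, planes, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_planes)
        self.conv1 = nn.Conv2d(in_planes, planes, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, 3, stride=1, padding=1, bias=False)
        self.shortcut = None
        if stride != 1 or in_planes != planes:
            self.shortcut = nn.Conv2d(in_planes, planes, 1, stride=stride, bias=False)

    def forward(self, x):
        out = F.relu(self.bn1(x))                      # pre-activation
        shortcut = self.shortcut(out) if self.shortcut is not None else x
        out = self.conv1(out)
        out = self.conv2(F.relu(self.bn2(out)))
        return out + shortcut
```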

Thanks.

PabloRR100 commented 6 years ago

Hi @hongyi-zhang , thanks for the response!

I have gone through the paper on pre-activation residual blocks you shared, but there are still a few points I don't quite follow.

Thanks.

ynd commented 6 years ago

Hi @kudkudak,

Thanks for your careful read of the paper and code. Indeed, the ResNet paper (https://arxiv.org/abs/1603.05027) uses two slightly different architectures under the same name. The name refers to the number of layers in the network; a name that captured every implementation detail would be pretty complicated (especially since cross-validation can change these specifics, such as the width of the layers).

We use the same convention in our work, and we have made sure to run all of our baselines. We appreciate that this leaves some ambiguity, which is why we have provided code for reference and reproducibility of the paper. That is the main purpose of this repo, and changing the code in such a way now would be detrimental to that purpose.

kudkudak commented 6 years ago

Sure, thanks :).
