D-X-Y / AutoDL-Projects

Automated deep learning algorithms implemented in PyTorch.
MIT License

Different settings of downsampling of CifarResNet. #69

Closed: PkuDavidGuan closed this issue 4 years ago

PkuDavidGuan commented 4 years ago

Hi, thanks for your nice code. I wonder why the downsampling differs between the stride==2 case and the inplanes != planes*self.expansion case in https://github.com/D-X-Y/AutoDL-Projects/blob/967a000a33da12d1b10321a03702bdfdb7eb1130/lib/models/CifarResNet.py#L80. Specifically, you select the Downsample class when stride==2. My questions are:

1. Why add an avgpool layer before the 1x1 conv? I think setting the stride of the 1x1 conv to 2 is enough.
2. Why is the BN layer omitted after the 1x1 conv?

D-X-Y commented 4 years ago

If you use a 1-by-1 conv with stride=2, some of the features (3/4 of the spatial positions) are directly dropped. If you instead add an average pooling layer first, all features are taken into account. I think using or not using BN for this 1x1 conv is fine; it will not affect the performance.
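For concreteness, here is a minimal sketch of the two options (the channel counts are made up for illustration; see the Downsample class linked above for the actual code):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)  # toy feature map

# Option A: strided 1x1 conv -- it reads only 1 of every 4 spatial
# positions, so 3/4 of the features never influence the output.
conv_only = nn.Conv2d(16, 32, kernel_size=1, stride=2, bias=False)

# Option B (roughly what Downsample does): average-pool first so
# every position contributes, then a stride-1 1x1 conv to change
# the channel count.
pool_then_conv = nn.Sequential(
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(16, 32, kernel_size=1, stride=1, bias=False),
)

print(conv_only(x).shape)       # torch.Size([1, 32, 16, 16])
print(pool_then_conv(x).shape)  # torch.Size([1, 32, 16, 16])
```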

PkuDavidGuan commented 4 years ago

Thanks for the quick reply.

PkuDavidGuan commented 4 years ago

Sorry for bothering you. I have another question about the downsampling. When we obtain the pruned model with TAS, some residual blocks may end up with different numbers of input and output channels. For example, suppose the original residual block has 16 input channels and 16 output channels, so the shortcut path is the identity function f(x)=x. After pruning, the block has 16 input channels and 15 output channels, and the shortcut path must add a conv2d(16, 15, kernel=1, stride=1). This added conv layer can increase FLOPs considerably, possibly by more than the pruning saved.
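(To put rough numbers on this, with made-up sizes: a conv2d(16, 15, kernel=1) shortcut on a 32x32 feature map costs 16 x 15 x 32 x 32 ≈ 0.25M multiply-adds, while pruning one output channel from a 3x3 conv2d(16, 16) in the main path saves only 16 x 9 x 32 x 32 ≈ 0.15M, so the added shortcut can indeed outweigh the saving.)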

D-X-Y commented 4 years ago

Yes, indeed, this is a potential problem. One solution could be: do not add an additional conv2d; instead, you can just shortcut the first 15 channels and copy the last channel.
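A minimal sketch of that idea (hypothetical shapes, not code from this repo; here the extra input channel is simply dropped):

```python
import torch

def sliced_shortcut(x: torch.Tensor, out_channels: int) -> torch.Tensor:
    """Parameter-free shortcut for C_in >= C_out: pass through the
    first `out_channels` channels and drop the rest (no extra FLOPs
    or parameters, unlike a 1x1 conv)."""
    return x[:, :out_channels]

x = torch.randn(1, 16, 32, 32)         # block input: 16 channels
residual = torch.randn(1, 15, 32, 32)  # pruned block output: 15 channels
out = residual + sliced_shortcut(x, 15)
print(out.shape)  # torch.Size([1, 15, 32, 32])
```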

PkuDavidGuan commented 4 years ago

But the last channel may be an important one; if so, shortcutting only the first 15 channels is not reasonable. I think this is a general problem for channel pruning. Is there a common solution to tackle it? (I'm new to the pruning area and hope you could give me some advice.)

D-X-Y commented 4 years ago

I do not think 1/16 of the channels is that much more important or special than the other 15/16; DL will learn proper weights for a given architecture.