ShichenLiu / CondenseNet

CondenseNet: Lightweight CNN for mobile devices
MIT License

DenseNet-121 is faster than CondenseNet-74 (C=G=4) on GTX 1080 Ti #3

Open · ivankreso opened this issue 6 years ago

ivankreso commented 6 years ago

I compared the forward-pass speed of the larger ImageNet model with DenseNet-121, and the latter actually runs faster. After benchmarking, my guess is that the CondenseConv layer causes the slowdown, due to the memory transfers in ShuffleLayer and torch.index_select. @ShichenLiu, can you comment on this? Did you get better performance than DenseNet-121 in your experiments?
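For anyone who wants to reproduce the comparison, a minimal timing loop looks something like the sketch below (torchvision's DenseNet-121 shown as a stand-in; the CondenseNet model would be built from this repo, and the warm-up/iteration counts are arbitrary):

```python
import time

import torch
import torchvision.models as models

def benchmark(model, input_size=(1, 3, 224, 224), iters=100):
    """Average forward-pass time in ms; synchronize so GPU timing is honest."""
    model = model.cuda().eval()
    x = torch.randn(*input_size, device="cuda")
    with torch.no_grad():
        for _ in range(10):               # warm-up: cuDNN autotuning, allocator
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return (time.time() - start) / iters * 1000

print(f"DenseNet-121: {benchmark(models.densenet121()):.2f} ms/forward")
```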

ShichenLiu commented 6 years ago

Our model is mainly designed for mobile devices, on which actual inference time correlates strongly with theoretical complexity. However, the group convolution and index/shuffle operations are not efficiently implemented on GPUs.
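For context, the shuffle is essentially the standard reshape–transpose channel shuffle; a minimal sketch of that pattern (illustrative, not the exact code in this repo):

```python
import torch
import torch.nn as nn

class ChannelShuffle(nn.Module):
    """Standard channel shuffle: pure data movement, zero FLOPs."""
    def __init__(self, groups):
        super().__init__()
        self.groups = groups

    def forward(self, x):
        n, c, h, w = x.size()
        # (n, c, h, w) -> (n, groups, c/groups, h, w) -> swap the group dims
        x = x.view(n, self.groups, c // self.groups, h, w).transpose(1, 2)
        # .contiguous() materializes the permuted tensor: a full extra pass
        # over memory that costs no arithmetic but hurts on a bandwidth-bound GPU
        return x.contiguous().view(n, -1, h, w)
```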

lvdmaaten commented 6 years ago

GPUs tend to be memory-bound rather than compute-bound, particularly for small models that require additional memory transfers, such as ShuffleNets and CondenseNets. On mobile devices, embedded systems, etc., the ratio between compute (in FLOPS) and memory bandwidth is very different: convnets tend to be compute-bound on such platforms. If you ran the same comparison on such a platform, you would find that a CondenseNet is much faster than a DenseNet (see Table 5 of the paper for actual timing results on an ARM processor).
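To make the bound concrete, a quick roofline-style back-of-envelope (the 1080 Ti figures are its published fp32 peak and memory bandwidth; the ARM figures are purely illustrative):

```python
# An op is compute-bound only if its arithmetic intensity (FLOPs per byte
# moved) exceeds the hardware's ridge point (peak FLOPS / bandwidth).
gpu_flops, gpu_bw = 11.3e12, 484e9   # GTX 1080 Ti: ~11.3 TFLOPS fp32, ~484 GB/s
arm_flops, arm_bw = 10e9, 5e9        # hypothetical mobile core

print(f"GPU ridge point: {gpu_flops / gpu_bw:.1f} FLOPs/byte")   # ~23.3
print(f"ARM ridge point: {arm_flops / arm_bw:.1f} FLOPs/byte")   # ~2.0

# A channel shuffle performs 0 FLOPs yet reads and writes the whole tensor,
# so it sits far below either ridge point: pure memory traffic. Grouped
# convolutions cut FLOPs more than they cut bytes moved, pushing intensity
# down: a win where the ridge point is low (mobile), a wash or a loss where
# it is high (a desktop GPU).
```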

ivankreso commented 6 years ago

Thanks for the clarification. I suspected that was the reason after measuring the time spent in the bottleneck 1x1 layer and the grouped 3x3 layer: the forward pass spends twice as much time in the 1x1 layer as in the 3x3 layer. I think there is a way to avoid the additional memory transfers on GPUs, if the cuDNN implementation lets you specify a custom feature-map ordering after a grouped convolution. I don't know whether cuDNN exposes this, but if it did, you could remove all the feature-shuffling ops.
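As an illustration of the idea: when the layer that consumes the shuffled features is a dense 1x1 convolution, the fixed permutation can be folded into its weights offline, so no shuffle runs at inference time (a sketch; all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, c, h, w, groups = 2, 8, 4, 4, 2
perm = torch.arange(c).view(groups, c // groups).t().reshape(-1)  # shuffle order
inv_perm = torch.argsort(perm)

x = torch.randn(n, c, h, w)
conv = nn.Conv2d(c, 16, kernel_size=1, bias=False)

# Explicit shuffle: one extra pass over the tensor before the convolution.
y_shuffled = conv(x[:, perm])

# Folded: permute the input-channel dimension of the weights once, offline.
folded = nn.Conv2d(c, 16, kernel_size=1, bias=False)
with torch.no_grad():
    folded.weight.copy_(conv.weight[:, inv_perm])
y_folded = folded(x)

print(torch.allclose(y_shuffled, y_folded, atol=1e-6))  # True
```

Note this only works when the consuming layer is dense: if it is itself grouped, the shuffle deliberately crosses group boundaries and cannot be folded into the weights, which is exactly where cuDNN support for a custom output ordering would help.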