Closed knsong closed 7 years ago
NNPACK convolution (inference mode) performs best when the image size and the number of channels are large.
Fast convolution in NNPACK consists of Fast Fourier/Winograd transforms and GEMM-like operations. The GEMM-like operations are compute-bound, while the FFT/Winograd transforms are bandwidth-bound. When the image size and the number of channels are large, the GEMM-like operations dominate the runtime, and because NNPACK overall does fewer FLOPs than direct or SGEMM-based convolution, it performs better overall. When the number of input channels is small, the GEMM-like operations are a small fraction of the runtime, and the algorithmic speedup on those parts is not enough to compensate for the cost of the transforms.
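To make this concrete, here is a rough back-of-the-envelope multiply-count model (not NNPACK's actual cost model) comparing direct 3x3 convolution against Winograd F(2x2, 3x3). The 9-multiply and 16-multiply constants come from the standard Winograd derivation; the per-tile transform cost is an assumed constant, chosen only for illustration:

```python
# Rough multiply-count model: direct 3x3 conv vs Winograd F(2x2, 3x3).
# The transform cost per tile per channel is an assumption for illustration,
# not a measured NNPACK figure.

def direct_muls(h, w, cin, cout):
    # One 3x3 dot product per output pixel per (cin, cout) pair.
    return h * w * cin * cout * 9

def winograd_muls(h, w, cin, cout, transform_ops_per_tile=32):
    tiles = (h // 2) * (w // 2)          # each tile produces a 2x2 output block
    gemm = tiles * cin * cout * 16       # 16 elementwise muls per tile pair (vs 36 direct)
    transforms = tiles * (cin + cout) * transform_ops_per_tile
    return gemm + transforms

for cin in (3, 64, 256):
    speedup = direct_muls(56, 56, cin, cin) / winograd_muls(56, 56, cin, cin)
    print(f"cin=cout={cin}: direct/Winograd multiply ratio = {speedup:.2f}")
```

Under this toy model, the ratio is below 1 for very small channel counts (the transform term dominates) and approaches the asymptotic 2.25x reduction as the channel count grows, which matches the explanation above.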
Thanks a lot for your answer. It's quite clear now.
@Maratyszcza can you give me some hints about in which cases NNPACK may perform better than im2col+sgemm using openblas/eigen on an armv7 CPU? I got a result similar to @conansherry's (my net architecture is: input 60x60, a stack of conv5x5, conv1x1, conv3x3, etc., stride == 1), and I'm wondering why, in detail, the fast algorithms in NNPACK seem to be inferior to openblas/eigen in this case.
And how should I understand your comment in issue #39?
Why would the fast algorithms in NNPACK be memory-bound when the number of channels on the input to the convolution is small, and thus be inferior to openblas/eigen? I think in this case im2col+sgemm using openblas/eigen also needs to perform an `sgemm` operation similar to an outer product and be memory-bound, yet it is faster. What slows down NNPACK here? I must have missed something and need to hack into NNPACK more thoroughly. Anyway, any advice would be of great help. Thanks.
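For readers following along, the im2col+sgemm scheme being compared against can be sketched in a few lines of plain Python (this is an illustrative toy, not OpenBLAS/Eigen code; no padding, stride 1). It also shows why a small input-channel count makes the GEMM's reduction dimension short (only cin * k * k), which pushes it toward the memory-bound, outer-product-like regime the question mentions:

```python
# Minimal im2col + matrix-multiply convolution sketch (pure Python).
# x is [cin][h][w] nested lists; weight is [cout][cin*k*k]; stride 1, no padding.

def im2col(x, k):
    # Unroll k x k patches into columns of shape (cin*k*k, out_h*out_w).
    cin, h, w = len(x), len(x[0]), len(x[0][0])
    cols = []
    for c in range(cin):
        for di in range(k):
            for dj in range(k):
                cols.append([x[c][i + di][j + dj]
                             for i in range(h - k + 1)
                             for j in range(w - k + 1)])
    return cols  # reduction dimension = cin * k * k rows

def conv_as_gemm(x, weight, k):
    # Convolution becomes one matrix multiply: weight @ im2col(x).
    cols = im2col(x, k)
    return [[sum(wr[r] * cols[r][p] for r in range(len(cols)))
             for p in range(len(cols[0]))]
            for wr in weight]

# 1 input channel, 3x3 kernel of ones over a 4x4 image of ones:
x = [[[1.0] * 4 for _ in range(4)]]
w = [[1.0] * 9]
out = conv_as_gemm(x, w, 3)  # 2x2 output flattened; each entry sums nine ones
```

With cin = 1 and k = 3 the GEMM reduces over only 9 elements per output, so very little arithmetic amortizes each memory access, which is the regime where both im2col+sgemm and NNPACK's transform-domain GEMM become bandwidth-limited.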