Performance not so good on armv7 cpu

knsong commented 7 years ago

@Maratyszcza can you give me some hints about the cases in which NNPACK may perform better than im2col+sgemm using openblas/eigen on an armv7 cpu? I got a result similar to @conansherry's (my net architecture is: input 60x60, a stack of conv5x5, conv1x1, conv3x3 etc., stride == 1), and I'm wondering why, in detail, the fast algorithms in NNPACK seem to be inferior to openblas/eigen in this case.

Also, how should I understand your comment in issue #39:

When the number of channels on the input to convolution is small, the operation is similar to outer product: it is intrinsically memory bound, and fast algorithms in NNPACK do not help with performance.

Why would the fast algorithms in NNPACK be memory bound when the number of input channels to the convolution is small, and thus be inferior to openblas/eigen? I think that in this case im2col+sgemm using openblas/eigen also needs to perform an sgemm operation similar to an outer product and should be memory bound too, yet it is faster. What slows down NNPACK here?

I must have missed something and need to hack into nnpack more thoroughly. Anyway, any little advice will be of great help. Thanks.

Maratyszcza commented 7 years ago

NNPACK convolution (inference mode) performs best when:

Number of input and output channels is large (64+).
Input size is large, e.g. 512x512.
Kernel size is at least 3x3. 1x1 convolutions do not perform well; in the future, I will add a direct convolution algorithm specifically for this case. Convolutions with 5x5 and larger kernels and no stride are nearly always faster in NNPACK than with SGEMM-based convolution.
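
For readers unfamiliar with the SGEMM-based baseline mentioned above, here is a minimal pure-Python sketch of im2col followed by a single matrix multiplication. It is only an illustration: the function names (`im2col`, `matmul`, `conv2d_im2col`) are hypothetical, the naive `matmul` stands in for an optimized SGEMM from a BLAS library, and the sketch assumes stride 1, no padding, and cross-correlation semantics as commonly used in neural networks.

```python
def im2col(image, kh, kw):
    """Unroll a [C][H][W] image into a (C*kh*kw) x (H_out*W_out) matrix.

    Assumes stride 1 and no padding.
    """
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    out_h, out_w = height - kh + 1, width - kw + 1
    rows = []
    for c in range(channels):
        for dy in range(kh):
            for dx in range(kw):
                # One row per (channel, kernel offset); one column per output pixel.
                rows.append([image[c][y + dy][x + dx]
                             for y in range(out_h) for x in range(out_w)])
    return rows, out_h, out_w


def matmul(a, b):
    """Naive matrix multiply standing in for an optimized SGEMM call."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def conv2d_im2col(image, kernels):
    """Convolve image [C_in][H][W] with kernels [C_out][C_in][kh][kw]."""
    kh, kw = len(kernels[0][0]), len(kernels[0][0][0])
    unrolled, out_h, out_w = im2col(image, kh, kw)
    # Flatten each filter to a row of length C_in*kh*kw, matching im2col row order.
    weights = [[w for ch in k for row in ch for w in row] for k in kernels]
    flat = matmul(weights, unrolled)
    # Reshape each flat output row back to [H_out][W_out].
    return [[r[y * out_w:(y + 1) * out_w] for y in range(out_h)] for r in flat]
```

Note that the reduction dimension of the matrix multiplication here is `C_in * kh * kw`, so even with few input channels a 5x5 kernel gives the GEMM 25 times more work per loaded element than the channel count alone would suggest.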

Fast convolution in NNPACK consists of Fast Fourier/Winograd Transforms and GEMM-like operations. GEMM-like operations are compute-bound, and FFT/WT are bandwidth-bound. When the image size and the number of channels are large, GEMM-like operations in NNPACK dominate the runtime, and because NNPACK does fewer FLOPs than direct/SGEMM-based convolution, it performs better overall. When the number of input channels is small, the GEMM-like operations in NNPACK are a small fraction of the runtime, and the algorithmic speedup on these parts is not enough to compensate for the cost of the transforms.
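
The trade-off described above can be made concrete with a rough counting model. The sketch below is an assumption-laden illustration, not a measurement and not NNPACK code: it uses the textbook Winograd F(2x2, 3x3) tile (a 4x4 input tile producing a 2x2 output tile) purely for simplicity, since the tile sizes NNPACK actually uses are not stated in this thread; the function name `winograd_f2x2_3x3_profile` is hypothetical; and kernel transforms are assumed to be precomputed for inference.

```python
def winograd_f2x2_3x3_profile(c_in, c_out):
    """Per-tile work counts for Winograd F(2x2, 3x3) (illustrative model only)."""
    tile_elems = 4 * 4        # elements in one transformed tile
    outputs_per_tile = 2 * 2  # output pixels produced per tile
    # Direct convolution: 9 multiply-accumulates per output, per channel pair.
    direct_macs = outputs_per_tile * 9 * c_in * c_out
    # Transform domain: one multiply-accumulate per tile element, per channel pair.
    gemm_macs = tile_elems * c_in * c_out
    # Transformed elements written by input transforms (one tile per input
    # channel) plus those read by output transforms (one tile per output channel).
    transformed_elems = tile_elems * (c_in + c_out)
    return {
        "mac_saving": direct_macs / gemm_macs,
        "macs_per_transformed_elem": gemm_macs / transformed_elems,
    }


for c_in, c_out in [(3, 64), (64, 64), (256, 256)]:
    print(c_in, c_out, winograd_f2x2_3x3_profile(c_in, c_out))
```

In this model the multiply saving is a constant 2.25x, while the compute available to amortize each transformed element is `c_in * c_out / (c_in + c_out)`: 32 for 64 input and 64 output channels, but under 3 for 3 input and 64 output channels. With so little arithmetic per transformed element, the bandwidth-bound transforms take up most of the runtime.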

knsong commented 7 years ago

Thanks a lot for your answer. It's quite clear now.

Maratyszcza / NNPACK

Performance not so good on armv7 cpu #46