Maratyszcza / NNPACK

Acceleration package for neural networks on multi-core CPUs

nnp_convolution_output much slower than nnp_convolution_inference #155

Closed jiecaoyu closed 6 years ago

jiecaoyu commented 6 years ago

Hi, I am testing NNPACK on a Raspberry Pi 3 B+ with a 4-core Cortex-A53 ARM CPU, and found that nnp_convolution_output is much slower than nnp_convolution_inference (around 4~5x slower). Could you give some insight into why nnp_convolution_output is so slow? Thanks!
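(For context, this is roughly how I understand the two entry points are invoked outside the bundled benchmark. This is a minimal, untested sketch: the argument lists follow my reading of nnpack.h at the time, including the workspace and activation parameters, and may differ between NNPACK revisions.)

/* compare_paths.c - hypothetical standalone example, not part of NNPACK.
 * Signatures follow my reading of nnpack.h; verify against your checkout. */
#include <stdio.h>
#include <stdlib.h>
#include <nnpack.h>

int main(void) {
	if (nnp_initialize() != nnp_status_success) {
		fprintf(stderr, "nnp_initialize failed\n");
		return 1;
	}
	pthreadpool_t threadpool = pthreadpool_create(4); /* 4 worker threads */

	const size_t input_channels = 256, output_channels = 256;
	const struct nnp_size input_size = { .width = 32, .height = 32 };
	const struct nnp_padding input_padding = { 0, 0, 0, 0 }; /* implicit padding 0 */
	const struct nnp_size kernel_size = { .width = 3, .height = 3 };
	const struct nnp_size output_subsampling = { .width = 1, .height = 1 };
	const size_t output_width = 30, output_height = 30; /* 32 - 3 + 1, no padding */

	float* input  = calloc(input_channels * 32 * 32, sizeof(float));
	float* kernel = calloc(output_channels * input_channels * 3 * 3, sizeof(float));
	float* bias   = calloc(output_channels, sizeof(float));
	float* output = calloc(output_channels * output_width * output_height, sizeof(float));

	/* Batched path (batch_size = 1 here). Passing NULL for the workspace
	 * arguments is assumed to let NNPACK manage the workspace internally. */
	enum nnp_status status = nnp_convolution_output(
		nnp_convolution_algorithm_wt8x8,
		1 /* batch_size */, input_channels, output_channels,
		input_size, input_padding, kernel_size,
		input, kernel, bias, output,
		NULL, NULL, nnp_activation_identity, NULL,
		threadpool, NULL /* profile */);
	printf("nnp_convolution_output:    status %d\n", (int) status);

	/* Single-image path, transform strategy "compute" as in the benchmark. */
	status = nnp_convolution_inference(
		nnp_convolution_algorithm_wt8x8,
		nnp_convolution_transform_strategy_compute,
		input_channels, output_channels,
		input_size, input_padding, kernel_size, output_subsampling,
		input, kernel, bias, output,
		NULL, NULL, nnp_activation_identity, NULL,
		threadpool, NULL);
	printf("nnp_convolution_inference: status %d\n", (int) status);

	free(input); free(kernel); free(bias); free(output);
	pthreadpool_destroy(threadpool);
	nnp_deinitialize();
	return 0;
}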

I tried nnp_convolution_output and got:

$ ./bin/convolution-benchmark -ic 256 -oc 256 -is 32 32 -ks 3 3 -m output -i 50 -a wt8x8 -ts compute
Batch size: 1
Input channels: 256
Output channels: 256
Input: 32x32 with implicit padding 0
Kernel: 3x3
Subsampling: 1x1
Algorithm: WT8x8
Threads: 4
Iterations: 50
Time: 443.646 ms
Input transform: 3.923 ms (0.9%) [0.7 GB/s]
Kernel transform: 44.700 ms (10.1%) [0.4 GB/s]
Output transform: 7.864 ms (1.8%) [0.3 GB/s]
Block multiplication: 386.999 ms (87.2%) [0.5 GFLOPS]
Overhead: 0.160 ms (0.0%)

Then I tried nnp_convolution_inference and got:

$ ./bin/convolution-benchmark -ic 256 -oc 256 -is 32 32 -ks 3 3 -m inference -i 50 -a wt8x8 -ts compute
Batch size: 1
Input channels: 256
Output channels: 256
Input: 32x32 with implicit padding 0
Kernel: 3x3
Subsampling: 1x1
Algorithm: WT8x8
Threads: 4
Iterations: 50
Time: 89.602 ms
Input transform: 2.885 ms (3.2%) [0.9 GB/s]
Kernel transform: 33.768 ms (37.7%) [0.6 GB/s]
Output transform: 3.966 ms (4.4%) [0.6 GB/s]
Block multiplication: 48.942 ms (54.6%) [4.3 GFLOPS]
Overhead: 0.042 ms (0.0%)

As you can see, the main difference comes from the block multiplication. So I made a change to nnp_convolution_output so that it uses the same GEMM kernel as nnp_convolution_inference:

diff --git a/src/convolution-output.c b/src/convolution-output.c
index 1522cfb..d772c95 100644
--- a/src/convolution-output.c
+++ b/src/convolution-output.c
@@ -386,8 +386,8 @@ static enum nnp_status compute_fast_convolution_output(
                                                                matrix_multiplication_context.full_gemm = nnp_hwinfo.cxgemm.cX_conjb_upto_mr_x_nr;
                                                        }
                                                } else {
-                                                       matrix_multiplication_context.fast_gemm = nnp_hwinfo.sxgemm.only_mr_x_nr;
-                                                       matrix_multiplication_context.full_gemm = nnp_hwinfo.sxgemm.upto_mr_x_nr;
+                                                       matrix_multiplication_context.fast_gemm = nnp_hwinfo.hxgemm.only_mr_x_nr;
+                                                       matrix_multiplication_context.full_gemm = nnp_hwinfo.hxgemm.upto_mr_x_nr;
                                                }
                                                pthreadpool_compute_2d_tiled(threadpool,
                                                        (pthreadpool_function_2d_tiled_t) compute_matrix_multiplication,

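In case the names in the diff are unfamiliar: as I understand it, only_mr_x_nr is the micro-kernel specialized for full MRxNR register blocks, upto_mr_x_nr is the generic variant for partial edge blocks, and the patch only changes which kernel table in nnp_hwinfo (sxgemm vs. hxgemm) those function pointers come from. Below is an illustrative sketch of that dispatch pattern with hypothetical names and signatures; it is not NNPACK code.

/* Illustrative sketch only (hypothetical names and signatures, not NNPACK
 * code): a tiled GEMM that dispatches between two micro-kernel variants
 * through function pointers, the way fast_gemm/full_gemm are used above. */
#include <stddef.h>
#include <stdio.h>

#define MR 4  /* register-block rows    (placeholder value) */
#define NR 3  /* register-block columns (placeholder value) */

typedef void (*gemm_ukernel_fn)(size_t k, size_t m, size_t n,
                                const float* a, const float* b, size_t ldb,
                                float* c, size_t ldc);

struct gemm_kernels {
	gemm_ukernel_fn only_mr_x_nr; /* specialized: exactly MR x NR block   */
	gemm_ukernel_fn upto_mr_x_nr; /* generic: m <= MR, n <= NR edge block */
};

/* Reference micro-kernel used for both slots in this sketch:
 * C[m x n] += A[m x k] * B[k x n] (A packed row-major, B/C with leading dims). */
static void reference_ukernel(size_t k, size_t m, size_t n,
                              const float* a, const float* b, size_t ldb,
                              float* c, size_t ldc) {
	for (size_t i = 0; i < m; i++)
		for (size_t j = 0; j < n; j++)
			for (size_t p = 0; p < k; p++)
				c[i * ldc + j] += a[i * k + p] * b[p * ldb + j];
}

static void tiled_gemm(const struct gemm_kernels* kernels,
                       size_t M, size_t N, size_t K,
                       const float* a, const float* b, float* c) {
	for (size_t i = 0; i < M; i += MR) {
		const size_t m = (M - i < MR) ? (M - i) : MR;
		for (size_t j = 0; j < N; j += NR) {
			const size_t n = (N - j < NR) ? (N - j) : NR;
			/* Full blocks take the fast kernel, edge blocks the generic one. */
			gemm_ukernel_fn ukernel = (m == MR && n == NR)
				? kernels->only_mr_x_nr
				: kernels->upto_mr_x_nr;
			ukernel(K, m, n, &a[i * K], &b[j], N, &c[i * N + j], N);
		}
	}
}

int main(void) {
	enum { M = 6, N = 5, K = 2 };
	float a[M * K], b[K * N], c[M * N] = { 0.0f };
	for (size_t i = 0; i < M * K; i++) a[i] = (float) i;
	for (size_t i = 0; i < K * N; i++) b[i] = 1.0f;

	const struct gemm_kernels kernels = {
		.only_mr_x_nr = reference_ukernel,
		.upto_mr_x_nr = reference_ukernel,
	};
	tiled_gemm(&kernels, M, N, K, a, b, c);
	printf("c[0]=%.0f c[%d]=%.0f\n", c[0], M * N - 1, c[M * N - 1]);
	return 0;
}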
With this change, the computation performance improves:

$ ./bin/convolution-benchmark -ic 256 -oc 256 -is 32 32 -ks 3 3 -m output -i 50 -a wt8x8 -ts compute
Batch size: 1
Input channels: 256
Output channels: 256
Input: 32x32 with implicit padding 0
Kernel: 3x3
Subsampling: 1x1
Algorithm: WT8x8
Threads: 4
Iterations: 50
Time: 297.317 ms
Input transform: 4.166 ms (1.4%) [0.6 GB/s]
Kernel transform: 46.298 ms (15.6%) [0.4 GB/s]
Output transform: 8.054 ms (2.7%) [0.3 GB/s]
Block multiplication: 238.641 ms (80.3%) [0.9 GFLOPS]
Overhead: 0.159 ms (0.1%)

but it is still much slower than nnp_convolution_inference.
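To quantify the remaining gap, here is a throwaway calculation over the per-iteration times reported above (all numbers copied from the benchmark output):

/* Quick arithmetic over the batch-1 timings reported above. */
#include <stdio.h>

int main(void) {
	/* Block multiplication / total time, per iteration, in milliseconds. */
	const double output_orig_mul = 386.999, output_orig_total = 443.646;
	const double output_hx_mul   = 238.641, output_hx_total   = 297.317; /* after the patch */
	const double inference_mul   = 48.942,  inference_total   = 89.602;

	printf("block multiplication: %.1fx slower originally, %.1fx slower after the patch\n",
	       output_orig_mul / inference_mul, output_hx_mul / inference_mul);
	printf("end to end:           %.1fx slower originally, %.1fx slower after the patch\n",
	       output_orig_total / inference_total, output_hx_total / inference_total);
	return 0;
}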

jiecaoyu commented 6 years ago

I think this is because nnp_convolution_output is designed for processing larger batch sizes. Here are some results:

./bin/convolution-benchmark -b <Batch-Size> -ic 256 -oc 256 -is 32 32 -ks 3 3 -m output -i 50 -a wt8x8 -ts compute
Batch size   Time (ms)   Estimated time relative to inference
         1     297.317   3.32x
         2     320.964   1.79x
         4     493.790   1.37x
         8     669.373   0.93x
        16    1127.133   0.79x

At a batch size of 8 or more, nnp_convolution_output comes out ahead.
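For reference, the relative-time column can be reproduced (up to rounding) by dividing the batched time by batch size times the 89.602 ms batch-1 inference time; a small sketch assuming that definition:

/* Reproduce the relative-time column above, assuming it is defined as
 * batched time / (batch size * batch-1 inference time). */
#include <stdio.h>

int main(void) {
	const double inference_batch1_ms = 89.602; /* nnp_convolution_inference, batch 1 */
	const unsigned batch_sizes[] = { 1, 2, 4, 8, 16 };
	const double output_ms[]     = { 297.317, 320.964, 493.790, 669.373, 1127.133 };

	for (unsigned i = 0; i < sizeof(batch_sizes) / sizeof(batch_sizes[0]); i++) {
		const double relative = output_ms[i] / (batch_sizes[i] * inference_batch1_ms);
		printf("batch %2u: %9.3f ms -> %.2fx\n", batch_sizes[i], output_ms[i], relative);
	}
	return 0;
}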

jiecaoyu commented 6 years ago

However, I think it might still be worth changing the GEMM kernel used in nnp_convolution_output to better support the ARM backend.