I think this is because `nnp_convolution_output` is designed for processing larger batch sizes. I listed some results here:
```
./bin/convolution-benchmark -b <Batch-Size> -ic 256 -oc 256 -is 32 32 -ks 3 3 -m output -i 50 -a wt8x8 -ts compute
```
Batch Size | Time (ms) | Estimated Time Relative to `nnp_convolution_inference` |
---|---|---|
1 | 297.317 | 3.32x |
2 | 320.964 | 1.79x |
4 | 493.790 | 1.37x |
8 | 669.373 | 0.93x |
16 | 1127.133 | 0.79x |
When batch size > 8, `nnp_convolution_output` gets a win.
However, I think it might still be necessary to change the GEMM kernel used in `nnp_convolution_output` to better support the Arm backend.
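For clarity, here is a rough sketch of the two call patterns being compared: a single batched `nnp_convolution_output` call, whose Winograd transforms are amortized over the whole batch, versus a per-image loop over `nnp_convolution_inference`. The signatures follow the public `nnpack.h`, but they may differ slightly between NNPACK revisions; the 1-pixel padding and the NULL workspace/profile arguments are illustrative assumptions, not taken from the benchmark above.

```c
/* Sketch only: batched nnp_convolution_output vs. per-image
 * nnp_convolution_inference for a 256->256, 3x3 layer on 32x32 inputs.
 * Error handling omitted; signatures may vary across NNPACK versions. */
#include <stdlib.h>
#include <nnpack.h>
#include <pthreadpool.h>

int main(void) {
	const size_t batch_size = 16, input_channels = 256, output_channels = 256;
	const struct nnp_size input_size  = { .width = 32, .height = 32 };
	const struct nnp_size kernel_size = { .width = 3, .height = 3 };
	/* 1-pixel padding keeps the output at 32x32 (illustrative choice). */
	const struct nnp_padding padding = { .top = 1, .right = 1, .bottom = 1, .left = 1 };
	const struct nnp_size stride = { .width = 1, .height = 1 };

	const size_t image_elems  = input_channels  * input_size.width * input_size.height;
	const size_t output_elems = output_channels * input_size.width * input_size.height;
	float *input  = calloc(batch_size * image_elems, sizeof(float));
	float *kernel = calloc(output_channels * input_channels *
	                       kernel_size.width * kernel_size.height, sizeof(float));
	float *bias   = calloc(output_channels, sizeof(float));
	float *output = calloc(batch_size * output_elems, sizeof(float));

	nnp_initialize();
	pthreadpool_t pool = pthreadpool_create(0); /* one thread per core */

	/* (a) Batched path: input/kernel transforms are amortized across the
	 * batch, which is why it only wins at larger batch sizes. */
	nnp_convolution_output(nnp_convolution_algorithm_wt8x8,
		batch_size, input_channels, output_channels,
		input_size, padding, kernel_size,
		input, kernel, bias, output, pool, NULL);

	/* (b) Per-image path: one nnp_convolution_inference call per sample.
	 * The transform-strategy enum is named differently in older revisions. */
	for (size_t i = 0; i < batch_size; i++) {
		nnp_convolution_inference(nnp_convolution_algorithm_wt8x8,
			nnp_convolution_transform_strategy_compute,
			input_channels, output_channels,
			input_size, padding, kernel_size, stride,
			input + i * image_elems, kernel, bias,
			output + i * output_elems,
			NULL, NULL, /* let NNPACK manage the workspace */
			nnp_activation_identity, NULL, pool, NULL);
	}

	pthreadpool_destroy(pool);
	nnp_deinitialize();
	free(input); free(kernel); free(bias); free(output);
	return 0;
}
```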
Hi, I am testing NNPACK on a Raspberry Pi 3 B+ with a 4-core Cortex-A53 Arm CPU, and found `nnp_convolution_output` much slower than `nnp_convolution_inference` (around 4~5x slower). Could you give some insight into why `nnp_convolution_output` is so slow? Thanks!

I tried `nnp_convolution_output` and got:
Then I tried `nnp_convolution_inference` and got:
As you can see, the main difference comes from the block multiplication. So I made some changes to `nnp_convolution_output` to let it use the same kernel as `nnp_convolution_inference`:

Then the computation performance is improved:
but it is still much slower than `nnp_convolution_inference`.
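For reference, per-stage numbers like the ones discussed above (input/kernel/output transforms and block multiplication) come from NNPACK's `struct nnp_profile`, which both convolution entry points can fill in. Below is a minimal sketch of collecting that breakdown for the same layer shape; the 1-pixel padding and the exact call signature are assumptions and may differ from the setup actually measured here.

```c
/* Sketch: read NNPACK's per-stage timing breakdown via struct nnp_profile
 * (total, input_transform, kernel_transform, output_transform,
 * block_multiplication; all in seconds per nnpack.h). */
#include <stdio.h>
#include <stdlib.h>
#include <nnpack.h>
#include <pthreadpool.h>

int main(void) {
	const size_t batch = 1, ic = 256, oc = 256;
	const struct nnp_size in = { .width = 32, .height = 32 };
	const struct nnp_size ks = { .width = 3, .height = 3 };
	const struct nnp_padding pad = { .top = 1, .right = 1, .bottom = 1, .left = 1 };

	float *x = calloc(batch * ic * 32 * 32, sizeof(float));
	float *w = calloc(oc * ic * 3 * 3, sizeof(float));
	float *b = calloc(oc, sizeof(float));
	float *y = calloc(batch * oc * 32 * 32, sizeof(float));

	nnp_initialize();
	pthreadpool_t pool = pthreadpool_create(0);

	struct nnp_profile profile;
	nnp_convolution_output(nnp_convolution_algorithm_wt8x8,
		batch, ic, oc, in, pad, ks, x, w, b, y, pool, &profile);

	/* Print the breakdown; block multiplication is the stage that
	 * dominates on the A53 in the report above. */
	printf("total:                %.3f ms\n", profile.total * 1.0e3);
	printf("input transform:      %.3f ms\n", profile.input_transform * 1.0e3);
	printf("kernel transform:     %.3f ms\n", profile.kernel_transform * 1.0e3);
	printf("output transform:     %.3f ms\n", profile.output_transform * 1.0e3);
	printf("block multiplication: %.3f ms\n", profile.block_multiplication * 1.0e3);

	pthreadpool_destroy(pool);
	nnp_deinitialize();
	free(x); free(w); free(b); free(y);
	return 0;
}
```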