dongxiao92 closed this issue 6 years ago.
Do you use nnp_convolution_output or nnp_convolution_inference?
@Maratyszcza nnp_convolution_inference, because I noticed this method is used in nnpack_convolution_layer of Caffe when the batch size is 1.
If you based it on nnpack_convolution_layer in Caffe, it lacks two important optimizations:
- pre-allocation of the workspace buffer, so it is not reallocated on every call;
- pre-computation of the transformed kernels, so they are not re-transformed on every call.
Thanks! I will check on that and update the results.
@dongxiao92 Were you able to get better performance with these optimizations?
I tried both optimizations. Using pre-allocation improves performance in all configs. Details are shown below.
input (NCHW) | out-channels | kernel | stride | pad | time with pre-allocation (ms) |
---|---|---|---|---|---|
1 3 224 224 | 64 | 3 | 1 | 1 | 15.72 |
1 64 224 224 | 64 | 3 | 1 | 1 | 52.49 |
1 64 112 112 | 128 | 3 | 1 | 1 | 27.37 |
1 128 112 112 | 128 | 3 | 1 | 1 | 41.01 |
1 128 56 56 | 256 | 3 | 1 | 1 | 26.79 |
1 256 56 56 | 256 | 3 | 1 | 1 | 49.51 |
1 256 28 28 | 512 | 3 | 1 | 1 | 56.75 |
1 512 28 28 | 512 | 3 | 1 | 1 | 111.93 |
1 512 14 14 | 512 | 3 | 1 | 1 | 95.41 |
For pre-computation of transformed kernels, the results are weird. Times for all of the above configs are several microseconds, which is 10,000-100,000x faster. I don't think the kernel transformation can account for such a large share of the time, so I'm checking whether I did something wrong.
Did you check the error code when computing transforms? For some algorithms pre-computing transforms is not supported.
@Maratyszcza I checked the status of each call to nnp_convolution_inference. If I understand correctly, pre-computation requires calling nnp_convolution_inference three times: once to compute the workspace size, once to compute the transformed filters, and once to compute the convolution result.
@dongxiao92 Actually, you would call it 4 times:
1. with workspace_buffer = NULL, to query *workspace_size for the transformed kernels;
2. with an allocated workspace_buffer, to pre-compute the transformed kernels into it;
3. with workspace_buffer = NULL again, to query *workspace_size for the convolution itself;
4. with the pre-allocated workspace_buffer, to compute the convolution output.
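The workspace handling behind these calls follows a query-then-allocate convention: pass workspace_buffer = NULL to receive the required size through *workspace_size, allocate once, then reuse the buffer on every subsequent call. A minimal pure-C sketch of that convention, where query_or_run and preallocate_workspace are hypothetical stand-ins for nnp_convolution_inference (so the snippet runs without NNPACK):

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for nnp_convolution_inference's workspace
 * handling: with buffer == NULL it only writes the required size to
 * *size; with a buffer it performs the (mock) work. */
static int query_or_run(void *buffer, size_t *size) {
    const size_t required = 64;      /* pretend the op needs 64 bytes */
    if (buffer == NULL) {
        *size = required;            /* size-query call */
        return 0;
    }
    if (*size < required) return 1;  /* buffer too small: error */
    memset(buffer, 0, required);     /* real call: use the buffer */
    return 0;
}

/* Query once and allocate once, outside the inference loop;
 * returns the workspace size that was allocated. */
static size_t preallocate_workspace(void **buffer_out) {
    size_t size = 0;
    query_or_run(NULL, &size);
    *buffer_out = malloc(size);
    return size;
}
```

Doing the allocation once up front is exactly what removes the per-call malloc/free overhead measured in the tables above; the same two-step convention is applied twice (once for the transformed kernels, once for the convolution workspace), giving four calls in total.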
@Maratyszcza Thank you very much for the help. I updated the NNPACK results. The above two optimizations actually improve performance dramatically.
input (NCHW) | out-channels | kernel | stride | pad | original (ms) | pre-allocation of workspace (ms) | pre-computation of transformed filters + pre-allocation of workspace (ms) |
---|---|---|---|---|---|---|---|
1 3 224 224 | 64 | 3 | 1 | 1 | 15.4307 | 15.72 | 15.4786 |
1 64 224 224 | 64 | 3 | 1 | 1 | 65.6776 | 52.49 | 51.873 |
1 64 112 112 | 128 | 3 | 1 | 1 | 32.6236 | 27.37 | 25.3044 |
1 128 112 112 | 128 | 3 | 1 | 1 | 51.7365 | 41.01 | 37.2813 |
1 128 56 56 | 256 | 3 | 1 | 1 | 41.03 | 26.79 | 19.0401 |
1 256 56 56 | 256 | 3 | 1 | 1 | 69.2396 | 49.51 | 35.8438 |
1 256 28 28 | 512 | 3 | 1 | 1 | 88.3922 | 56.75 | 28.4993 |
1 512 28 28 | 512 | 3 | 1 | 1 | 162.162 | 111.93 | 55.6859 |
1 512 14 14 | 512 | 3 | 1 | 1 | 146.921 | 95.41 | 39.7084 |
I tested on an i7-6700 and an A72; NNPACK convolution performs poorly.
Log on ARM A72:
- NNPACK convolution: 2.2 s
- Caffe with OpenBLAS: 0.758 s

Caffe log:
```
M:256 N:9072 K:2400 alpha:1.000000 beta:0.000000 time:0.649014
M:256 N:9072 K:1 alpha:1.000000 beta:1.000000 time:0.005509
this_conv(im2col,sgemm,bias) using time:0.758716
```
```
$ ./bin/convolution-benchmark -ic 96 -oc 256 -is 141 251 -ks 5 5 --input-padding 2 --output-subsampling 2 2 -i 1
Batch size: 1
Input channels: 96
Output channels: 256
Input: 141x251 with implicit padding 2
Kernel: 5x5
Subsampling: 2x2
Algorithm: auto
Threads: 6
Iterations: 1
Time: 2269.129 ms
Input transform: 36.980 ms (1.6%) [2.7 GB/s]
Kernel transform: 33.437 ms (1.5%) [0.1 GB/s]
Output transform: 101.325 ms (4.5%) [2.4 GB/s]
Block multiplication: 2097.109 ms (92.4%)
Overhead: 0.278 ms (0.0%)
```
@wangshankun Strided convolution in NNPACK always uses the implicit GEMM algorithm; it doesn't benefit from FFT/Winograd. In fact, you'd likely get better performance by running it with stride = 1 and then subsampling the output.
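As a sketch of that suggestion: run the convolution with stride 1 over the full input, then keep every stride-th output pixel. The helper below (a hypothetical subsample_chw, not part of NNPACK) shows the subsampling step on a CHW float tensor:

```c
#include <stddef.h>

/* Subsample a CHW float tensor by keeping every stride-th element in
 * each spatial dimension. Combined with a stride-1 convolution, this
 * reproduces the output grid of a strided convolution. */
static void subsample_chw(const float *in, float *out,
                          size_t channels, size_t in_h, size_t in_w,
                          size_t stride_h, size_t stride_w) {
    const size_t out_h = (in_h + stride_h - 1) / stride_h;
    const size_t out_w = (in_w + stride_w - 1) / stride_w;
    for (size_t c = 0; c < channels; c++) {
        for (size_t y = 0; y < out_h; y++) {
            for (size_t x = 0; x < out_w; x++) {
                out[(c * out_h + y) * out_w + x] =
                    in[(c * in_h + y * stride_h) * in_w + x * stride_w];
            }
        }
    }
}
```

The stride-1 convolution computes more output pixels than needed, but it can use the fast FFT/Winograd paths, which may more than pay for the extra work plus this cheap, memory-bound pass.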
I tried to compare the performance of NNPACK against Caffe with OpenBLAS. I downloaded the latest version of NNPACK from GitHub and compiled it following the build guide. For Caffe and OpenBLAS, I use the latest release versions (0.2.20 for OpenBLAS and 1.0 for Caffe). Inference time on VGG for a single image is recorded. NNPACK runs with a pthreadpool of 4 threads and the algorithm set to auto. But surprisingly, Caffe with OpenBLAS shows a much better result, which is inconsistent with other reports. Results and hardware details are shown below.
CPU: Intel(R) Xeon(R) CPU E7-4809 v3 @ 2.00GHz
GCC version: 4.8.5
Memory: 32GB
Can anyone give me some advice about this result? Did I use NNPACK wrong?