Maratyszcza / NNPACK

Acceleration package for neural networks on multi-core CPUs
BSD 2-Clause "Simplified" License

Performance compared to Caffe with OpenBlas #121

Closed. dongxiao92 closed this issue 6 years ago.

dongxiao92 commented 6 years ago

I tried to compare the performance of NNPACK against Caffe with OpenBLAS. I downloaded the latest version of NNPACK from GitHub and compiled it following the build guide. For Caffe and OpenBLAS I use the latest release versions (0.2.20 for OpenBLAS and 1.0 for Caffe). Inference time on VGG for a single image is recorded. NNPACK runs with a pthreadpool of 4 threads and the algorithm set to auto. But surprisingly, Caffe with OpenBLAS shows a much better result, which is inconsistent with other reported results. Results and hardware details are shown below.

| Input (NCHW) | Out channels | Kernel | Stride | Pad | OpenBLAS time | NNPACK time |
|---|---|---|---|---|---|---|
| 1 3 224 224 | 64 | 3 | 1 | 1 | 5.5 | 15.4307 |
| 1 64 224 224 | 64 | 3 | 1 | 1 | 85.54 | 65.6776 |
| 1 64 112 112 | 128 | 3 | 1 | 1 | 22.98 | 32.6236 |
| 1 128 112 112 | 128 | 3 | 1 | 1 | 48.13 | 51.7365 |
| 1 128 56 56 | 256 | 3 | 1 | 1 | 10.95 | 41.03 |
| 1 256 56 56 | 256 | 3 | 1 | 1 | 23 | 69.2396 |
| 1 256 28 28 | 512 | 3 | 1 | 1 | 10.19 | 88.3922 |
| 1 512 28 28 | 512 | 3 | 1 | 1 | 21.06 | 162.162 |
| 1 512 14 14 | 512 | 3 | 1 | 1 | 12.16 | 146.921 |

CPU: Intel(R) Xeon(R) CPU E7-4809 v3 @ 2.00GHz; GCC version: 4.8.5; Memory: 32 GB
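
For reference, the setup described above (a pthreadpool of 4 threads, algorithm auto) corresponds roughly to the initialization sketch below; it is illustrative only and not the code that was actually benchmarked.

```c
#include <nnpack.h>
#include <pthreadpool.h>

int main(void) {
    /* NNPACK must be initialized once before any convolution call. */
    if (nnp_initialize() != nnp_status_success) {
        return 1;  /* CPU is not supported by NNPACK */
    }

    /* Thread pool with 4 worker threads, as in the benchmark description. */
    pthreadpool_t threadpool = pthreadpool_create(4);

    /* ... per-layer nnp_convolution_inference calls, passing
     *     nnp_convolution_algorithm_auto and this threadpool ... */

    pthreadpool_destroy(threadpool);
    nnp_deinitialize();
    return 0;
}
```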

Can anyone give me some advice about this result? Did I use NNPACK incorrectly?

Maratyszcza commented 6 years ago

Do you use nnp_convolution_output or nnp_convolution_inference?

dongxiao92 commented 6 years ago

@Maratyszcza nnp_convolution_inference, because I noticed this method is used in Caffe's nnpack_convolution_layer when batch size = 1.

Maratyszcza commented 6 years ago

If you based your code on nnpack_convolution_layer in Caffe, note that it lacks two important optimizations:

  1. Pre-allocation of workspace buffers (a sketch follows this list). This is described in Maratyszcza/NNPACK#75
  2. Pre-computation of kernel transforms. See details in Maratyszcza/NNPACK#82
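
A minimal sketch of the workspace pre-allocation pattern (optimization 1), assuming the nnp_convolution_inference argument list from recent NNPACK headers; the signature has changed over time, so check nnpack.h in your checkout. The idea: pass a NULL workspace buffer once so NNPACK reports the size it needs, allocate that buffer once per layer, then reuse it on every forward pass instead of letting NNPACK allocate and free scratch memory internally. Names are illustrative and error handling is elided.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#include <stdlib.h>
#include <nnpack.h>

static void conv_with_preallocated_workspace(
        size_t input_channels, size_t output_channels,
        struct nnp_size input_size, struct nnp_padding input_padding,
        struct nnp_size kernel_size, struct nnp_size output_subsampling,
        const float* input, const float* kernel, const float* bias, float* output,
        pthreadpool_t threadpool) {
    /* Call 1: NULL workspace buffer, so NNPACK only reports the required size. */
    size_t workspace_size = 0;
    nnp_convolution_inference(
        nnp_convolution_algorithm_auto, nnp_convolution_transform_strategy_compute,
        input_channels, output_channels, input_size, input_padding,
        kernel_size, output_subsampling,
        input, kernel, bias, output,
        /*workspace_buffer=*/NULL, &workspace_size,
        nnp_activation_identity, NULL, threadpool, NULL);

    /* Allocate the workspace once (64-byte aligned); in a real layer,
     * cache this buffer across forward passes instead of reallocating it. */
    void* workspace = NULL;
    posix_memalign(&workspace, 64, workspace_size);

    /* Call 2: the actual convolution, reusing the pre-allocated workspace. */
    nnp_convolution_inference(
        nnp_convolution_algorithm_auto, nnp_convolution_transform_strategy_compute,
        input_channels, output_channels, input_size, input_padding,
        kernel_size, output_subsampling,
        input, kernel, bias, output,
        workspace, &workspace_size,
        nnp_activation_identity, NULL, threadpool, NULL);

    free(workspace);
}
```
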
dongxiao92 commented 6 years ago

Thanks! I will check on that and update the results.

Maratyszcza commented 6 years ago

@dongxiao92 Were you able to get better performance with these optimizations?

dongxiao92 commented 6 years ago

I tried both optimizations. Using pre-allocation improves performance in all configs. Details are shown below.

| Input (NCHW) | Out channels | Kernel | Stride | Pad | Time with pre-allocation |
|---|---|---|---|---|---|
| 1 3 224 224 | 64 | 3 | 1 | 1 | 15.72 |
| 1 64 224 224 | 64 | 3 | 1 | 1 | 52.49 |
| 1 64 112 112 | 128 | 3 | 1 | 1 | 27.37 |
| 1 128 112 112 | 128 | 3 | 1 | 1 | 41.01 |
| 1 128 56 56 | 256 | 3 | 1 | 1 | 26.79 |
| 1 256 56 56 | 256 | 3 | 1 | 1 | 49.51 |
| 1 256 28 28 | 512 | 3 | 1 | 1 | 56.75 |
| 1 512 28 28 | 512 | 3 | 1 | 1 | 111.93 |
| 1 512 14 14 | 512 | 3 | 1 | 1 | 95.41 |

For pre-computation of transformed kernels, the results are weird. Times for all of the above configs drop to a few microseconds, which is 10,000-100,000x faster. The kernel transform cannot possibly account for such a large fraction of the runtime, so I'm checking whether I did something wrong.

Maratyszcza commented 6 years ago

Did you check the error code when computing transforms? For some algorithms pre-computing transforms is not supported.

dongxiao92 commented 6 years ago

@Maratyszcza I check the status of each call to nnp_convolution_inference. If I understand correctly, pre-computation requires calling nnp_convolution_inference three times: once to compute the workspace size, once to compute the transformed filters, and once to compute the convolution result.

Maratyszcza commented 6 years ago

@dongxiao92 Actually, you would call it 4 times (see the sketch after this list):

  1. Compute the size of transformed kernels (it is returned in *workspace_size)
  2. Pre-compute kernel transforms (they will be stored in workspace_buffer)
  3. Compute workspace size for inference with pre-computed transforms
  4. Do the inference with pre-computed kernel transforms and pre-allocated buffers
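
A rough sketch of that four-call sequence, under the same caveats as above (the exact argument order should be checked against nnpack.h, and the names here are illustrative). The algorithm is fixed to a tile-based one (Winograd 8x8) rather than left on auto, since pre-computed transforms are not supported for every algorithm, and the status of the first call is checked as suggested earlier in the thread.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#include <stdio.h>
#include <stdlib.h>
#include <nnpack.h>

static void conv_with_precomputed_transforms(
        size_t input_channels, size_t output_channels,
        struct nnp_size input_size, struct nnp_padding input_padding,
        struct nnp_size kernel_size, struct nnp_size output_subsampling,
        const float* input, const float* kernel, const float* bias, float* output,
        pthreadpool_t threadpool) {
    /* Pick a tile-based algorithm explicitly; pre-computed transforms are not
     * supported for every algorithm (e.g. implicit GEMM). */
    const enum nnp_convolution_algorithm algo = nnp_convolution_algorithm_wt8x8;
    enum nnp_status status;

    /* Call 1: query the size of the transformed kernels (returned in *workspace_size). */
    size_t transformed_kernel_size = 0;
    status = nnp_convolution_inference(
        algo, nnp_convolution_transform_strategy_precompute,
        input_channels, output_channels, input_size, input_padding,
        kernel_size, output_subsampling,
        input, kernel, bias, output,
        /*workspace_buffer=*/NULL, &transformed_kernel_size,
        nnp_activation_identity, NULL, threadpool, NULL);
    if (status != nnp_status_success) {
        fprintf(stderr, "pre-computed transforms unsupported: nnp_status %d\n", (int) status);
        return;
    }

    /* Call 2: pre-compute the kernel transforms into the buffer (once per layer). */
    void* transformed_kernel = NULL;
    posix_memalign(&transformed_kernel, 64, transformed_kernel_size);
    nnp_convolution_inference(
        algo, nnp_convolution_transform_strategy_precompute,
        input_channels, output_channels, input_size, input_padding,
        kernel_size, output_subsampling,
        input, kernel, bias, output,
        transformed_kernel, &transformed_kernel_size,
        nnp_activation_identity, NULL, threadpool, NULL);

    /* Call 3: query the workspace size for inference that reuses the transforms;
     * the transformed kernels are now passed through the kernel argument. */
    size_t workspace_size = 0;
    nnp_convolution_inference(
        algo, nnp_convolution_transform_strategy_reuse,
        input_channels, output_channels, input_size, input_padding,
        kernel_size, output_subsampling,
        input, (const float*) transformed_kernel, bias, output,
        /*workspace_buffer=*/NULL, &workspace_size,
        nnp_activation_identity, NULL, threadpool, NULL);

    /* Call 4: the actual inference, with pre-computed transforms and a
     * pre-allocated workspace (both buffers are reused on later forward passes). */
    void* workspace = NULL;
    posix_memalign(&workspace, 64, workspace_size);
    nnp_convolution_inference(
        algo, nnp_convolution_transform_strategy_reuse,
        input_channels, output_channels, input_size, input_padding,
        kernel_size, output_subsampling,
        input, (const float*) transformed_kernel, bias, output,
        workspace, &workspace_size,
        nnp_activation_identity, NULL, threadpool, NULL);

    free(workspace);
    free(transformed_kernel);
}
```
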
dongxiao92 commented 6 years ago

@Maratyszcza Thank you very much for the help. I have updated the NNPACK results. The above two optimizations actually improve performance dramatically.

| Input (NCHW) | Out channels | Kernel | Stride | Pad | Original | Pre-allocation of workspace | Pre-computation of transformed filters + pre-allocation of workspace |
|---|---|---|---|---|---|---|---|
| 1 3 224 224 | 64 | 3 | 1 | 1 | 15.4307 | 15.72 | 15.4786 |
| 1 64 224 224 | 64 | 3 | 1 | 1 | 65.6776 | 52.49 | 51.873 |
| 1 64 112 112 | 128 | 3 | 1 | 1 | 32.6236 | 27.37 | 25.3044 |
| 1 128 112 112 | 128 | 3 | 1 | 1 | 51.7365 | 41.01 | 37.2813 |
| 1 128 56 56 | 256 | 3 | 1 | 1 | 41.03 | 26.79 | 19.0401 |
| 1 256 56 56 | 256 | 3 | 1 | 1 | 69.2396 | 49.51 | 35.8438 |
| 1 256 28 28 | 512 | 3 | 1 | 1 | 88.3922 | 56.75 | 28.4993 |
| 1 512 28 28 | 512 | 3 | 1 | 1 | 162.162 | 111.93 | 55.6859 |
| 1 512 14 14 | 512 | 3 | 1 | 1 | 146.921 | 95.41 | 39.7084 |

wangshankun commented 6 years ago

I tested on an i7-6700 and a Cortex-A72; NNPACK convolution shows poor performance.

Log on ARM A72: NNPACK convolution: 2.2 s; Caffe with OpenBLAS: 0.758 s

caffe:
M:256 N:9072 K:2400 alpha:1.000000 beta:0.000000 time:0.649014
M:256 N:9072 K:1 alpha:1.000000 beta:1.000000 time:0.005509
this_conv(im2col,sgemm,bias) using time:0.758716

$ ./bin/convolution-benchmark -ic 96 -oc 256 -is 141 251 -ks 5 5 --input-padding 2 --output-subsampling 2 2 -i 1
Batch size: 1
Input channels: 96
Output channels: 256
Input: 141x251 with implicit padding 2
Kernel: 5x5
Subsampling: 2x2
Algorithm: auto
Threads: 6
Iterations: 1
Time: 2269.129 ms
Input transform: 36.980 ms (1.6%) [2.7 GB/s]
Kernel transform: 33.437 ms (1.5%) [0.1 GB/s]
Output transform: 101.325 ms (4.5%) [2.4 GB/s]
Block multiplication: 2097.109 ms (92.4%)
Overhead: 0.278 ms (0.0%)

Maratyszcza commented 6 years ago

@wangshankun Strided convolution in NNPACK always uses the implicit GEMM algorithm, so it doesn't benefit from FFT/Winograd. In fact, you'd likely get better performance by running it with stride = 1 and then subsampling the output.
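
For illustration only: the suggested workaround would run the layer with output_subsampling = {1, 1} (so the Winograd/FFT paths apply) and then subsample the dense output. The helper below is a hypothetical sketch for NCHW layout, not part of NNPACK's API, and whether this ends up faster depends on the layer.

```c
#include <stddef.h>

/* Hypothetical helper: emulate a strided convolution by taking every
 * stride-th element of the stride-1 (dense) output, per channel, NCHW layout. */
static void subsample_nchw(const float* dense, float* strided, size_t channels,
                           size_t dense_h, size_t dense_w,
                           size_t stride_h, size_t stride_w) {
    const size_t out_h = (dense_h + stride_h - 1) / stride_h;
    const size_t out_w = (dense_w + stride_w - 1) / stride_w;
    for (size_t c = 0; c < channels; c++) {
        for (size_t y = 0; y < out_h; y++) {
            for (size_t x = 0; x < out_w; x++) {
                strided[(c * out_h + y) * out_w + x] =
                    dense[(c * dense_h + y * stride_h) * dense_w + x * stride_w];
            }
        }
    }
}
```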