Maratyszcza / NNPACK

Acceleration package for neural networks on multi-core CPUs
BSD 2-Clause "Simplified" License

Performance compared to Caffe with OpenBlas #121

Closed. dongxiao92 closed this issue 6 years ago.

dongxiao92 commented 6 years ago

I tried to compare the performance of NNPACK against Caffe with OpenBLAS. I downloaded the latest version of NNPACK from GitHub and compiled it following the build guide. For Caffe and OpenBLAS I use the latest release versions (0.2.20 for OpenBLAS and 1.0 for Caffe). Inference time on VGG for a single image is recorded. NNPACK runs with a pthreadpool of 4 threads and the algorithm set to auto. But surprisingly, Caffe with OpenBLAS shows a much better result, which is inconsistent with other reported results. Results and hardware details are shown below.

| Input (NCHW) | Out channels | Kernel | Stride | Pad | OpenBLAS time | NNPACK time |
|---|---|---|---|---|---|---|
| 1 3 224 224 | 64 | 3 | 1 | 1 | 5.5 | 15.4307 |
| 1 64 224 224 | 64 | 3 | 1 | 1 | 85.54 | 65.6776 |
| 1 64 112 112 | 128 | 3 | 1 | 1 | 22.98 | 32.6236 |
| 1 128 112 112 | 128 | 3 | 1 | 1 | 48.13 | 51.7365 |
| 1 128 56 56 | 256 | 3 | 1 | 1 | 10.95 | 41.03 |
| 1 256 56 56 | 256 | 3 | 1 | 1 | 23 | 69.2396 |
| 1 256 28 28 | 512 | 3 | 1 | 1 | 10.19 | 88.3922 |
| 1 512 28 28 | 512 | 3 | 1 | 1 | 21.06 | 162.162 |
| 1 512 14 14 | 512 | 3 | 1 | 1 | 12.16 | 146.921 |

CPU: Intel(R) Xeon(R) CPU E7-4809 v3 @ 2.00GHz; GCC version: 4.8.5; Memory: 32 GB
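
For reference, the setup described above (a pthreadpool of 4 threads, algorithm auto) corresponds roughly to the initialization sketch below; it is illustrative only and not the code that was actually benchmarked.

```c
#include <nnpack.h>
#include <pthreadpool.h>

int main(void) {
    /* NNPACK must be initialized once before any convolution call. */
    if (nnp_initialize() != nnp_status_success) {
        return 1;  /* CPU is not supported by NNPACK */
    }

    /* Thread pool with 4 worker threads, as in the benchmark description. */
    pthreadpool_t threadpool = pthreadpool_create(4);

    /* ... per-layer nnp_convolution_inference calls, passing
     *     nnp_convolution_algorithm_auto and this threadpool ... */

    pthreadpool_destroy(threadpool);
    nnp_deinitialize();
    return 0;
}
```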

Can anyone give me some advice about this result? Did I use NNPACK incorrectly?

Maratyszcza commented 6 years ago

Do you use nnp_convolution_output or nnp_convolution_inference?

dongxiao92 commented 6 years ago

@Maratyszcza nnp_convolution_inference, because I noticed this method is used in Caffe's nnpack_convolution_layer when batch size = 1.

Maratyszcza commented 6 years ago

If you based your code on nnpack_convolution_layer in Caffe, note that it lacks two important optimizations:

  1. Pre-allocation of workspace buffers (a sketch follows this list). This is described in Maratyszcza/NNPACK#75
  2. Pre-computation of kernel transforms. See details in Maratyszcza/NNPACK#82
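
A minimal sketch of the workspace pre-allocation pattern (optimization 1), assuming the nnp_convolution_inference argument list from recent NNPACK headers; the signature has changed over time, so check nnpack.h in your checkout. The idea: pass a NULL workspace buffer once so NNPACK reports the size it needs, allocate that buffer once per layer, then reuse it on every forward pass instead of letting NNPACK allocate and free scratch memory internally. Names are illustrative and error handling is elided.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#include <stdlib.h>
#include <nnpack.h>

static void conv_with_preallocated_workspace(
        size_t input_channels, size_t output_channels,
        struct nnp_size input_size, struct nnp_padding input_padding,
        struct nnp_size kernel_size, struct nnp_size output_subsampling,
        const float* input, const float* kernel, const float* bias, float* output,
        pthreadpool_t threadpool) {
    /* Call 1: NULL workspace buffer, so NNPACK only reports the required size. */
    size_t workspace_size = 0;
    nnp_convolution_inference(
        nnp_convolution_algorithm_auto, nnp_convolution_transform_strategy_compute,
        input_channels, output_channels, input_size, input_padding,
        kernel_size, output_subsampling,
        input, kernel, bias, output,
        /*workspace_buffer=*/NULL, &workspace_size,
        nnp_activation_identity, NULL, threadpool, NULL);

    /* Allocate the workspace once (64-byte aligned); in a real layer,
     * cache this buffer across forward passes instead of reallocating it. */
    void* workspace = NULL;
    posix_memalign(&workspace, 64, workspace_size);

    /* Call 2: the actual convolution, reusing the pre-allocated workspace. */
    nnp_convolution_inference(
        nnp_convolution_algorithm_auto, nnp_convolution_transform_strategy_compute,
        input_channels, output_channels, input_size, input_padding,
        kernel_size, output_subsampling,
        input, kernel, bias, output,
        workspace, &workspace_size,
        nnp_activation_identity, NULL, threadpool, NULL);

    free(workspace);
}
```
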
dongxiao92 commented 6 years ago

Thanks! I will check on that and update the results.

Maratyszcza commented 6 years ago

@dongxiao92 Were you able to get better performance with these optimizations?

dongxiao92 commented 6 years ago

I tried both optimizations. Using pre-allocation improves performance in all configs. Details are shown below.

| Input (NCHW) | Out channels | Kernel | Stride | Pad | Time with pre-allocation |
|---|---|---|---|---|---|
| 1 3 224 224 | 64 | 3 | 1 | 1 | 15.72 |
| 1 64 224 224 | 64 | 3 | 1 | 1 | 52.49 |
| 1 64 112 112 | 128 | 3 | 1 | 1 | 27.37 |
| 1 128 112 112 | 128 | 3 | 1 | 1 | 41.01 |
| 1 128 56 56 | 256 | 3 | 1 | 1 | 26.79 |
| 1 256 56 56 | 256 | 3 | 1 | 1 | 49.51 |
| 1 256 28 28 | 512 | 3 | 1 | 1 | 56.75 |
| 1 512 28 28 | 512 | 3 | 1 | 1 | 111.93 |
| 1 512 14 14 | 512 | 3 | 1 | 1 | 95.41 |

For pre-computation of transformed kernels, the results are weird. Times for all of the above configs drop to a few microseconds, which is 10,000-100,000x faster. The kernel transform cannot possibly account for such a large fraction of the runtime, so I'm checking whether I did something wrong.

Maratyszcza commented 6 years ago

Did you check the error code when computing transforms? For some algorithms pre-computing transforms is not supported.

dongxiao92 commented 6 years ago

@Maratyszcza I check the status of each call to nnp_convolution_inference. If I understand correctly, pre-computation requires calling nnp_convolution_inference three times: once to compute the workspace size, once to compute the transformed filters, and once to compute the convolution result.

Maratyszcza commented 6 years ago

@dongxiao92 Actually, you would call it 4 times (see the sketch after this list):

  1. Compute the size of transformed kernels (it is returned in *workspace_size)
  2. Pre-compute kernel transforms (they will be stored in workspace_buffer)
  3. Compute workspace size for inference with pre-computed transforms
  4. Do the inference with pre-computed kernel transforms and pre-allocated buffers
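
A rough sketch of that four-call sequence, under the same caveats as above (the exact argument order should be checked against nnpack.h, and the names here are illustrative). The algorithm is fixed to a tile-based one (Winograd 8x8) rather than left on auto, since pre-computed transforms are not supported for every algorithm, and the status of the first call is checked as suggested earlier in the thread.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#include <stdio.h>
#include <stdlib.h>
#include <nnpack.h>

static void conv_with_precomputed_transforms(
        size_t input_channels, size_t output_channels,
        struct nnp_size input_size, struct nnp_padding input_padding,
        struct nnp_size kernel_size, struct nnp_size output_subsampling,
        const float* input, const float* kernel, const float* bias, float* output,
        pthreadpool_t threadpool) {
    /* Pick a tile-based algorithm explicitly; pre-computed transforms are not
     * supported for every algorithm (e.g. implicit GEMM). */
    const enum nnp_convolution_algorithm algo = nnp_convolution_algorithm_wt8x8;
    enum nnp_status status;

    /* Call 1: query the size of the transformed kernels (returned in *workspace_size). */
    size_t transformed_kernel_size = 0;
    status = nnp_convolution_inference(
        algo, nnp_convolution_transform_strategy_precompute,
        input_channels, output_channels, input_size, input_padding,
        kernel_size, output_subsampling,
        input, kernel, bias, output,
        /*workspace_buffer=*/NULL, &transformed_kernel_size,
        nnp_activation_identity, NULL, threadpool, NULL);
    if (status != nnp_status_success) {
        fprintf(stderr, "pre-computed transforms unsupported: nnp_status %d\n", (int) status);
        return;
    }

    /* Call 2: pre-compute the kernel transforms into the buffer (once per layer). */
    void* transformed_kernel = NULL;
    posix_memalign(&transformed_kernel, 64, transformed_kernel_size);
    nnp_convolution_inference(
        algo, nnp_convolution_transform_strategy_precompute,
        input_channels, output_channels, input_size, input_padding,
        kernel_size, output_subsampling,
        input, kernel, bias, output,
        transformed_kernel, &transformed_kernel_size,
        nnp_activation_identity, NULL, threadpool, NULL);

    /* Call 3: query the workspace size for inference that reuses the transforms;
     * the transformed kernels are now passed through the kernel argument. */
    size_t workspace_size = 0;
    nnp_convolution_inference(
        algo, nnp_convolution_transform_strategy_reuse,
        input_channels, output_channels, input_size, input_padding,
        kernel_size, output_subsampling,
        input, (const float*) transformed_kernel, bias, output,
        /*workspace_buffer=*/NULL, &workspace_size,
        nnp_activation_identity, NULL, threadpool, NULL);

    /* Call 4: the actual inference, with pre-computed transforms and a
     * pre-allocated workspace (both buffers are reused on later forward passes). */
    void* workspace = NULL;
    posix_memalign(&workspace, 64, workspace_size);
    nnp_convolution_inference(
        algo, nnp_convolution_transform_strategy_reuse,
        input_channels, output_channels, input_size, input_padding,
        kernel_size, output_subsampling,
        input, (const float*) transformed_kernel, bias, output,
        workspace, &workspace_size,
        nnp_activation_identity, NULL, threadpool, NULL);

    free(workspace);
    free(transformed_kernel);
}
```
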
dongxiao92 commented 6 years ago

@Maratyszcza Thank you very much for the help. I have updated the NNPACK results. The above two optimizations actually improve performance dramatically.

| Input (NCHW) | Out channels | Kernel | Stride | Pad | Original | Pre-allocation of workspace | Pre-computation of transformed filters + pre-allocation of workspace |
|---|---|---|---|---|---|---|---|
| 1 3 224 224 | 64 | 3 | 1 | 1 | 15.4307 | 15.72 | 15.4786 |
| 1 64 224 224 | 64 | 3 | 1 | 1 | 65.6776 | 52.49 | 51.873 |
| 1 64 112 112 | 128 | 3 | 1 | 1 | 32.6236 | 27.37 | 25.3044 |
| 1 128 112 112 | 128 | 3 | 1 | 1 | 51.7365 | 41.01 | 37.2813 |
| 1 128 56 56 | 256 | 3 | 1 | 1 | 41.03 | 26.79 | 19.0401 |
| 1 256 56 56 | 256 | 3 | 1 | 1 | 69.2396 | 49.51 | 35.8438 |
| 1 256 28 28 | 512 | 3 | 1 | 1 | 88.3922 | 56.75 | 28.4993 |
| 1 512 28 28 | 512 | 3 | 1 | 1 | 162.162 | 111.93 | 55.6859 |
| 1 512 14 14 | 512 | 3 | 1 | 1 | 146.921 | 95.41 | 39.7084 |

wangshankun commented 6 years ago

I tested on an i7-6700 and a Cortex-A72; NNPACK convolution shows poor performance.

Log on ARM A72: NNPACK convolution: 2.2 s; Caffe with OpenBLAS: 0.758 s

caffe:
M:256 N:9072 K:2400 alpha:1.000000 beta:0.000000 time:0.649014
M:256 N:9072 K:1 alpha:1.000000 beta:1.000000 time:0.005509
this_conv(im2col,sgemm,bias) using time:0.758716

$ ./bin/convolution-benchmark -ic 96 -oc 256 -is 141 251 -ks 5 5 --input-padding 2 --output-subsampling 2 2 -i 1
Batch size: 1
Input channels: 96
Output channels: 256
Input: 141x251 with implicit padding 2
Kernel: 5x5
Subsampling: 2x2
Algorithm: auto
Threads: 6
Iterations: 1
Time: 2269.129 ms
Input transform: 36.980 ms (1.6%) [2.7 GB/s]
Kernel transform: 33.437 ms (1.5%) [0.1 GB/s]
Output transform: 101.325 ms (4.5%) [2.4 GB/s]
Block multiplication: 2097.109 ms (92.4%)
Overhead: 0.278 ms (0.0%)

Maratyszcza commented 6 years ago

@wangshankun Strided convolution in NNPACK always uses the implicit GEMM algorithm, so it doesn't benefit from FFT/Winograd. In fact, you'd likely get better performance by running it with stride = 1 and then subsampling the output.
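
For illustration only: the suggested workaround would run the layer with output_subsampling = {1, 1} (so the Winograd/FFT paths apply) and then subsample the dense output. The helper below is a hypothetical sketch for NCHW layout, not part of NNPACK's API, and whether this ends up faster depends on the layer.

```c
#include <stddef.h>

/* Hypothetical helper: emulate a strided convolution by taking every
 * stride-th element of the stride-1 (dense) output, per channel, NCHW layout. */
static void subsample_nchw(const float* dense, float* strided, size_t channels,
                           size_t dense_h, size_t dense_w,
                           size_t stride_h, size_t stride_w) {
    const size_t out_h = (dense_h + stride_h - 1) / stride_h;
    const size_t out_w = (dense_w + stride_w - 1) / stride_w;
    for (size_t c = 0; c < channels; c++) {
        for (size_t y = 0; y < out_h; y++) {
            for (size_t x = 0; x < out_w; x++) {
                strided[(c * out_h + y) * out_w + x] =
                    dense[(c * dense_h + y * stride_h) * dense_w + x * stride_w];
            }
        }
    }
}
```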