ngaloppo opened 8 years ago
Note also that because convnet-benchmarks uses old-style prototxt, the NNPackConvolutionParameter message is not parsed correctly by Caffe (e.g. to set algorithm: FFT_16x16).
If the reported timings are per image (not per batch), then I take back my comment about the results not being reproducible. However, it would be nice to add a note to the README on how to enable NNPACK as per my instructions above: add
engine: NNPACK
to the relevant layers.
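For reference, a minimal sketch of what this looks like in a layer definition (layer name, blob names, and sizes here are illustrative, not taken from the benchmark prototxt):

```protobuf
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  convolution_param {
    engine: NNPACK   # route this layer's forward pass through NNPACK
    num_output: 256
    kernel_size: 5
    pad: 2
  }
}
```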
@ngaloppo The timings are per batch. The network parameters are from the cpu
branch of convnet-benchmarks
. Please see #2 and Maratyszcza/NNPACK#9 for details. The backward pass is not supported and is ignored.
@Maratyszcza thanks for the link to the cpu
branch of convnet-benchmarks
. That's useful.
Regarding the convolution algorithm: did you have to convert the prototxt to the new format to pick anything other than AUTO
(to produce the different columns in the benchmark result table)?
I changed the defaults in caffe.proto and recompiled Caffe for each algorithm.
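A sketch of the kind of change involved; the exact message and field names in the nnpack-pr branch may differ from this hypothetical excerpt:

```protobuf
// Hypothetical excerpt of caffe.proto in the nnpack-pr branch; the enum
// values mirror the algorithms benchmarked above, but the actual names
// in the patch may not match exactly.
message NNPackConvolutionParameter {
  enum Algorithm {
    AUTO = 0;
    WINOGRAD = 1;
    FFT_16x16 = 2;
    FFT_8x8 = 3;
  }
  // Changing this default (e.g. to FFT_16x16) and recompiling Caffe selects
  // the algorithm for all layers, even with old-style prototxt that cannot
  // express per-layer NNPACK parameters.
  optional Algorithm algorithm = 1 [default = AUTO];
}
```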
So this is for an Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz. I'm getting pretty good results, in line with your numbers, except for conv2. I'm not sure if that's related to the difference between i5 and i7, or something else. See below:
output_alexnet_openblas_auto.log
500:I0429 09:44:58.612160 1298 caffe.cpp:369] conv2/5x5_s1 forward: 246.104 ms.
506:I0429 09:44:58.612181 1298 caffe.cpp:369] conv3/3x3_s1 forward: 132.718 ms.
510:I0429 09:44:58.612195 1298 caffe.cpp:369] conv4/3x3_s1 forward: 167.838 ms.
514:I0429 09:44:58.612208 1298 caffe.cpp:369] conv5/3x3_s1 forward: 116.058 ms.
output_alexnet_openblas_fft_8x8.log
500:I0429 09:44:42.281219 1285 caffe.cpp:369] conv2/5x5_s1 forward: 381.438 ms.
506:I0429 09:44:42.281240 1285 caffe.cpp:369] conv3/3x3_s1 forward: 298.638 ms.
510:I0429 09:44:42.281255 1285 caffe.cpp:369] conv4/3x3_s1 forward: 376.771 ms.
514:I0429 09:44:42.281268 1285 caffe.cpp:369] conv5/3x3_s1 forward: 268.936 ms.
output_alexnet_mkl_fft_16x16.log
500:I0429 09:43:10.256397 1210 caffe.cpp:369] conv2/5x5_s1 forward: 218.977 ms.
506:I0429 09:43:10.256419 1210 caffe.cpp:369] conv3/3x3_s1 forward: 50.3675 ms.
510:I0429 09:43:10.256433 1210 caffe.cpp:369] conv4/3x3_s1 forward: 61.5275 ms.
514:I0429 09:43:10.256446 1210 caffe.cpp:369] conv5/3x3_s1 forward: 42.647 ms.
output_alexnet_openblas_fft_16x16.log
500:I0429 09:44:18.571959 1272 caffe.cpp:369] conv2/5x5_s1 forward: 245.449 ms.
506:I0429 09:44:18.571979 1272 caffe.cpp:369] conv3/3x3_s1 forward: 131.85 ms.
510:I0429 09:44:18.571992 1272 caffe.cpp:369] conv4/3x3_s1 forward: 168.14 ms.
514:I0429 09:44:18.572006 1272 caffe.cpp:369] conv5/3x3_s1 forward: 115.662 ms.
output_alexnet_mkl_auto.log
500:I0429 09:43:35.835698 1243 caffe.cpp:369] conv2/5x5_s1 forward: 191.967 ms.
506:I0429 09:43:35.835721 1243 caffe.cpp:369] conv3/3x3_s1 forward: 48.5533 ms.
510:I0429 09:43:35.835734 1243 caffe.cpp:369] conv4/3x3_s1 forward: 60.2299 ms.
514:I0429 09:43:35.835747 1243 caffe.cpp:369] conv5/3x3_s1 forward: 44.0158 ms.
output_alexnet_openblas_vanilla.log
440:I0429 09:44:02.130417 1257 caffe.cpp:369] conv2/5x5_s1 forward: 346.287 ms.
446:I0429 09:44:02.130439 1257 caffe.cpp:369] conv3/3x3_s1 forward: 176.933 ms.
450:I0429 09:44:02.130452 1257 caffe.cpp:369] conv4/3x3_s1 forward: 257.505 ms.
454:I0429 09:44:02.130465 1257 caffe.cpp:369] conv5/3x3_s1 forward: 177.364 ms.
output_alexnet_mkl_fft_8x8.log
500:I0429 09:43:24.372515 1227 caffe.cpp:369] conv2/5x5_s1 forward: 266.223 ms.
506:I0429 09:43:24.372536 1227 caffe.cpp:369] conv3/3x3_s1 forward: 100.112 ms.
510:I0429 09:43:24.372550 1227 caffe.cpp:369] conv4/3x3_s1 forward: 125.635 ms.
514:I0429 09:43:24.372565 1227 caffe.cpp:369] conv5/3x3_s1 forward: 87.0149 ms.
output_alexnet_mkl_vanilla.log
440:I0429 09:42:58.014567 1187 caffe.cpp:369] conv2/5x5_s1 forward: 331.863 ms.
446:I0429 09:42:58.014590 1187 caffe.cpp:369] conv3/3x3_s1 forward: 118.221 ms.
450:I0429 09:42:58.014605 1187 caffe.cpp:369] conv4/3x3_s1 forward: 198.266 ms.
454:I0429 09:42:58.014619 1187 caffe.cpp:369] conv5/3x3_s1 forward: 126.541 ms.
@ngaloppo Do you use the prototxt from convnet-benchmarks
? Specifications from other sources (e.g. the Caffe model zoo) may have different image sizes or numbers of channels in the hidden layers.
@Maratyszcza Yes, from the cpu
branch. I did convert those to the new prototxt format (using tools/upgrade_net_proto_text
) so that I could change the convolution algorithm without rebuilding, but that shouldn't have caused any topological changes.
How many threads are running here? Can you control the number of threads for NNPACK? Even when I set OMP_NUM_THREADS to 1, I can see multiple threads running in parallel in htop.
@anijain2305 NNPACK uses OMP_NUM_THREADS
threads if the variable is set, or all logical CPUs if it is not specified.
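So a single-threaded run would look something like this (the commented-out caffe time invocation is illustrative of the convnet-benchmarks usage, not an exact path):

```shell
# NNPACK sizes its thread pool from OMP_NUM_THREADS when the variable is set;
# when it is unset, NNPACK uses all logical CPUs. Export it before launching:
export OMP_NUM_THREADS=1
# ./build/tools/caffe time --model=alexnet.prototxt --iterations=10
echo "benchmarking with OMP_NUM_THREADS=$OMP_NUM_THREADS"
```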
Yes, I also got results similar to what @ngaloppo reported.
OpenBLAS + FFT_16x16 on an i7 machine:
I0731 21:06:06.510483 12768 caffe.cpp:369] conv2/5x5_s1 forward: 311.694 ms.
I0731 21:06:06.510542 12768 caffe.cpp:369] conv3/3x3_s1 forward: 162.253 ms.
I0731 21:06:06.510582 12768 caffe.cpp:369] conv4/3x3_s1 forward: 452.496 ms.
I0731 21:06:06.510622 12768 caffe.cpp:369] conv5/3x3_s1 forward: 169.524 ms.
@Maratyszcza Hi, unfortunately I also cannot reproduce the results in the NNPACK README.md on my i7-4720HQ machine. I used the --enable-psimd configuration and compiled the latest NNPACK version. For timing I chose nnpack-pr and modified a few lines of code to fit the new NNPACK interface. But when I add engine: NNPACK inside conv_param, the relevant convolution layers actually become slower (backward is very fast because it is not implemented). I have tried a few things but still can't solve this problem; looking forward to your help, thanks. (I use the prototxt from the cpu branch of convnet-benchmarks directly, and the caffe time command for timing.)
@wangxi123 If you want to reproduce the results from the README, don't use the --enable-psimd
option.
@Maratyszcza Well, I just want to measure the conv speedup compared to not adding engine: NNPACK in conv_param, but it seems I don't get any speedup; I don't know whether I left out some necessary steps. That's OK, I will try again on another machine that has AVX2 instructions. Should I use the latest NNPACK with nnpack-pr? I hope you can recommend an NNPACK version for me.
@wangxi123 When you add engine: NNPACK
, Caffe will use the NNPACK implementation. If NNPACK is configured with --enable-psimd
, it will be a generic small-SIMD implementation using SSE2. If you configure NNPACK without the --enable-psimd
option, it will use the assembly implementation for the AVX2 instruction set.
@Maratyszcza I'm pleased to see some speedup (~1.3x) on my machine with the AVX2 instruction set. But when I change the algorithm default in proto/caffe.proto and recompile Caffe, there seems to be little difference between the AUTO and FFT_16x16 options; I'm confused. What's more, when I run with the WINOGRAD option, Caffe crashes and I get the message Check failed: nnp_status_success == status (0 vs. 26). Is that expected? Thank you for your patience.
@wangxi123 The WINOGRAD
algorithm is implemented only for 3x3 kernels. AUTO
will choose an algorithm automatically, among FFT, Winograd transform, and implicit GEMM.
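For instance, a layer that the WINOGRAD path can accept must use a 3x3 kernel; a sketch (layer name, blob names, and channel counts are illustrative):

```protobuf
layer {
  name: "conv2_3x3"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2_3x3"
  convolution_param {
    engine: NNPACK
    num_output: 256
    kernel_size: 3   # WINOGRAD supports only 3x3 kernels; a 5x5 kernel
    pad: 1           # makes NNPACK return an error status instead
    stride: 1
  }
}
```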
@Maratyszcza Yes, got it. I modified conv2 to use 3x3 kernels with pad 1 to test the WINOGRAD algorithm, and I get ~1.6x speedup on AlexNet and ~2.2x speedup on Overfeat for conv2-conv5; it's amazing. However, when I use the same prototxt to test the FFT_8x8 and FFT_16x16 algorithms, there seems to be no significant speedup over im2col+sgemm. What should I do in the prototxt besides adding engine: NNPACK? Or what do I need to pay attention to when using the FFT algorithms? Sorry to bother you so many times; I really need your help, thanks.
@wangxi123 In the current implementation of most convolution functions in NNPACK you need a fairly large batch size to get a speedup (at least 128, better 256). Note that this doesn't affect the nnp_convolution_inference
function, which delivers good performance at batch size = 1 when the image size is large.
I'm having trouble reproducing the performance numbers for AlexNet in the NNPACK README.md. I'm using the nnpack-pr branch here, and timing using the
caffe time
invocation as in the convnet-benchmarks scripts. I'm using the prototxt from convnet-benchmarks. I added
engine: NNPACK
to conv2-conv5 and double-checked that NNPACK is being invoked. There are a few open issues: