ngaloppo opened 8 years ago
Note also that because convnet-benchmarks uses old-style prototxt, the NNPackConvolutionParameter message is not parsed correctly by Caffe (e.g. to set algorithm: FFT_16x16).
If the reported timings are per image (not per batch), then I take back my comment about the results not being reproducible. However, it would be nice to add a note to the README on how to enable NNPACK as per my instructions above: add
engine: NNPACK
to the relevant layers.
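For reference, a minimal sketch of what this looks like in a layer definition (layer name, blob names, and sizes here are illustrative, not taken from the benchmark prototxt):

```protobuf
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  convolution_param {
    engine: NNPACK   # route this layer's forward pass through NNPACK
    num_output: 256
    kernel_size: 5
    pad: 2
  }
}
```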
@ngaloppo The timings are per batch. The network parameters are from the cpu
branch of convnet-benchmarks
. Please see #2 and Maratyszcza/NNPACK#9 for details. The backward pass is not supported and is ignored.
@Maratyszcza thanks for the link to the cpu
branch of convnet-benchmarks
. That's useful.
Regarding the convolution algorithm: did you have to convert the prototxt to the new format to pick anything other than AUTO
(to produce the different columns in the benchmark result table)?
I changed the defaults in caffe.proto and recompiled Caffe for each algorithm.
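A sketch of the kind of change involved; the exact message and field names in the nnpack-pr branch may differ from this hypothetical excerpt:

```protobuf
// Hypothetical excerpt of caffe.proto in the nnpack-pr branch; the enum
// values mirror the algorithms benchmarked above, but the actual names
// in the patch may not match exactly.
message NNPackConvolutionParameter {
  enum Algorithm {
    AUTO = 0;
    WINOGRAD = 1;
    FFT_16x16 = 2;
    FFT_8x8 = 3;
  }
  // Changing this default (e.g. to FFT_16x16) and recompiling Caffe selects
  // the algorithm for all layers, even with old-style prototxt that cannot
  // express per-layer NNPACK parameters.
  optional Algorithm algorithm = 1 [default = AUTO];
}
```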
So this is for an Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz. I'm getting pretty good results, in line with your numbers, except for conv2. I'm not sure if that's related to the difference between i5 and i7, or something else. See below:
output_alexnet_openblas_auto.log
500:I0429 09:44:58.612160 1298 caffe.cpp:369] conv2/5x5_s1 forward: 246.104 ms.
506:I0429 09:44:58.612181 1298 caffe.cpp:369] conv3/3x3_s1 forward: 132.718 ms.
510:I0429 09:44:58.612195 1298 caffe.cpp:369] conv4/3x3_s1 forward: 167.838 ms.
514:I0429 09:44:58.612208 1298 caffe.cpp:369] conv5/3x3_s1 forward: 116.058 ms.
output_alexnet_openblas_fft_8x8.log
500:I0429 09:44:42.281219 1285 caffe.cpp:369] conv2/5x5_s1 forward: 381.438 ms.
506:I0429 09:44:42.281240 1285 caffe.cpp:369] conv3/3x3_s1 forward: 298.638 ms.
510:I0429 09:44:42.281255 1285 caffe.cpp:369] conv4/3x3_s1 forward: 376.771 ms.
514:I0429 09:44:42.281268 1285 caffe.cpp:369] conv5/3x3_s1 forward: 268.936 ms.
output_alexnet_mkl_fft_16x16.log
500:I0429 09:43:10.256397 1210 caffe.cpp:369] conv2/5x5_s1 forward: 218.977 ms.
506:I0429 09:43:10.256419 1210 caffe.cpp:369] conv3/3x3_s1 forward: 50.3675 ms.
510:I0429 09:43:10.256433 1210 caffe.cpp:369] conv4/3x3_s1 forward: 61.5275 ms.
514:I0429 09:43:10.256446 1210 caffe.cpp:369] conv5/3x3_s1 forward: 42.647 ms.
output_alexnet_openblas_fft_16x16.log
500:I0429 09:44:18.571959 1272 caffe.cpp:369] conv2/5x5_s1 forward: 245.449 ms.
506:I0429 09:44:18.571979 1272 caffe.cpp:369] conv3/3x3_s1 forward: 131.85 ms.
510:I0429 09:44:18.571992 1272 caffe.cpp:369] conv4/3x3_s1 forward: 168.14 ms.
514:I0429 09:44:18.572006 1272 caffe.cpp:369] conv5/3x3_s1 forward: 115.662 ms.
output_alexnet_mkl_auto.log
500:I0429 09:43:35.835698 1243 caffe.cpp:369] conv2/5x5_s1 forward: 191.967 ms.
506:I0429 09:43:35.835721 1243 caffe.cpp:369] conv3/3x3_s1 forward: 48.5533 ms.
510:I0429 09:43:35.835734 1243 caffe.cpp:369] conv4/3x3_s1 forward: 60.2299 ms.
514:I0429 09:43:35.835747 1243 caffe.cpp:369] conv5/3x3_s1 forward: 44.0158 ms.
output_alexnet_openblas_vanilla.log
440:I0429 09:44:02.130417 1257 caffe.cpp:369] conv2/5x5_s1 forward: 346.287 ms.
446:I0429 09:44:02.130439 1257 caffe.cpp:369] conv3/3x3_s1 forward: 176.933 ms.
450:I0429 09:44:02.130452 1257 caffe.cpp:369] conv4/3x3_s1 forward: 257.505 ms.
454:I0429 09:44:02.130465 1257 caffe.cpp:369] conv5/3x3_s1 forward: 177.364 ms.
output_alexnet_mkl_fft_8x8.log
500:I0429 09:43:24.372515 1227 caffe.cpp:369] conv2/5x5_s1 forward: 266.223 ms.
506:I0429 09:43:24.372536 1227 caffe.cpp:369] conv3/3x3_s1 forward: 100.112 ms.
510:I0429 09:43:24.372550 1227 caffe.cpp:369] conv4/3x3_s1 forward: 125.635 ms.
514:I0429 09:43:24.372565 1227 caffe.cpp:369] conv5/3x3_s1 forward: 87.0149 ms.
output_alexnet_mkl_vanilla.log
440:I0429 09:42:58.014567 1187 caffe.cpp:369] conv2/5x5_s1 forward: 331.863 ms.
446:I0429 09:42:58.014590 1187 caffe.cpp:369] conv3/3x3_s1 forward: 118.221 ms.
450:I0429 09:42:58.014605 1187 caffe.cpp:369] conv4/3x3_s1 forward: 198.266 ms.
454:I0429 09:42:58.014619 1187 caffe.cpp:369] conv5/3x3_s1 forward: 126.541 ms.
@ngaloppo Do you use the prototxt from convnet-benchmarks
? Specifications from other sources (e.g. the Caffe model zoo) may have different image sizes or numbers of channels in the hidden layers.
@Maratyszcza Yes, from the cpu
branch. I did convert those to the new prototxt format (using tools/upgrade_net_proto_text
) so that I could change the convolution algorithm without rebuilding, but that shouldn't have caused any topological changes.
How many threads are running here? Can you control the number of threads for NNPACK? Even when I set OMP_NUM_THREADS to 1, I can see multiple threads running in parallel in htop.
@anijain2305 NNPACK uses OMP_NUM_THREADS
threads if the variable is set, or all logical CPUs if it is not specified.
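So a single-threaded run would look something like this (the commented-out caffe time invocation is illustrative of the convnet-benchmarks usage, not an exact path):

```shell
# NNPACK sizes its thread pool from OMP_NUM_THREADS when the variable is set;
# when it is unset, NNPACK uses all logical CPUs. Export it before launching:
export OMP_NUM_THREADS=1
# ./build/tools/caffe time --model=alexnet.prototxt --iterations=10
echo "benchmarking with OMP_NUM_THREADS=$OMP_NUM_THREADS"
```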
Yes, I also got results similar to what @ngaloppo reported.
OpenBLAS + FFT_16x16 on an i7 machine:
I0731 21:06:06.510483 12768 caffe.cpp:369] conv2/5x5_s1 forward: 311.694 ms.
I0731 21:06:06.510542 12768 caffe.cpp:369] conv3/3x3_s1 forward: 162.253 ms.
I0731 21:06:06.510582 12768 caffe.cpp:369] conv4/3x3_s1 forward: 452.496 ms.
I0731 21:06:06.510622 12768 caffe.cpp:369] conv5/3x3_s1 forward: 169.524 ms.
@Maratyszcza Hi, unfortunately I also cannot reproduce the results in the NNPACK README.md on my i7-4720HQ machine. I used the --enable-psimd configuration and compiled the latest NNPACK version. For timing I chose nnpack-pr and modified a few lines of code to fit the new NNPACK interface. But when I add engine: NNPACK inside conv_param, the relevant convolution layers actually become slower (backward is very fast because it is not implemented). I have tried a few things but still can't solve this problem; looking forward to your help, thanks. (I use the prototxt from the cpu branch of convnet-benchmarks directly, and the caffe time command for timing.)
@wangxi123 If you want to reproduce the results from the README, don't use the --enable-psimd
option.
@Maratyszcza Well, I just want to measure the conv speedup compared to not adding engine: NNPACK in conv_param, but it seems I don't get any speedup; I don't know whether I left out some necessary steps. That's OK, I will try again on another machine that has AVX2 instructions. Should I use the latest NNPACK with nnpack-pr? I hope you can recommend an NNPACK version for me.
@wangxi123 When you add engine: NNPACK
, Caffe will use the NNPACK implementation. If NNPACK is configured with --enable-psimd
, it will be a generic small-SIMD implementation using SSE2. If you configure NNPACK without the --enable-psimd
option, it will use the assembly implementation for the AVX2 instruction set.
@Maratyszcza I'm pleased to see some speedup (~1.3x) on my machine with the AVX2 instruction set. But when I change the algorithm default in proto/caffe.proto and recompile Caffe, there seems to be little difference between the AUTO and FFT_16x16 options; I'm confused. What's more, when I run with the WINOGRAD option, Caffe crashes and I get the message Check failed: nnp_status_success == status (0 vs. 26). Is that expected? Thank you for your patience.
@wangxi123 The WINOGRAD
algorithm is implemented only for 3x3 kernels. AUTO
will choose an algorithm automatically, among FFT, Winograd transform, and implicit GEMM.
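For instance, a layer that the WINOGRAD path can accept must use a 3x3 kernel; a sketch (layer name, blob names, and channel counts are illustrative):

```protobuf
layer {
  name: "conv2_3x3"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2_3x3"
  convolution_param {
    engine: NNPACK
    num_output: 256
    kernel_size: 3   # WINOGRAD supports only 3x3 kernels; a 5x5 kernel
    pad: 1           # makes NNPACK return an error status instead
    stride: 1
  }
}
```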
@Maratyszcza Yes, got it. I modified conv2 to use 3x3 kernels with pad 1 to test the WINOGRAD algorithm, and I get ~1.6x speedup on AlexNet and ~2.2x speedup on Overfeat for conv2-conv5; it's amazing. However, when I use the same prototxt to test the FFT_8x8 and FFT_16x16 algorithms, there seems to be no significant speedup over im2col+sgemm. What should I do in the prototxt besides adding engine: NNPACK? Or what do I need to pay attention to when using the FFT algorithms? Sorry to bother you so many times; I really need your help, thanks.
@wangxi123 In the current implementation of most convolution functions in NNPACK you need a fairly large batch size to get a speedup (at least 128, better 256). Note that this doesn't affect the nnp_convolution_inference
function, which delivers good performance at batch size = 1 when the image size is large.
I'm having trouble reproducing the performance numbers for AlexNet in the NNPACK README.md. I'm using the nnpack-pr branch here, and timing using the
caffe time
invocation as in the convnet-benchmarks scripts. I'm using the prototxt from convnet-benchmarks. I added
engine: NNPACK
to conv2-conv5 and double-checked that NNPACK is being invoked. There are a few open issues: