perf tuning - Githubissues

fengggli commented 5 years ago

Machine configuration: see https://github.com/fengggli/gpu-computing-materials/issues/58. Note: the default gcc 4.8 doesn't support avx512, so I built gcc 7.3 and am using it with -march=skylake-avx512.
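A quick way to confirm that the -march flag actually took effect is to check the `__AVX512F__` macro, which GCC, Clang and ICC predefine when AVX-512 Foundation is enabled for the target. A minimal sketch (the function names `compiled_with_avx512f` and `report_avx512f` are made up for this note, not part of the project):

```c
#include <stdio.h>

/* Returns 1 if this file was compiled with AVX-512 Foundation enabled
 * (for example via -march=skylake-avx512), 0 otherwise.
 * This is a compile-time property; it says nothing about the CPU
 * the binary later runs on. */
int compiled_with_avx512f(void)
{
#ifdef __AVX512F__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: print the result for a quick sanity check. */
void report_avx512f(void)
{
    printf("AVX-512F enabled at compile time: %s\n",
           compiled_with_avx512f() ? "yes" : "no");
}
```

Calling `report_avx512f()` from a binary built with and without -march=skylake-avx512 should show the difference.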

Testing batch size 256 of resnet8 forward/backward time for 10 iterations

baseline (gnu + -O3 + mkl-sequential): 500 ms
with gnu avx512: 540 ms
with icc avx512: 538 ms

Notes about intel-caffe

Two engines: mkl2017 or mkldnn.

-- Detecting Intel(R) MKL: trying mklml_intel
-- Intel(R) MKL: include /home/lifen/Workspace/caffe/external/mkl/mklml_lnx_2019.0.1.20180928/include
-- Intel(R) MKL: lib /home/lifen/Workspace/caffe/external/mkl/mklml_lnx_2019.0.1.20180928/lib/libmklml_intel.so
-- OpenMP lib: /home/lifen/Workspace/caffe/external/mkl/mklml_lnx_2019.0.1.20180928/lib/libiomp5.so
-- VTune profiling environment is unset
-- Configuring done
-- Generating done
-- Build files have been written to: /home/lifen/Workspace/caffe/external/mkldnn/build
MKLDNN_Build-prefix/src/MKLDNN_Build-stamp/MKLDNN_Build-build-out.log (END)

jit generator using Xbyak to generate vectorized code: external/mkldnn/src/src/cpu/jit_generator.hpp
example of optimized convolution layer: external/mkldnn/src/src/cpu/jit_avx512_common_conv_kernel.cpp
cache blocking: https://arxiv.org/pdf/1602.06709v1.pdf
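As a rough illustration of what cache blocking means, here is a minimal C sketch of a tiled matrix multiply. It is not the scheme from the linked paper or from the mkldnn kernels; the function name `gemm_blocked` and the tile size `BLOCK` are hypothetical, chosen only for this note.

```c
#include <stddef.h>

#define BLOCK 32  /* hypothetical tile size; would need tuning per cache level */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C (M x N) += A (M x K) * B (K x N), all row-major.
 * The three outer loops walk over tiles; the three inner loops work
 * inside one tile, so the tile's operands can stay in cache while
 * they are reused. */
void gemm_blocked(size_t M, size_t N, size_t K,
                  const float *A, const float *B, float *C)
{
    for (size_t i0 = 0; i0 < M; i0 += BLOCK)
        for (size_t k0 = 0; k0 < K; k0 += BLOCK)
            for (size_t j0 = 0; j0 < N; j0 += BLOCK) {
                size_t i1 = min_sz(i0 + BLOCK, M);
                size_t k1 = min_sz(k0 + BLOCK, K);
                size_t j1 = min_sz(j0 + BLOCK, N);
                for (size_t i = i0; i < i1; ++i)
                    for (size_t k = k0; k < k1; ++k) {
                        float a = A[i * K + k];
                        /* Contiguous accesses to B and C: a candidate
                         * for compiler auto-vectorization. */
                        for (size_t j = j0; j < j1; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
            }
}
```

The caller is expected to zero `C` first if a plain product (rather than an accumulation) is wanted.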

refs

Original article
General perf tuning guide for Xeon processors
This book chapter talks about the performance of the two intel-caffe engines (mkl2017, mkldnn)
avx512 guide
Another note: in Intel compilers, automatic vectorization is enabled at the default optimization level -O2, so no additional arguments are needed. In GCC, to enable automatic vectorization, use the additional argument -O3. Additionally, to vectorize transcendental functions with GCC, -ffast-math may be needed.
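To make the auto-vectorization note concrete, here is a minimal sketch of a loop that compilers can usually vectorize on their own. The function name `axpy` is just an illustrative choice, and the compile command in the comment is one possible invocation, not the project's build setup.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i].
 * `restrict` promises the compiler that x and y do not overlap, which
 * removes one common obstacle to auto-vectorization.
 *
 * One way to see whether GCC vectorized the loop (assumed invocation):
 *   gcc -std=c99 -O3 -march=skylake-avx512 -fopt-info-vec -c axpy.c
 * -fopt-info-vec asks GCC to report the loops it vectorized. */
void axpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

A loop that calls something like `expf` on each element is the case where, per the note above, GCC may additionally need -ffast-math before it vectorizes.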

fengggli commented 5 years ago

intel-caffe-thrd-1 (intelcaffe/r001hs)

fengggli commented 5 years ago

intel-caffe-thrd-4 (intelcaffe/r002hs)

fengggli commented 5 years ago

awnn-single-thread (awnn/r008hs)

** topdown tree

fengggli commented 5 years ago

awnn-4-threads (awnn/r009hs)

summary

fengggli commented 5 years ago

After applying a few optimizations (3991d3d, 5b3e389) suggested by the Intel guide, single-thread time dropped from 540 ms to 380 ms.

Now I only need to vectorize im2col and col2im using avx512.

In intel-caffe, this is achieved using Xbyak.
How is im2col implemented? What is the actual convolution method used?
- src/caffe/layers/mkl_convolution_layer.cpp
- jit gemm-based convolution
- jit im2col
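For reference, a plain scalar im2col looks roughly like the sketch below. This is not the awnn or intel-caffe implementation; the function name `im2col_naive`, the C x H x W input layout, the square kernel and the (C*k*k) x (out_h*out_w) output layout are all assumptions made for illustration. It mainly shows which loop a hand-written avx512 version would target.

```c
#include <stddef.h>

/* Scalar im2col for one image stored as C x H x W (row-major).
 * Output `col` has (C*k*k) rows and (out_h*out_w) columns, row-major.
 * Positions that fall into the zero padding are written as 0. */
void im2col_naive(const float *in, int C, int H, int W,
                  int k, int stride, int pad, float *col)
{
    int out_h = (H + 2 * pad - k) / stride + 1;
    int out_w = (W + 2 * pad - k) / stride + 1;

    for (int c = 0; c < C; ++c)
        for (int ky = 0; ky < k; ++ky)
            for (int kx = 0; kx < k; ++kx) {
                size_t row = ((size_t)c * k + ky) * k + kx;
                float *dst = col + row * out_h * out_w;
                for (int oy = 0; oy < out_h; ++oy) {
                    int iy = oy * stride - pad + ky;
                    /* Innermost loop: with stride == 1 the reads from
                     * `in` are contiguous, so this is the natural
                     * candidate for SIMD loads and stores. */
                    for (int ox = 0; ox < out_w; ++ox) {
                        int ix = ox * stride - pad + kx;
                        int inside = iy >= 0 && iy < H && ix >= 0 && ix < W;
                        dst[oy * out_w + ox] =
                            inside ? in[((size_t)c * H + iy) * W + ix] : 0.0f;
                    }
                }
            }
}
```

With the columns laid out this way, the convolution becomes a single GEMM of the (filters x C*k*k) weight matrix with `col`, which is the usual reason for doing im2col at all.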

fengggli / gpu-computing-materials

perf tuning #57

I could try, but I don't know what the benefit of continuing to vectorize the code would be... (what's next?)