BlueBrain / nmodl

Code Generation Framework For NEURON MODeling Language
https://bluebrain.github.io/nmodl/
Apache License 2.0

Benchmarking / performance measurement of LLVM backend on Intel CPUs : Part I #613

Open pramodk opened 3 years ago

pramodk commented 3 years ago

As part of this ticket, we are going to benchmark the LLVM code generation backend with different configurations. Here are some practical considerations:

@georgemitenkov : I have assigned this to myself temporarily as I am going to do simple cross-checks of the performance numbers with the recently added --veclib SVML option.

pramodk commented 3 years ago

Just to update here @georgemitenkov : I have tested a few small examples and SSE vs AVX2 comparisons locally. But for detailed analysis, I will wait for #611 ( / #612) so that assembly and performance metrics can be analysed in detail.

georgemitenkov commented 3 years ago

Great! I had an exam yesterday so Monday/Tuesday were a bit out for me. I have started looking at the debug info, so hopefully this one should be ready soonish (~Thursday).

Regarding assembly verification: ideally, do we want to dump it to the log file, so that the structure is:

====== start
====== JIT part
====== end

What do you think? @pramodk

pramodk commented 3 years ago

> Great! I had an exam yesterday so Monday/Tuesday were a bit out for me.

Oh ok! Np!

> What do you think? @pramodk

Yup, the above part LGTM!

castigli commented 3 years ago

Just as an initial reference, below is a summary of current timings on x86_64. For each log file, the first line is the JIT kernel and the second line is the external kernel (note that there is some overhead from the JIT calling mechanism). The JIT options are:

--fmf nnan contract afn --vector-width 8 --veclib SVML benchmark \
--opt-level-ir 3 --opt-level-codegen 3 --run --instance-size 100000000 \
--repeat 10
compute-bound_clang_-O3-march=skylake-avx512-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.322915
compute-bound_clang_-O3-march=skylake-avx512-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.419407
compute-bound_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.344690
compute-bound_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.423696
compute-bound_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.350585
compute-bound_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.319667
compute-bound_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.347119
compute-bound_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.320830
compute-bound_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.323365
compute-bound_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.317312
compute-bound_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.347382
compute-bound_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.629991
hh_clang_-O3-march=skylake-avx512-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.659959
hh_clang_-O3-march=skylake-avx512-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 10.597442
hh_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.639105
hh_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 2.132582
hh_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.635455
hh_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.510965
hh_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.634934
hh_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 10.587418
hh_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.610168
hh_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 12.130137
hh_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 1.634898
hh_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 3.086421
hh_gcc_-O3-mavx2-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 1.610445
hh_gcc_-O3-mavx2-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 10.701414
hh_gcc_-O3-mavx512f-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 1.614212
hh_gcc_-O3-mavx512f-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 10.897828
hh_gcc_-O3-msse2-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 1.611068
hh_gcc_-O3-msse2-ffast-math-ftree-vectorize-mveclibabi=svml.log:[NMODL] [info] :: Average compute time = 11.025482
hh_icpc_-O2-march=skylake-avx512-mtune=skylake-avx512-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.622493
hh_icpc_-O2-march=skylake-avx512-mtune=skylake-avx512-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.913908
hh_icpc_-O2-mavx2-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.792381
hh_icpc_-O2-mavx2-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.908091
hh_icpc_-O2-mavx512f-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.794239
hh_icpc_-O2-mavx512f-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.576430
hh_icpc_-O2-msse2-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.792621
hh_icpc_-O2-msse2-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 3.003994
hh_icpc_-O2-qopt-zmm-usage=high-xCORE-AVX512-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.612436
hh_icpc_-O2-qopt-zmm-usage=high-xCORE-AVX512-prec-div-fimf-use-svml.log:[NMODL] [info] :: Average compute time = 1.750384
memory-bound_clang_-O3-march=skylake-avx512-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.402982
memory-bound_clang_-O3-march=skylake-avx512-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.404010
memory-bound_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.402691
memory-bound_clang_-O3-mavx2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.403016
memory-bound_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.402822
memory-bound_clang_-O3-mavx512f-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.403130
memory-bound_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.402736
memory-bound_clang_-O3-mavx512f-ffast-math-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.403115
memory-bound_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.405940
memory-bound_clang_-O3-mavx512f-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.406087
memory-bound_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.403234
memory-bound_clang_-O3-msse2-ffast-math-fopenmp-fveclib=SVML.log:[NMODL] [info] :: Average compute time = 0.404857
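Since the JIT and external-kernel timings come interleaved per log file, pairing them up by hand is error-prone. A minimal parsing sketch (the regex and helper names here are mine, not part of the nmodl repo):

```python
import re
from collections import defaultdict

# Each log line looks like:
#   <config>.log:[NMODL] [info] :: Average compute time = <seconds>
# Per config, the first time is the JIT kernel, the second the external kernel.
LINE_RE = re.compile(r"(?P<cfg>.+?\.log):.*Average compute time = (?P<t>[0-9.]+)")

def parse_timings(lines):
    """Group compute times by configuration, preserving line order."""
    times = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            times[m.group("cfg")].append(float(m.group("t")))
    return times

def jit_vs_external(times):
    """Return {config: (jit, external, external/jit)} for paired entries."""
    out = {}
    for cfg, ts in times.items():
        if len(ts) == 2:
            jit, ext = ts
            out[cfg] = (jit, ext, ext / jit)
    return out
```

This makes outliers such as the ~10s external-kernel times in the hh logs easy to spot as large external/JIT ratios.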
georgemitenkov commented 3 years ago

Thanks @castigli ! Any specific reason we use only nnan contract afn and not fast for the fast-math flags?

castigli commented 3 years ago

No, except that I forgot to add it! I will re-run the tests.

georgemitenkov commented 3 years ago

@pramodk @castigli @iomaganaris

The current configuration would be as follows, with [..] indicating a test parameter:

llvm --ir [--fmf fast] [--assume-may-alias] [--single-precision] --vector-width [W] --veclib [LIB] --opt-level-ir 3 \
benchmark --run --instance-size [S] --repeat [R] --opt-level-codegen 3 --cpu [cpu name or default] --libs [...]
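To enumerate the full test matrix, something like the sketch below could generate one command line per combination. The flag spellings follow the command above, but the concrete value sets (widths, veclibs, sizes) are placeholders I made up; the real values are still to be decided in this thread:

```python
from itertools import product

# Placeholder value sets; the actual sweep values are TBD.
FMFS = ("", "--fmf fast")
ALIAS = ("", "--assume-may-alias")
PRECISION = ("", "--single-precision")
WIDTHS = (2, 4, 8)
VECLIBS = ("SVML",)

def sweep_commands(instance_size=100000000, repeat=10):
    """Yield one nmodl command line per point in the test matrix."""
    for fmf, alias, prec, w, lib in product(FMFS, ALIAS, PRECISION,
                                            WIDTHS, VECLIBS):
        parts = [
            "llvm", "--ir", fmf, alias, prec,
            f"--vector-width {w}", f"--veclib {lib}", "--opt-level-ir 3",
            "benchmark", "--run", f"--instance-size {instance_size}",
            f"--repeat {repeat}", "--opt-level-codegen 3",
        ]
        # Drop the options that are switched off in this combination.
        yield " ".join(p for p in parts if p)
```

With the placeholder sets above this is 2 x 2 x 2 x 3 x 1 = 24 runs per mod file, which is a useful sanity check on how long a full sweep would take.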

For CPU names, we can use any that Clang supports. We also want to see the effect of aliasing, and see how performance for floats differs (float => 32 bits => vector width is greater => maybe more scatter/gather overhead).
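On the single-precision point, the lane-count arithmetic is just register width divided by element width. A throwaway sketch of why --single-precision doubles the effective vector width:

```python
def lanes(register_bits, element_bits):
    """Number of SIMD lanes for one element type in one vector register."""
    return register_bits // element_bits

# A 512-bit AVX-512 register holds 8 doubles but 16 floats, so at a
# fixed register width the float vector width is twice the double one.
for isa, bits in [("SSE2", 128), ("AVX2", 256), ("AVX-512", 512)]:
    print(f"{isa}: {lanes(bits, 64)} doubles, {lanes(bits, 32)} floats")
```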