cdluminate closed this issue 5 years ago
Are you sure that BLIS is compiled for threaded execution?
@jeffhammond Thanks for the hint. Initially I thought threading was enabled by default; however, the actual default is --enable-threading=no. I'll recompile and test again.
Result for pthreads with BLIS_NUM_THREADS=4:
./configure --enable-verbose-make --enable-cblas --enable-threading=pthreads haswell
┌ Warning: Matrix size = 2
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.000007 seconds (5 allocations: 272 bytes)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.000115 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000003 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000003 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 4
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.000001 seconds (5 allocations: 368 bytes)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.000144 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000003 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000003 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 8
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.000001 seconds (5 allocations: 784 bytes)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.000130 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000002 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000003 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 16
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.000020 seconds (5 allocations: 2.281 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.000138 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000003 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000003 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 32
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.000024 seconds (5 allocations: 8.281 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.000484 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000008 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000006 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 64
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.000020 seconds (6 allocations: 32.234 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.056032 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000024 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000010 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 128
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.000056 seconds (6 allocations: 128.234 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.170078 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000098 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000028 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 256
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.001020 seconds (6 allocations: 512.234 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.003420 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000917 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000232 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 512
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.004190 seconds (6 allocations: 2.000 MiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.002998 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.005628 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.001674 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 1024
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.017962 seconds (6 allocations: 8.000 MiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.025300 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.031367 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.015002 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 2048
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.135972 seconds (6 allocations: 32.000 MiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.144874 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.141345 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.133631 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 4096
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
1.087331 seconds (6 allocations: 128.000 MiB, 0.51% gc time)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
1.171964 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
1.187948 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
1.163688 seconds (4 allocations: 160 bytes)
Now BLIS looks comparable to OpenBLAS for large matrices, and the overhead of thread creation for small matrices is obvious.
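To make the large-matrix comparison easier to read, the timings can be converted to a throughput figure. The sketch below assumes the usual operation count of 2n^3 floating-point operations for an n x n x n dgemm, and it uses only the size-4096 timings from the pthreads log above. The helper name `dgemm_gflops` is made up for illustration, and each log entry appears to be a single run, so the results are rough.

```python
def dgemm_gflops(n: int, seconds: float) -> float:
    """Approximate GFLOP/s for an n x n x n dgemm, counting 2*n**3 flops."""
    return 2.0 * n**3 / seconds / 1e9


# Timings in seconds, copied from the pthreads log above (matrix size 4096).
timings_4096 = {
    "Julia": 1.087331,
    "BLIS": 1.171964,
    "OpenBLAS": 1.187948,
    "MKL": 1.163688,
}

# Convert each timing into an approximate throughput.
rates_4096 = {name: dgemm_gflops(4096, t) for name, t in timings_4096.items()}

for name, rate in rates_4096.items():
    print(f"{name:9s} {rate:7.1f} GFLOP/s")
```

At this size all four rates land within roughly ten percent of each other, which is consistent with the "comparable" reading of the log.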
Result for openmp.
./configure --enable-verbose-make --enable-cblas --enable-threading=openmp haswell
┌ Warning: Matrix size = 2
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.000007 seconds (5 allocations: 272 bytes)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.000017 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000002 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000002 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 4
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.000001 seconds (5 allocations: 368 bytes)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.000014 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000003 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000002 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 8
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.000001 seconds (5 allocations: 784 bytes)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.000013 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000002 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000002 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 16
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.000002 seconds (5 allocations: 2.281 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.000016 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000003 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000002 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 32
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.000020 seconds (5 allocations: 8.281 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.000017 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000007 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000005 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 64
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.000030 seconds (6 allocations: 32.234 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.000024 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000022 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000009 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 128
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.000049 seconds (6 allocations: 128.234 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.000053 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000047 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000029 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 256
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.001134 seconds (6 allocations: 512.234 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.000313 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.000316 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.000218 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 512
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.004426 seconds (6 allocations: 2.000 MiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.009258 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.005002 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.010772 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 1024
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.020614 seconds (6 allocations: 8.000 MiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.015663 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.031109 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.015756 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 2048
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.142615 seconds (6 allocations: 32.000 MiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
0.127974 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
0.130686 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
0.122616 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 4096
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
0.992390 seconds (6 allocations: 128.000 MiB, 0.55% gc time)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
1.078581 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
1.101288 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
1.023258 seconds (4 allocations: 160 bytes)
The OpenMP threading model has much lower threading overhead for small matrices.
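The difference can be quantified directly from the two logs. The following sketch copies the BLIS timings for sizes 2 through 256 from the pthreads and OpenMP runs above and prints the ratio between them. The variable names are illustrative, and since each entry appears to be a single timing, the ratios should be read as orders of magnitude rather than precise measurements.

```python
# BLIS dgemm timings in seconds, copied from the two logs above.
blis_pthreads = {
    2: 0.000115, 4: 0.000144, 8: 0.000130, 16: 0.000138,
    32: 0.000484, 64: 0.056032, 128: 0.170078, 256: 0.003420,
}
blis_openmp = {
    2: 0.000017, 4: 0.000014, 8: 0.000013, 16: 0.000016,
    32: 0.000017, 64: 0.000024, 128: 0.000053, 256: 0.000313,
}

# Ratio > 1 means the pthreads build was slower at that size.
ratios = {n: blis_pthreads[n] / blis_openmp[n] for n in blis_pthreads}

for n, ratio in ratios.items():
    print(f"n={n:4d}  pthreads/openmp = {ratio:8.1f}x")
```

The pthreads build is slower at every size in this range, with the largest gaps at sizes 64 and 128.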
@jeffhammond Have I compiled BLIS correctly this time? Is there any way to further improve BLIS's performance, e.g. -march=native?
This looks right to me. OpenMP should have lower overhead than Pthreads because the former uses a thread pool whereas the latter cannot (unless BLIS implements its own thread pool).
BLIS uses hand-written assembly so compiler flags related to code generation should have no effect on functions like DGEMM. You may find that flags related to inlining or link-time optimization help, but I would not expect a significant effect from that.
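The thread-pool point above can be illustrated with a small analogy in Python. This is only a sketch of the general pattern (spawning and joining fresh threads on every call versus reusing long-lived workers). It is not how BLIS or any OpenMP runtime is implemented, and Python's threads behave differently from native ones, so the printed timings say nothing about BLIS itself. All function names here are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def strided_squares(values, start, step):
    """Stand-in for one thread's share of a very small kernel."""
    return {i: values[i] ** 2 for i in range(start, len(values), step)}


def merge(parts, n):
    """Combine the per-thread partial results into one list."""
    out = [0] * n
    for part in parts:
        for i, v in part.items():
            out[i] = v
    return out


def run_spawning(values, num_threads=4):
    """Create and join new threads on every call (no pool)."""
    parts = [None] * num_threads

    def worker(k):
        parts[k] = strided_squares(values, k, num_threads)

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return merge(parts, len(values))


def run_pooled(pool, values, num_threads=4):
    """Hand the same work to workers that already exist."""
    futures = [pool.submit(strided_squares, values, k, num_threads)
               for k in range(num_threads)]
    return merge([f.result() for f in futures], len(values))


def time_calls(fn, repeats=200):
    """Average wall-clock time per call over several repeats."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


values = list(range(16))
spawned = run_spawning(values)
t_spawn = time_calls(lambda: run_spawning(values))

with ThreadPoolExecutor(max_workers=4) as pool:
    pooled = run_pooled(pool, values)
    t_pool = time_calls(lambda: run_pooled(pool, values))

print(f"fresh threads per call: {t_spawn * 1e6:8.1f} us")
print(f"reused pool workers:    {t_pool * 1e6:8.1f} us")
```

Both variants compute the same result; the only difference is whether thread start-up and teardown are paid on every call, which matters most when the work per call is tiny.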
The performance benchmark has been added and I'm satisfied with that result. Maybe we can close this issue now?
Sure thing. Thanks for your patience on this issue.
BTW, the tools that I used to create the new graphs on the Performance page are already included in the BLIS source distribution. They can be found in test/3. Currently, there is no documentation that specifically guides the usage of the tools in this directory, but most curious users can figure out how to use them by reading the Makefile, the runme.sh shell script, and the MATLAB code in the matlab subdirectory. (For now, the code targets MATLAB, but with a little tweaking it can run in GNU Octave as well. Migrating the code more fully to Octave is on my to-do list.)
hello, few more questions:
thanks.
We may support quad precision (IEEE 754 binary128) in the future in BLIS, though I would expect that hardware support would probably be a prerequisite. We do not have any plans to support IEEE 754 decimal formats.
Might this change once binary128 and decimal{32,64,128} become part of the ISO/IEC 9899:202x standard?
@sav-ix Which datatypes are part of ISO/IEC/IEEE standards doesn't have a major impact on what math libraries support. User and developer interest does. If you want to see new datatypes supported, I encourage you to create an issue specific to each, e.g. https://github.com/flame/blis/issues/234. I know there is some interest in developing binary128 support in BLIS already...
@sav-ix Regarding:
have you benchmarked BLAS C++ interfaces (Boost.uBLAS, blaspp, etc.) for use with {BLIS,OpenBLAS}?
This is something that would make a good third-party project. Why not start developing that yourself?
It would be better to provide some script for users to compare the performance between different BLAS implementations.
I wrote one with Julia 1.0, but interestingly BLIS's performance is not as good as I thought...
Result:
System information