flame / blis

BLAS-like Library Instantiation Software Framework

Benchmarking and Performance #255

Closed cdluminate closed 5 years ago

cdluminate commented 6 years ago

It would be helpful to provide a script that lets users compare the performance of different BLAS implementations.

I wrote one with Julia 1.0, but interestingly, BLIS's performance is not as good as I expected...

#!/usr/bin/julia-1.0
# Compare performance among different BLAS implementations
# Copyright (C) 2018 Mo Zhou <lumin@debian.org>, MIT/Expat License.
# Reference: Julia/stdlib/LinearAlgebra/src/blas.jl
using LinearAlgebra
using Logging

const N1 = 65536  # N for level1 calls
const N3 = 4096   # N for level3 calls
const netlibblas = "/usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0"
const atlas      = "/usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3"
const openblas   = "/usr/lib/x86_64-linux-gnu/openblas/libblas.so.3"
const mkl        = "/usr/lib/x86_64-linux-gnu/libmkl_rt.so"
const blis       = "/home/lumin/git/blis/lib/haswell/libblis.so"

BLASES = [blis, openblas, mkl]

for N3 in 2 .^ [1:12...]
    julia_dgemm = false
    @warn("Matrix size = $N3")
    for libblas in BLASES
        @eval begin
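            function ffi_gemm!(transA::Char, transB::Char,
                               alpha::Float64, A::AbstractVecOrMat{Float64},
                               B::AbstractVecOrMat{Float64}, beta::Float64,
                               C::AbstractVecOrMat{Float64})
                m = size(A, transA == 'N' ? 1 : 2)
                ka = size(A, transA == 'N' ? 2 : 1)
                kb = size(B, transB == 'N' ? 1 : 2)
                n = size(B, transB == 'N' ? 2 : 1)
                # Reject mismatched operands before they reach the Fortran routine.
                if ka != kb || m != size(C, 1) || n != size(C, 2)
                    throw(DimensionMismatch("A has size ($m,$ka), B has size ($kb,$n), C has size $(size(C))"))
                end
                # Integer arguments are passed as Int64. A library built with
                # 32-bit integers (LP64) reads only the low 32 bits, which
                # still gives the right value on little-endian machines.
                ccall((:dgemm_, $libblas), Cvoid,
                    (Ref{UInt8}, Ref{UInt8}, Ref{Int64}, Ref{Int64},
                     Ref{Int64}, Ref{Float64}, Ptr{Float64}, Ref{Int64},
                     Ptr{Float64}, Ref{Int64}, Ref{Float64}, Ptr{Float64},
                     Ref{Int64}),
                     transA, transB, m, n,
                     ka, alpha, A, max(1,stride(A,2)),
                     B, max(1,stride(B,2)), beta, C,
                     max(1,stride(C,2)))
                return C
            end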
        end
        x, y, z = rand(N3, N3), rand(N3, N3), zeros(N3, N3)
        if !julia_dgemm
            @info("dgemm Julia")
            BLAS.gemm('N', 'N', x, y)  # JIT
            @time BLAS.gemm('N', 'N', x, y)
            julia_dgemm = true
        end
        @info("dgemm $libblas")
        ffi_gemm!('N', 'N', 1., x, y, 0., z)  # JIT
        @time ffi_gemm!('N', 'N', 1., x, y, 0., z)

        z2 = BLAS.gemm('N', 'N', x, y)
        ffi_gemm!('N', 'N', 1., x, y, 0., z)
        err = norm(z2 - z)  # avoid shadowing Base.error
        if err > 1e-7
            @warn("dgemm Error : $err")  # correctness
        end
    end
end
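
The correctness check at the end of the script compares an absolute norm against 1e-7. The absolute difference tends to grow with the matrix size, so a fixed absolute threshold means different things at N = 2 and N = 4096. A relative error is one possible alternative. The sketch below is only an illustration: `relative_error` is a hypothetical helper, the 1e-12 tolerance is an arbitrary choice, and Julia's built-in BLAS stands in for the external libraries.

```julia
using LinearAlgebra

# Relative Frobenius-norm difference between a result and a reference.
# Falls back to the absolute difference when the reference is all zeros.
function relative_error(result::AbstractMatrix, reference::AbstractMatrix)
    size(result) == size(reference) ||
        throw(DimensionMismatch("result and reference differ in size"))
    denom = norm(reference)
    diff = norm(result - reference)
    return denom == 0 ? diff : diff / denom
end

# Usage: Julia's default BLAS plays both roles here (assumption for the demo).
x, y = rand(64, 64), rand(64, 64)
reference = BLAS.gemm('N', 'N', x, y)
candidate = x * y
relerr = relative_error(candidate, reference)
relerr > 1e-12 && @warn("dgemm relative error: $relerr")
```

In the benchmark loop, `candidate` would be the `z` filled by `ffi_gemm!` and `reference` would be `z2`.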

Result:

┌ Warning: Matrix size = 2
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/blascomp.jl:19
[ Info: dgemm Julia
  0.000006 seconds (5 allocations: 272 bytes)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000008 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000003 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000002 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 4
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/blascomp.jl:19
[ Info: dgemm Julia
  0.000001 seconds (5 allocations: 368 bytes)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000006 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000003 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000003 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 8
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/blascomp.jl:19
[ Info: dgemm Julia
  0.000001 seconds (5 allocations: 784 bytes)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000006 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000002 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000002 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 16
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/blascomp.jl:19
[ Info: dgemm Julia
  0.000002 seconds (5 allocations: 2.281 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000007 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000003 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000003 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 32
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/blascomp.jl:19
[ Info: dgemm Julia
  0.000009 seconds (5 allocations: 8.281 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000010 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000007 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000007 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 64
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/blascomp.jl:19
[ Info: dgemm Julia
  0.000012 seconds (6 allocations: 32.234 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000024 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000022 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000009 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 128
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/blascomp.jl:19
[ Info: dgemm Julia
  0.000047 seconds (6 allocations: 128.234 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000115 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000094 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000026 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 256
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/blascomp.jl:19
[ Info: dgemm Julia
  0.000590 seconds (6 allocations: 512.234 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000767 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000843 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000211 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 512
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/blascomp.jl:19
[ Info: dgemm Julia
  0.001751 seconds (6 allocations: 2.000 MiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.005836 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.005063 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.001578 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 1024
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/blascomp.jl:19
[ Info: dgemm Julia
  0.014705 seconds (6 allocations: 8.000 MiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.046483 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.025002 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.014890 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 2048
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/blascomp.jl:19
[ Info: dgemm Julia
  0.110894 seconds (6 allocations: 32.000 MiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.357563 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.124215 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.111594 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 4096
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/blascomp.jl:19
[ Info: dgemm Julia
  0.915499 seconds (6 allocations: 128.000 MiB, 0.59% gc time)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  2.829461 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  1.068244 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.972810 seconds (4 allocations: 160 bytes)
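
Raw seconds are hard to compare across sizes. A square dgemm of order N performs about 2N^3 floating-point operations, so each timing can be converted to GFLOP/s. The sketch below does this for the N = 4096 rows above. `gemm_flops` and `gflops` are hypothetical helper names, and the timings are copied from this run, in which BLIS was built without threading.

```julia
# Floating-point operations for C = A*B with A (m-by-k) and B (k-by-n):
# each of the m*n outputs needs k multiplies and k adds.
gemm_flops(m, n, k) = 2.0 * m * n * k

# GFLOP/s for a square problem of order n that took `seconds`.
gflops(n, seconds) = gemm_flops(n, n, n) / seconds / 1e9

# Timings copied from the N = 4096 rows of the output above.
timings = [
    "Julia (MKL)" => 0.915499,
    "BLIS"        => 2.829461,
    "OpenBLAS"    => 1.068244,
    "MKL"         => 0.972810,
]

for (name, seconds) in timings
    println(rpad(name, 12), round(gflops(4096, seconds), digits = 1), " GFLOP/s")
end
```

Dividing by the number of cores in use gives a per-core rate, which makes a single-threaded build easier to spot.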

System information

Julia 1.0 compiled locally, with MKL as the default BLAS/LAPACK backend
CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i5-7440HQ CPU @ 2.80GHz
Stepping:            9
CPU MHz:             973.043
CPU max MHz:         3800.0000
CPU min MHz:         800.0000
BogoMIPS:            5616.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
jeffhammond commented 6 years ago

Are you sure that BLIS is compiled for threaded execution?

cdluminate commented 6 years ago

@jeffhammond Thanks for the hint. I initially thought threading was enabled by default, but the actual default is --enable-threading=no. I'll recompile and test again.
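
Before comparing libraries it can help to log the threading-related environment variables, since each library reads its own. The sketch below prints the usual ones from within Julia. `thread_settings` is a hypothetical helper, and the list of variable names is an assumption about which libraries are in play. It only reports the environment: it cannot tell whether a library was actually configured with threading support.

```julia
# Return name => value pairs for environment variables that commonly
# control BLAS threading; missing variables are reported as "(unset)".
function thread_settings(env = ENV)
    vars = ("BLIS_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "MKL_NUM_THREADS", "OMP_NUM_THREADS")
    return [name => get(env, name, "(unset)") for name in vars]
end

# Log the current settings before running any benchmark.
for (name, value) in thread_settings()
    @info("$name = $value")
end
```

Passing a `Dict` instead of `ENV` makes the helper easy to exercise without touching the real environment.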

cdluminate commented 6 years ago

Results for pthreads with BLIS_NUM_THREADS=4.

./configure --enable-verbose-make --enable-cblas --enable-threading=pthreads haswell

┌ Warning: Matrix size = 2
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.000007 seconds (5 allocations: 272 bytes)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000115 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000003 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000003 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 4
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.000001 seconds (5 allocations: 368 bytes)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000144 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000003 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000003 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 8
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.000001 seconds (5 allocations: 784 bytes)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000130 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000002 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000003 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 16
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.000020 seconds (5 allocations: 2.281 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000138 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000003 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000003 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 32
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.000024 seconds (5 allocations: 8.281 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000484 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000008 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000006 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 64
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.000020 seconds (6 allocations: 32.234 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.056032 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000024 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000010 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 128
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.000056 seconds (6 allocations: 128.234 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.170078 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000098 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000028 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 256
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.001020 seconds (6 allocations: 512.234 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.003420 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000917 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000232 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 512
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.004190 seconds (6 allocations: 2.000 MiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.002998 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.005628 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.001674 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 1024
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.017962 seconds (6 allocations: 8.000 MiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.025300 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.031367 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.015002 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 2048
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.135972 seconds (6 allocations: 32.000 MiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.144874 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.141345 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.133631 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 4096
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  1.087331 seconds (6 allocations: 128.000 MiB, 0.51% gc time)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  1.171964 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  1.187948 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  1.163688 seconds (4 allocations: 160 bytes)

Now BLIS looks comparable to OpenBLAS, and the overhead of thread creation for small matrices is obvious.

Results for OpenMP.

./configure --enable-verbose-make --enable-cblas --enable-threading=openmp haswell

┌ Warning: Matrix size = 2
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.000007 seconds (5 allocations: 272 bytes)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000017 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000002 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000002 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 4
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.000001 seconds (5 allocations: 368 bytes)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000014 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000003 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000002 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 8
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.000001 seconds (5 allocations: 784 bytes)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000013 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000002 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000002 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 16
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.000002 seconds (5 allocations: 2.281 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000016 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000003 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000002 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 32
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.000020 seconds (5 allocations: 8.281 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000017 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000007 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000005 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 64
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.000030 seconds (6 allocations: 32.234 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000024 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000022 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000009 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 128
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.000049 seconds (6 allocations: 128.234 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000053 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000047 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000029 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 256
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.001134 seconds (6 allocations: 512.234 KiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.000313 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.000316 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.000218 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 512
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.004426 seconds (6 allocations: 2.000 MiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.009258 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.005002 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.010772 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 1024
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.020614 seconds (6 allocations: 8.000 MiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.015663 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.031109 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.015756 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 2048
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.142615 seconds (6 allocations: 32.000 MiB)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  0.127974 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  0.130686 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  0.122616 seconds (4 allocations: 160 bytes)
┌ Warning: Matrix size = 4096
└ @ Main ~/Debian/intel-mkl.pkg/intel-mkl/debian/tests/dgemmcomp.jl:21
[ Info: dgemm Julia
  0.992390 seconds (6 allocations: 128.000 MiB, 0.55% gc time)
[ Info: dgemm /home/lumin/git/blis/lib/haswell/libblis.so
  1.078581 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
  1.101288 seconds (4 allocations: 160 bytes)
[ Info: dgemm /usr/lib/x86_64-linux-gnu/libmkl_rt.so
  1.023258 seconds (4 allocations: 160 bytes)

The OpenMP threading model has less threading overhead for small matrices.
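
Each figure above comes from a single `@time` call after one warm-up, so the small-matrix numbers are noisy. One common way to steady them is to repeat the call and keep the fastest run. Per-call costs such as thread creation are still included, because they are paid on every repetition. In this sketch `best_time` is a hypothetical helper, and `mul!` on Julia's default BLAS stands in for the `ffi_gemm!` wrapper.

```julia
using LinearAlgebra

# Run `f` once to trigger compilation, then time it `reps` times and
# return the fastest run in seconds.
function best_time(f; reps::Integer = 5)
    reps >= 1 || throw(ArgumentError("reps must be at least 1"))
    f()  # warm-up (JIT)
    best = Inf
    for _ in 1:reps
        t = @elapsed f()
        best = min(best, t)
    end
    return best
end

# Usage: time an in-place 128-by-128 multiply on the default BLAS.
n = 128
x, y, z = rand(n, n), rand(n, n), zeros(n, n)
t = best_time(() -> mul!(z, x, y); reps = 10)
println("n = $n: best of 10 = $(t) s")
```

The minimum is used instead of the mean because timing noise on an otherwise idle machine only ever makes a run slower.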

@jeffhammond Have I compiled BLIS correctly this time? Is there any way to further improve BLIS's performance, e.g. -march=native?

jeffhammond commented 6 years ago

This looks right to me. OpenMP should have lower overhead than Pthreads because the former uses a thread pool whereas the latter cannot (unless BLIS implements its own thread pool).

BLIS uses hand-written assembly so compiler flags related to code generation should have no effect on functions like DGEMM. You may find that flags related to inlining or link-time optimization help, but I would not expect a significant effect from that.

cdluminate commented 5 years ago

The performance benchmark has been added and I'm satisfied with that result. Maybe we can close this issue now?

fgvanzee commented 5 years ago

Sure thing. Thanks for your patience on this issue.

BTW, the tools that I used to create the new graphs on the Performance page are already included in the BLIS source distribution. They can be found in test/3. Currently, there is no documentation that specifically guides the usage of the tools in this directory, but most curious users can figure out how to use them by reading the Makefile, the runme.sh shell script, and the MATLAB code in the matlab subdirectory. (For now, the code targets MATLAB, but with a little tweaking it can run in GNU Octave as well. Migrating the code more fully to Octave is on my to-do list.)

sav-ix commented 4 years ago

Hello, a few more questions:

  1. Have you benchmarked BLAS C++ interfaces (Boost.uBLAS, blaspp, etc.) for use with {BLIS,OpenBLAS}?
  2. Have you benchmarked LAPACK C++ interfaces (CPPLapack, Lapack++, LAPACK++, etc.) for use with {LIBFLAME,LAPACK}?
  3. Are native C++ interfaces expected within the {BLIS,LIBFLAME} packages? If not, what is the current solution for using additional floating-point formats (IEEE 754 binary{128,256}, decimal{64,128}, or user-defined) within {BLIS,LIBFLAME}?

thanks.

fgvanzee commented 4 years ago

  1. We have not measured the performance of uBLAS or blaspp. Those would be interesting to add to our performance graphs someday.
  2. We have not had the time or resources lately to do all of the work we would like, particularly at the LAPACK/libflame level. This includes performance measurements.
  3. We do not have plans to provide C++ interfaces within BLIS or libflame. However, we may develop companion projects that do so. TBLIS, for example, implements a tensor framework library that has C++ bindings. We may support quad precision (IEEE 754 binary128) in the future in BLIS, though I would expect that hardware support would probably be a prerequisite. Binary16 and/or bfloat16 in BLIS are currently being worked on, though we don't have a timetable yet. We do not have any plans to support IEEE 754 decimal formats.

sav-ix commented 4 years ago

We may support quad precision (IEEE 754 binary128) in the future in BLIS, though I would expect that hardware support would probably be a prerequisite. We do not have any plans to support IEEE 754 decimal formats

Might this change once binary128 and decimal{32,64,128} become part of the ISO/IEC 9899:202x standard?

jeffhammond commented 4 years ago

@sav-ix Which datatypes are part of ISO/IEC/IEEE standards doesn't have a major impact on what math libraries support. User and developer interest does. If you want to see new datatypes supported, I encourage you to create an issue specific to each, e.g. https://github.com/flame/blis/issues/234. I know there is some interest in developing binary128 support in BLIS already...

jeffhammond commented 4 years ago

@sav-ix Regarding:

have you benchmarked BLAS C++ interfaces (Boost.uBLAS, blaspp, etc.) for use with {BLIS,OpenBLAS}?

This is something that would make a good third-party project. Why not start developing that yourself?