JuliaLinearAlgebra / libblastrampoline

Using PLT trampolines to provide a BLAS and LAPACK demuxing library.
MIT License

Missing dot(ComplexF32, ComplexF32) for LBT4 + MKL #56

Closed ViralBShah closed 2 years ago

ViralBShah commented 2 years ago

Using the vs/lp64 branch of MKL.jl

julia> using LinearAlgebra

julia> x = rand(ComplexF32, 10); dot(x,x)
6.305651f0 + 0.0f0im

julia> using MKL

julia> x = rand(ComplexF32, 10); dot(x,x)
Error: no BLAS/LAPACK library loaded!
1.0f-45 + 4.5916f-41im

julia> BLAS.dotc(x,x)
Error: no BLAS/LAPACK library loaded!
2.5705f-41 + 0.0f0im
ViralBShah commented 2 years ago

Perhaps this is because cblas_cdotc_sub, cblas_zdotc_sub, cblas_zdotu_sub, and cblas_cdotu_sub do not have 64_ suffixes in MKL?

https://github.com/JuliaLang/julia/blob/1db8b8f160786c0ce23aed1fa865301fb9973329/stdlib/LinearAlgebra/src/blas.jl#L327
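One way to check that hypothesis directly is to ask the dynamic linker whether the suffixed symbols are actually exported by the loaded library. A minimal sketch (the `64_`-suffixed symbol names are an assumption based on LBT's ILP64 mangling convention; `BLAS.get_config()` requires Julia ≥ 1.7):

```julia
# Sketch: probe each loaded BLAS library for the ILP64-suffixed CBLAS
# complex-dot symbols that blas.jl expects to find.
using Libdl, LinearAlgebra

function has_symbol(libpath::AbstractString, sym::Symbol)
    handle = Libdl.dlopen(libpath)
    try
        # dlsym returns `nothing` (instead of throwing) when the symbol is absent
        return Libdl.dlsym(handle, sym; throw_error = false) !== nothing
    finally
        Libdl.dlclose(handle)
    end
end

for lib in BLAS.get_config().loaded_libs
    for sym in (:cblas_cdotc_sub64_, :cblas_zdotc_sub64_,
                :cblas_cdotu_sub64_, :cblas_zdotu_sub64_)
        println(lib.libname, ": ", sym, " => ", has_symbol(lib.libname, sym))
    end
end
```

If the hypothesis is right, this should print `false` for all four symbols with MKL loaded, while Julia's stock ILP64 OpenBLAS should print `true`.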

antoine-levitt commented 2 years ago

This is possibly unrelated and unhelpful, so feel free to ignore, but I remember the dotc functions were particularly nasty to work with: they are the only BLAS functions that return a complex number, for which there is no unified ABI.
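For context, the hazard is that Fortran `cdotc` returns a complex value, and compilers disagree on how a complex return is passed (register pair vs. hidden result pointer), whereas the CBLAS `_sub` variants sidestep this by writing the result through an output pointer, which has a plain C ABI. A hedged sketch of that calling pattern (the `cblas_cdotc_sub64_` symbol name and the `LinearAlgebra.libblastrampoline` binding are assumptions based on LBT's ILP64 suffixing):

```julia
# Sketch: call the CBLAS out-parameter variant instead of the Fortran
# cdotc that returns a complex struct with a compiler-dependent ABI.
using LinearAlgebra

function cdotc_sub(x::Vector{ComplexF32}, y::Vector{ComplexF32})
    @assert length(x) == length(y)
    out = Ref{ComplexF32}()  # the result comes back through this pointer
    ccall((:cblas_cdotc_sub64_, LinearAlgebra.libblastrampoline), Cvoid,
          (Int64, Ptr{ComplexF32}, Int64, Ptr{ComplexF32}, Int64, Ref{ComplexF32}),
          length(x), x, 1, y, 1, out)
    return out[]
end
```

With the default OpenBLAS this should agree with `dot(x, y)`; with MKL it would fail at symbol lookup, which is the bug being reported here.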

ViralBShah commented 2 years ago

That is exactly the problem - which is why we use the CBLAS functions. But MKL does not have the 64_ suffixes for ILP64, it turns out.

ViralBShah commented 2 years ago

Maybe we should just implement dot in Julia instead of using BLAS.

ViralBShah commented 2 years ago

Or perhaps MKL has other APIs for dot that we may be able to use.

There's this but it's all C++: https://www.intel.com/content/www/us/en/develop/documentation/oneapi-mkl-dpcpp-developer-reference/top/blas-and-sparse-blas-routines/blas-level-1-routines-and-functions/dot.html

ViralBShah commented 2 years ago

@chriselrod Would you know if a native julia dot in Base would be competitive with BLAS (without LoopVectorization)?

chriselrod commented 2 years ago

> @chriselrod Would you know if a native julia dot in Base would be competitive with BLAS (without LoopVectorization)?

It depends. In terms of multithreaded performance, no, Base Julia will not be competitive, but I also don't think multithreaded dot products are especially compelling.

For single-threaded performance, LLVM doesn't always make the best decisions by default either. For example, on the M1, LLVM basically assumes it is an old phone chip with minimal out-of-order capability, and thus that only limited unrolling is beneficial.

It does, however, do well for Complex:

julia> using LinearAlgebra, BenchmarkTools

julia> function dotnative(x,y)
           s = zero(Base.promote_eltype(x, y))
           @fastmath for i in eachindex(x,y)
               s += x[i]'*y[i]
           end
           return s
       end
dotnative (generic function with 1 method)

julia> x = rand(Float32, 512);

julia> y = rand(Float32, 512);

julia> @btime dot($x, $y)
  34.995 ns (0 allocations: 0 bytes)
128.82773f0

julia> @btime dotnative($x, $y)
  48.371 ns (0 allocations: 0 bytes)
128.82771f0

julia> x = rand(Complex{Float32}, 512);

julia> y = rand(Complex{Float32}, 512);

julia> @btime dot($x, $y)
  283.276 ns (0 allocations: 0 bytes)
254.56534f0 + 11.200975f0im

julia> @btime dotnative($x, $y)
  136.343 ns (0 allocations: 0 bytes)
254.56543f0 + 11.200975f0im

julia> versioninfo()
Julia Version 1.8.0-DEV.1268
Commit 955d427135* (2022-01-10 15:37 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.2.0)
  CPU: Apple M1

Similarly, for CPUs with AVX512, LLVM prefers to use only 256 bit vectors, assuming that some combination of downclocking, its own poor handling of unvectorized remainders, and chips having only a single 512 bit FMA unit makes wider vectors problematic. Chips with two such units will therefore do much better at power-of-2 sizes when starting Julia with -C"native,-prefer-256-bit". Julia started normally:

julia> using LinearAlgebra, BenchmarkTools

julia> function dotnative(x,y)
           s = zero(Base.promote_eltype(x, y))
           @fastmath for i in eachindex(x,y)
               s += x[i]'*y[i]
           end
           return s
       end
dotnative (generic function with 1 method)

julia> x = rand(Float32, 512);

julia> y = rand(Float32, 512);

julia> @btime dot($x, $y)
  26.785 ns (0 allocations: 0 bytes)
136.5527f0

julia> @btime dotnative($x, $y)
  25.224 ns (0 allocations: 0 bytes)
136.5527f0

julia> x = rand(Complex{Float32}, 512);

julia> y = rand(Complex{Float32}, 512);

julia> @btime dot($x, $y)
  68.608 ns (0 allocations: 0 bytes)
255.64641f0 + 1.9186249f0im

julia> @btime dotnative($x, $y)
  84.649 ns (0 allocations: 0 bytes)
255.64641f0 + 1.918611f0im

julia> versioninfo()
Julia Version 1.8.0-DEV.1370
Commit 816c6a2627* (2022-01-21 20:05 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz

With -C"native,-prefer-256-bit" (which turns off the preference for 256 bit vectors):

julia> using LinearAlgebra, BenchmarkTools

julia> function dotnative(x,y)
           s = zero(Base.promote_eltype(x, y))
           @fastmath for i in eachindex(x,y)
               s += x[i]'*y[i]
           end
           return s
       end
dotnative (generic function with 1 method)

julia> x = rand(Float32, 512);

julia> y = rand(Float32, 512);

julia> @btime dot($x, $y)
  27.055 ns (0 allocations: 0 bytes)
136.53f0

julia> @btime dotnative($x, $y)
  21.177 ns (0 allocations: 0 bytes)
136.53f0

julia> x = rand(Complex{Float32}, 512);

julia> y = rand(Complex{Float32}, 512);

julia> @btime dot($x, $y)
  68.548 ns (0 allocations: 0 bytes)
255.58414f0 - 6.4648895f0im

julia> @btime dotnative($x, $y)
  56.924 ns (0 allocations: 0 bytes)
255.58414f0 - 6.4648867f0im

julia> versioninfo()
Julia Version 1.8.0-DEV.1370
Commit 816c6a2627* (2022-01-21 20:05 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
chriselrod commented 2 years ago

Getting a little more creative with the definitions, we can improve performance at odd sizes:

julia> function dotnative_fastrem(x,y)
         T = Base.promote_eltype(x, y)
         s = zero(T)
         N = length(x);
         @assert length(y) == N "vectors should have equal length"
         @inbounds @fastmath begin
           N32 = N & -32
           for i in 1:N32
             s += x[i]'*y[i]
           end
           Nrem32 = N & 31
           if Nrem32 ≥ 16
             for i in 1:16
               s += x[i+N32]' * y[i+N32]
             end
             N32 += 16
           end
           Nrem16 = Nrem32 & 15
           if Nrem16 ≥ 8  # was `Nrem32 ≥ 8`, which could index past the end under @inbounds
             for i in 1:8
               s += x[i+N32]' * y[i+N32]
             end
             N32 += 8
           end
           for i in N32+1:N
             s += x[i]' * y[i]
           end
         end
         return s
       end
dotnative_fastrem (generic function with 1 method)

julia> function dotnative(x,y)
           s = zero(Base.promote_eltype(x, y))
           @fastmath for i in eachindex(x,y)
               s += x[i]'*y[i]
           end
           return s
       end
dotnative (generic function with 1 method)

julia> x = rand(Complex{Float32}, 511);

julia> y = rand(Complex{Float32}, 511);

julia> @btime dotnative($x,$y)
  128.278 ns (0 allocations: 0 bytes)
257.9638f0 + 6.247218f0im

julia> @btime dotnative_fastrem($x,$y)
  77.057 ns (0 allocations: 0 bytes)
257.96378f0 + 6.247218f0im

This was just my first attempt. We can probably do better, but it's already a substantial improvement on this arch (Skylake-X with -prefer-256-bit) when the remainder is large.

For comparison, the fastest LoopVectorization version:

julia> using LoopVectorization, VectorizationBase

julia> function cdot_swizzle(ca::AbstractVector{Complex{T}}, cb::AbstractVector{Complex{T}}) where {T}
         a = reinterpret(T, ca)
         b = reinterpret(T, cb)
         reim = Vec(zero(T),zero(T)) # needs VectorizationBase
         @turbo for i ∈ eachindex(a)
           # vmovsldup/vmovshdup broadcast the real/imaginary lanes, vpermilps177
           # swaps adjacent lanes, and vfmsubadd does the alternating subtract/add
           # of the complex multiply-accumulate
           reim = vfmsubadd(vmovsldup(a[i]), b[i], vfmsubadd(vmovshdup(a[i]), vpermilps177(b[i]), reim))
         end
         Complex(reim(1), reim(2))
       end
cdot_swizzle (generic function with 1 method)

julia> x = rand(Complex{Float32}, 511);

julia> y = rand(Complex{Float32}, 511);

julia> @btime cdot_swizzle($x,$y)
  62.163 ns (0 allocations: 0 bytes)
255.22119f0 - 13.409902f0im

julia> @btime dotnative($x,$y)
  128.648 ns (0 allocations: 0 bytes)
255.22118f0 - 13.409902f0im

julia> @btime dotnative_fastrem($x,$y)
  77.479 ns (0 allocations: 0 bytes)
255.22119f0 - 13.409902f0im
ViralBShah commented 2 years ago

Thanks. That is pretty exhaustive! I feel that even in the trivial case, the difference is small enough that we may be ok going with a native implementation.

ViralBShah commented 2 years ago

Intel said they will have CBLAS functions with 64_ suffixes in the next release, but in the meantime they recommend that we provide our own wrappers for the CBLAS dot functions in LBT.

They suggest something like this. @staticfloat, would this work?

double cblas_ddot_64(const MKL_INT64 N, const double *X, const MKL_INT64 incX, const double *Y, const MKL_INT64 incY)
{
   return ddot_64(&N, X, &incX, Y, &incY);
}
ViralBShah commented 2 years ago

Should we tag and release 4.2 and pull it into Julia? After that, I can revive the MKL PR.

giordano commented 2 years ago

There is already v5.0.0, which I presume should go into Julia v1.8.