ViralBShah closed this issue 2 years ago.
Perhaps this is because `cblas_cdotc_sub`, `cblas_zdotc_sub`, `cblas_zdotu_sub`, and `cblas_cdotu_sub` do not have `64_` suffixes in MKL?
This is possibly unrelated and unhelpful, so feel free to ignore, but I remember the dotc functions were particularly nasty to work with: they are the only BLAS functions that return a complex number, for which there is no unified ABI.
That is exactly the problem, and it is why we use the CBLAS functions. But it turns out MKL does not have the `64_` suffixes for ILP64.
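For context, here is a minimal sketch (my own illustration, not from this thread) of how the CBLAS `sub` variants sidestep the complex-return ABI problem from the Julia side: the result is written through an output pointer rather than returned by value. The use of `@blasfunc` and `libblastrampoline` below reflects how Julia's `LinearAlgebra.BLAS` binds symbols; on an ILP64 build the macro appends the very suffix this issue is about.

```julia
# Sketch: calling the CBLAS "sub" variant of zdotc. The Fortran zdotc returns
# a COMPLEX*16 by value, which has no portable ABI; cblas_zdotc_sub instead
# writes the result through its last pointer argument.
using LinearAlgebra
using LinearAlgebra: BlasInt, libblastrampoline
using LinearAlgebra.BLAS: @blasfunc

function zdotc_sub(x::Vector{ComplexF64}, y::Vector{ComplexF64})
    n = length(x)
    @assert length(y) == n
    result = Ref{ComplexF64}()
    # On ILP64 builds @blasfunc expands cblas_zdotc_sub to cblas_zdotc_sub64_,
    # the suffixed symbol MKL lacks.
    ccall((@blasfunc(cblas_zdotc_sub), libblastrampoline), Cvoid,
          (BlasInt, Ptr{ComplexF64}, BlasInt, Ptr{ComplexF64}, BlasInt,
           Ref{ComplexF64}),
          n, x, 1, y, 1, result)
    return result[]
end
```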
Maybe we should just implement `dot` in Julia instead of using BLAS. Or perhaps MKL has other APIs for `dot` that we may be able to use.
There's this, but it's all C++: https://www.intel.com/content/www/us/en/develop/documentation/oneapi-mkl-dpcpp-developer-reference/top/blas-and-sparse-blas-routines/blas-level-1-routines-and-functions/dot.html
@chriselrod Would you know if a native Julia `dot` in Base would be competitive with BLAS (without LoopVectorization)?
> @chriselrod Would you know if a native Julia `dot` in Base would be competitive with BLAS (without LoopVectorization)?
It depends. In terms of multithreaded performance, no, Base Julia will not be competitive, but I also don't think multithreaded dot products are especially compelling.
In single-threaded performance, LLVM doesn't always make the best decisions by default either. For example, on the M1, LLVM basically assumes it is an old phone chip with minimal out-of-order capability, so only limited unrolling is beneficial.
It does, however, do well for `Complex`:
```julia
julia> using LinearAlgebra, BenchmarkTools

julia> function dotnative(x, y)
           s = zero(Base.promote_eltype(x, y))
           @fastmath for i in eachindex(x, y)
               s += x[i]' * y[i]
           end
           return s
       end
dotnative (generic function with 1 method)

julia> x = rand(Float32, 512);

julia> y = rand(Float32, 512);

julia> @btime dot($x, $y)
  34.995 ns (0 allocations: 0 bytes)
128.82773f0

julia> @btime dotnative($x, $y)
  48.371 ns (0 allocations: 0 bytes)
128.82771f0

julia> x = rand(Complex{Float32}, 512);

julia> y = rand(Complex{Float32}, 512);

julia> @btime dot($x, $y)
  283.276 ns (0 allocations: 0 bytes)
254.56534f0 + 11.200975f0im

julia> @btime dotnative($x, $y)
  136.343 ns (0 allocations: 0 bytes)
254.56543f0 + 11.200975f0im

julia> versioninfo()
Julia Version 1.8.0-DEV.1268
Commit 955d427135* (2022-01-10 15:37 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.2.0)
  CPU: Apple M1
```
Similarly, for CPUs with AVX512, LLVM prefers to use only 256-bit vectors, presumably because of some combination of downclocking, its poor handling of unvectorized remainders, and some chips having only a single 512-bit FMA unit. Chips with two such units will do much better at power-of-2 sizes when starting Julia with `-C"native,-prefer-256-bit"`.
Julia started normally:
```julia
julia> using LinearAlgebra, BenchmarkTools

julia> function dotnative(x, y)
           s = zero(Base.promote_eltype(x, y))
           @fastmath for i in eachindex(x, y)
               s += x[i]' * y[i]
           end
           return s
       end
dotnative (generic function with 1 method)

julia> x = rand(Float32, 512);

julia> y = rand(Float32, 512);

julia> @btime dot($x, $y)
  26.785 ns (0 allocations: 0 bytes)
136.5527f0

julia> @btime dotnative($x, $y)
  25.224 ns (0 allocations: 0 bytes)
136.5527f0

julia> x = rand(Complex{Float32}, 512);

julia> y = rand(Complex{Float32}, 512);

julia> @btime dot($x, $y)
  68.608 ns (0 allocations: 0 bytes)
255.64641f0 + 1.9186249f0im

julia> @btime dotnative($x, $y)
  84.649 ns (0 allocations: 0 bytes)
255.64641f0 + 1.918611f0im

julia> versioninfo()
Julia Version 1.8.0-DEV.1370
Commit 816c6a2627* (2022-01-21 20:05 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
```
With `-C"native,-prefer-256-bit"` (turns off the preference for 256-bit vectors):
```julia
julia> using LinearAlgebra, BenchmarkTools

julia> function dotnative(x, y)
           s = zero(Base.promote_eltype(x, y))
           @fastmath for i in eachindex(x, y)
               s += x[i]' * y[i]
           end
           return s
       end
dotnative (generic function with 1 method)

julia> x = rand(Float32, 512);

julia> y = rand(Float32, 512);

julia> @btime dot($x, $y)
  27.055 ns (0 allocations: 0 bytes)
136.53f0

julia> @btime dotnative($x, $y)
  21.177 ns (0 allocations: 0 bytes)
136.53f0

julia> x = rand(Complex{Float32}, 512);

julia> y = rand(Complex{Float32}, 512);

julia> @btime dot($x, $y)
  68.548 ns (0 allocations: 0 bytes)
255.58414f0 - 6.4648895f0im

julia> @btime dotnative($x, $y)
  56.924 ns (0 allocations: 0 bytes)
255.58414f0 - 6.4648867f0im

julia> versioninfo()
Julia Version 1.8.0-DEV.1370
Commit 816c6a2627* (2022-01-21 20:05 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
```
Getting a little more creative with the definitions, we can improve performance at odd sizes:
```julia
julia> function dotnative_fastrem(x, y)
           T = Base.promote_eltype(x, y)
           s = zero(T)
           N = length(x)
           @assert length(y) == N "vectors should have equal length"
           @inbounds @fastmath begin
               N32 = N & -32
               for i in 1:N32
                   s += x[i]' * y[i]
               end
               Nrem32 = N & 31
               if Nrem32 ≥ 16
                   for i in 1:16
                       s += x[i+N32]' * y[i+N32]
                   end
                   N32 += 16
               end
               Nrem16 = Nrem32 & 15
               if Nrem16 ≥ 8
                   for i in 1:8
                       s += x[i+N32]' * y[i+N32]
                   end
                   N32 += 8
               end
               for i in N32+1:N
                   s += x[i]' * y[i]
               end
           end
           return s
       end
dotnative_fastrem (generic function with 1 method)

julia> function dotnative(x, y)
           s = zero(Base.promote_eltype(x, y))
           @fastmath for i in eachindex(x, y)
               s += x[i]' * y[i]
           end
           return s
       end
dotnative (generic function with 1 method)

julia> x = rand(Complex{Float32}, 511);

julia> y = rand(Complex{Float32}, 511);

julia> @btime dotnative($x, $y)
  128.278 ns (0 allocations: 0 bytes)
257.9638f0 + 6.247218f0im

julia> @btime dotnative_fastrem($x, $y)
  77.057 ns (0 allocations: 0 bytes)
257.96378f0 + 6.247218f0im
```
This was just my first attempt. We can probably do better, but it's already a substantial improvement on this arch (Skylake-X with `-prefer-256-bit`) when the remainder is large.
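As a side note on the remainder handling in `dotnative_fastrem`: the trip counts come from plain bit masking. A tiny worked example, with values chosen purely for illustration:

```julia
# N & -32 clears the low five bits, rounding N down to a multiple of 32;
# N & 31 keeps only those bits, i.e. the remainder mod 32.
N = 511
N32    = N & -32   # main-loop trip count -> 480
Nrem32 = N & 31    # elements left for the remainder branches -> 31
println((N32, Nrem32))  # (480, 31)
```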
For comparison, the fastest LoopVectorization version:
```julia
julia> using LoopVectorization, VectorizationBase

julia> function cdot_swizzle(ca::AbstractVector{Complex{T}}, cb::AbstractVector{Complex{T}}) where {T}
           a = reinterpret(T, ca)
           b = reinterpret(T, cb)
           reim = Vec(zero(T), zero(T)) # needs VectorizationBase
           @turbo for i ∈ eachindex(a)
               reim = vfmsubadd(vmovsldup(a[i]), b[i], vfmsubadd(vmovshdup(a[i]), vpermilps177(b[i]), reim))
           end
           Complex(reim(1), reim(2))
       end
cdot_swizzle (generic function with 1 method)

julia> x = rand(Complex{Float32}, 511);

julia> y = rand(Complex{Float32}, 511);

julia> @btime cdot_swizzle($x, $y)
  62.163 ns (0 allocations: 0 bytes)
255.22119f0 - 13.409902f0im

julia> @btime dotnative($x, $y)
  128.648 ns (0 allocations: 0 bytes)
255.22118f0 - 13.409902f0im

julia> @btime dotnative_fastrem($x, $y)
  77.479 ns (0 allocations: 0 bytes)
255.22119f0 - 13.409902f0im
```
Thanks, that is pretty exhaustive! I feel that even in the trivial case, the difference is small enough that we may be OK going with a native implementation.
Intel said they will have CBLAS functions with `64_` suffixes in the next release, but in the meantime they recommend that we provide our own wrappers for the CBLAS dot functions in LBT.
They recommend something like this. @staticfloat would this work?
```c
double cblas_ddot_64(const MKL_INT64 N, const double *X, const MKL_INT64 incX,
                     const double *Y, const MKL_INT64 incY)
{
    return ddot_64(&N, X, &incX, Y, &incY);
}
```
Should we tag and release 4.2 and pull it into Julia? After that, I can revive the MKL PR.
Using the `vs/lp64` branch of MKL.jl.