Gist with the LLVM code generated for a similar problem: https://gist.github.com/gbaraldi/9ed4d47fd7fabc55f5d013399af2862e (on an M1 Mac, native).
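For reference, f23 itself isn't defined anywhere in this excerpt. The effect reproduces with any cheap scalar function; a hypothetical stand-in consistent with the ~39 KiB allocation figures below would be:

using BenchmarkTools

f23(x) = x^2 + 0.5f0 * x     # hypothetical stand-in; the original f23 is not shown in this excerpt
x = rand(Float32, 10_000);
y = similar(x);

@btime $y .= f23.($x);       # distinct destination: vectorized loop
@btime $x .= f23.($x);       # destination aliases source: runtime check sends it down the scalar path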
My guess is that this is a runtime memory check inserted by LLVM, which decides not to take the SIMD path when it sees that the memory aliases. From the LLVM IR:

vector.memcheck:                                  ; preds = %L178.us65.preheader
  %scevgep99 = getelementptr float, float* %19, i64 %24
  %26 = shl i64 %23, 2
  %27 = add i64 %26, 4
  %28 = mul i64 %10, %27
  %uglygep = getelementptr i8, i8* %20, i64 %28
  %bound0 = icmp ugt i8* %uglygep, %scevgep96
  %bound1 = icmp ult float* %scevgep99, %scevgep97
  %found.conflict = and i1 %bound0, %bound1
  br i1 %found.conflict, label %L178.us65, label %vector.ph
The L178.us65 block holds the scalar version of the code. I could confirm this with a simple @simd loop, and adding ivdep "fix"es the performance difference.
Maybe this is what you meant, but adding that to the loop in the copyto! method which is being called here speeds up the broadcast! case above. Can this be done safely? IIRC you may need to avoid BitVectors; maybe there are other problems.
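On the BitVector point, a short illustration of why ivdep would be a lie there (a sketch, nothing more):

# Every BitVector assignment is a read-modify-write of a shared UInt64 chunk:
dest = falses(64)          # all 64 elements live in a single chunk, dest.chunks[1]
for i in eachindex(dest)
    dest[i] = true         # loads the chunk, sets bit i, stores the chunk back
end
# Distinct iterations therefore touch the same memory location, a genuine
# loop-carried dependence; @simd ivdep asserts there is none, so vectorizing
# this loop could lose updates.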
julia> @eval Base.Broadcast @inline function copyto!(dest::AbstractArray, bc::Broadcasted{Nothing})
           axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
           # Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match
           if bc.f === identity && bc.args isa Tuple{AbstractArray} # only a single input argument to broadcast!
               A = bc.args[1]
               if axes(dest) == axes(A)
                   return copyto!(dest, A)
               end
           end
           bc′ = preprocess(dest, bc)
           # Performance may vary depending on whether `@inbounds` is placed outside the
           # for loop or not. (cf. https://github.com/JuliaLang/julia/issues/38086)
           @simd ivdep for I in eachindex(dest)
               @inbounds dest[I] = bc′[I]
           end
           return dest
       end
copyto! (generic function with 60 methods)
julia> @btime $x .= f23.($x);
  1.562 μs (0 allocations: 0 bytes)
Of course, we can't add ivdep to the most general fallback. IIUC, we only need to worry about whether the dest array is self-overlapping, as the srcs have already been unaliased against dest where needed. So some check like:
if dest isa StridedArray{<:Base.HWNumber} # just an example
    @simd ivdep for I in eachindex(dest)
        @inbounds dest[I] = bc′[I]
    end
else
    @simd for I in eachindex(dest)
        @inbounds dest[I] = bc′[I]
    end
end
should be safe enough?
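(For context on the unaliasing claim: preprocess calls Broadcast.broadcast_unalias, which leaves a source that is === dest alone and copies any other source that might share memory with dest, so after preprocessing only dest's own layout matters.) A hypothetical way to spell the check as a trait, with a name invented here:

# ivdep is admissible when distinct indices of dest can never share memory.
# Dense, strided arrays of hardware number types qualify; BitArray does not
# (bits share chunks), and an arbitrary AbstractArray makes no such promise.
ivdep_safe(::StridedArray{<:Base.HWNumber}) = true
ivdep_safe(::AbstractArray) = false

Then the copyto! loop above could branch on ivdep_safe(dest).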
Some more examples:
julia> a = rand(1024); b = similar(a);

julia> @inline f(x, y, c) = x .= y .+ c # force inline
f (generic function with 1 method)

julia> ff(x, c) = f(x, x, c)
ff (generic function with 1 method)

julia> @btime f($b, $a, $0);
  67.857 ns (0 allocations: 0 bytes)   # fast, as expected

julia> @btime f($a, $a, $0);
  310.117 ns (0 allocations: 0 bytes)  # slow, as expected

julia> @btime ff($a, $0);
  62.322 ns (0 allocations: 0 bytes)   # inlined version is fast

julia> @noinline f(x, y, c) = x .= y .+ c # force noinline
f (generic function with 1 method)

julia> @btime ff($a, $0);
  311.673 ns (0 allocations: 0 bytes)  # noinlined version is slow
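One way to check this directly (assuming the definitions above) is to look for a vector.memcheck block in the generated IR:

using InteractiveUtils            # for @code_llvm outside the REPL
@code_llvm debuginfo=:none f(a, a, 0)
# If "vector.memcheck" appears, the runtime alias guard (and its scalar
# fallback) survived; if it is absent, the loop is unconditionally SIMD.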
~The above bench indicates that the inlined version could be vectorized correctly (@code_llvm also shows that the vector.memcheck block is omitted).~ Edit: Further benchmarking shows that only the inlined 1-d self in-place broadcast gets vectorized properly (might be caused by the index transformation, as @simd separates the innermost loop).
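A quick probe of that 2-d claim, reusing the hypothetical f23 stand-in from above (no timings asserted here):

A = rand(Float32, 100, 100);
@btime $A .= f23.($A);            # 2-d self-assignment
@btime vec($A) .= f23.(vec($A));  # the same memory driven through a 1-d view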
LLVM's auto-vectorization has a similar problem, so the best solution might be to relax the current runtime check on the LLVM side? Note that the fully-aliased case the check trips on here is exactly the benign one: when dest === src with matching indices, iteration i reads and writes only element i, so there is no loop-carried dependence and the vectorized body would be correct.
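A correctness spot-check for that fully-aliased case, reusing the step!/step_ivdep! sketches from above:

x1 = rand(Float32, 1000); x2 = copy(x1);
step!(x1, x1)          # plain @simd kernel, runtime guard intact
step_ivdep!(x2, x2)    # ivdep kernel, guard removed
@assert x1 == x2       # identical results; only the speed differs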
Some updated timings; this still appears to reproduce:
julia> using BenchmarkTools

julia> @btime $y .= f23.($x);
  935.345 ns (0 allocations: 0 bytes)

julia> @btime $x .= f23.($x);
  4.494 μs (0 allocations: 0 bytes)

julia> @btime f23.($x);
  1.012 μs (2 allocations: 39.11 KiB)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores

julia> @btime $y .= f23.($x);
  1.779 μs (0 allocations: 0 bytes)

julia> @btime $x .= f23.($x);
  6.442 μs (0 allocations: 0 bytes)

julia> @btime f23.($x); # allocating; mean time 3.010 μs, still faster than the x .= case
  1.972 μs (2 allocations: 39.11 KiB)

julia> versioninfo()
Julia Version 1.9.0-alpha1
Commit 0540f9d7394 (2022-11-15 14:37 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.5.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, westmere)
  Threads: 1 on 10 virtual cores
Can also reproduce on v1.9.
I was surprised by this slowdown when writing back into x, instead of into another array y:

This is 1.7.0-rc2, but similar on 1.5 and master, and on other computers. I don't think it's a benchmarking artefact; it persists with evals=1 setup=(x=...; y=...). I don't think it's a hardware limit, since @turbo $x .= f23.($x) with LoopVectorization.jl, or @.. $x = f23($x) with FastBroadcast.jl, don't show this difference.

For comparison, it seems that map! is never fast here, although map is: