JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

`broadcast!(f, x, x)` slower, prevents SIMD? #43153

Open mcabbott opened 2 years ago

mcabbott commented 2 years ago

I was surprised by this slowdown when writing back into x, instead of into another array y:

julia> f23(x) = ifelse(x>0, x^2, x^3);

julia> x = randn(Float32, 100, 100); y = similar(x);

julia> @btime $y .= f23.($x);
  1.958 μs (0 allocations: 0 bytes)

julia> @btime $x .= f23.($x);
  6.567 μs (0 allocations: 0 bytes)

julia> @btime f23.($x);  # allocating; mean time 3.010 μs still faster than x .= case
  2.236 μs (2 allocations: 39.11 KiB)

This is 1.7.0-rc2, but it is similar on 1.5 and master, and on other computers. I don't think it's a benchmarking artefact; it persists with `evals=1 setup=(x=...; y=...)`. I don't think it's a hardware limit either, since neither `@turbo $x .= f23.($x)` with LoopVectorization.jl nor `@.. $x = f23($x)` with FastBroadcast.jl shows this difference.

For comparison, it seems that `map!` is never fast here, although `map` is:

julia> @btime map!(f23, $y, $x);
  8.055 μs (0 allocations: 0 bytes)

julia> @btime map!(f23, $x, $x);
  8.055 μs (0 allocations: 0 bytes)

julia> @btime map(f23, $x);
  1.917 μs (2 allocations: 39.11 KiB)
gbaraldi commented 2 years ago

Here is a gist with the LLVM code generated for a similar problem, on an M1 Mac running natively: https://gist.github.com/gbaraldi/9ed4d47fd7fabc55f5d013399af2862e

KristofferC commented 2 years ago

My guess is that this is a runtime memory check inserted by LLVM, which decides not to take the SIMD path when it sees that the memory aliases. From the LLVM IR:

vector.memcheck:                                  ; preds = %L178.us65.preheader
     %scevgep99 = getelementptr float, float* %19, i64 %24
     %26 = shl i64 %23, 2
     %27 = add i64 %26, 4
     %28 = mul i64 %10, %27
     %uglygep = getelementptr i8, i8* %20, i64 %28
     %bound0 = icmp ugt i8* %uglygep, %scevgep96
     %bound1 = icmp ult float* %scevgep99, %scevgep97
     %found.conflict = and i1 %bound0, %bound1
     br i1 %found.conflict, label %L178.us65, label %vector.ph

The `L178.us65` block holds the scalar version of the code.
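One way to look for this check from the Julia side (a sketch; the wrapper name `g!` is mine, not from the thread) is to dump the IR for a concretely-typed broadcast and search for the `vector.memcheck` block:

```julia
using InteractiveUtils  # for code_llvm

f23(x) = ifelse(x > 0, x^2, x^3)

# Hypothetical wrapper so we get IR for concrete Matrix{Float32} arguments.
g!(dest, x) = dest .= f23.(x)

# Dump the optimized IR into a string and search it for the runtime
# aliasing check that guards the vectorized path.
ir = sprint(io -> code_llvm(io, g!, Tuple{Matrix{Float32},Matrix{Float32}}))
println(occursin("memcheck", ir))
```

Whether the check appears will depend on the Julia/LLVM version, but when it does, it is exactly the `%found.conflict` branch quoted above.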

N5N3 commented 2 years ago

I can confirm this with a simple `@simd` loop, and adding `ivdep` "fix"es the performance difference.
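For reference, a minimal sketch of such a confirmation (the function names `g!` and `g_ivdep!` are mine, not from the thread):

```julia
f23(x) = ifelse(x > 0, x^2, x^3)

# Plain @simd: LLVM emits a runtime aliasing check and falls back to
# scalar code when dest and x turn out to be the same array.
function g!(dest, x)
    @simd for i in eachindex(dest, x)
        @inbounds dest[i] = f23(x[i])
    end
    return dest
end

# @simd ivdep: asserts there are no loop-carried dependencies, so the
# runtime check is dropped and the loop vectorizes even when dest === x.
function g_ivdep!(dest, x)
    @simd ivdep for i in eachindex(dest, x)
        @inbounds dest[i] = f23(x[i])
    end
    return dest
end
```

Timing `g!(x, x)` against `g_ivdep!(x, x)` with `@btime` should show roughly the same gap as the `broadcast!` case above.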

mcabbott commented 2 years ago

Maybe this is what you meant, but adding `ivdep` to the loop in the method of `copyto!` that is being called here speeds up the `broadcast!` case above. Can this be done safely? IIRC you may need to avoid BitVectors, and maybe there are other problems.

julia> @eval Base.Broadcast @inline function copyto!(dest::AbstractArray, bc::Broadcasted{Nothing})
           axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
           # Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match
           if bc.f === identity && bc.args isa Tuple{AbstractArray} # only a single input argument to broadcast!
               A = bc.args[1]
               if axes(dest) == axes(A)
                   return copyto!(dest, A)
               end
           end
           bc′ = preprocess(dest, bc)
           # Performance may vary depending on whether `@inbounds` is placed outside the
           # for loop or not. (cf. https://github.com/JuliaLang/julia/issues/38086)
           @simd ivdep for I in eachindex(dest)
               @inbounds dest[I] = bc′[I]
           end
           return dest
       end
copyto! (generic function with 60 methods)

julia> @btime $x .= f23.($x);
  1.562 μs (0 allocations: 0 bytes)
N5N3 commented 2 years ago

Of course, we can't add `ivdep` to the most general fallback. IIUC, we only need to be concerned with whether the dest array is self-overlapping, as the srcs have already been unaliased if needed. So some check like:

if dest isa StridedArray{<:Base.HWNumber} # just an example
    @simd ivdep for I in eachindex(dest)
       @inbounds dest[I] = bc′[I]
    end
else
   @simd for I in eachindex(dest)
       @inbounds dest[I] = bc′[I]
   end
end

should be safe enough?
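As an illustration of why the general fallback can't use `ivdep` (my own example, not from the thread): a `view` with repeated indices is a dest that overlaps itself, so its stores have a genuine loop-carried dependency that `ivdep` would let LLVM break:

```julia
x = [1.0, 2.0, 3.0]
d = view(x, [1, 1, 2])   # d[1] and d[2] both alias x[1]

# Sequential semantics: x[1] is written twice and the last write wins.
# Under `ivdep`, LLVM would be free to reorder or vectorize these stores.
d .= [10.0, 20.0, 30.0]
x  # == [20.0, 30.0, 3.0] with the sequential (non-ivdep) loop
```

A `StridedArray{<:Base.HWNumber}` check as proposed above excludes such views, since a strided array cannot overlap itself.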

N5N3 commented 2 years ago

Some more examples:

julia> a = rand(1024); b = similar(a);
julia> @inline f(x, y, c) = x .= y .+ c # force inline
f (generic function with 1 method)
julia> ff(x, c) = f(x, x, c)
ff (generic function with 1 method)
julia> @btime f($b, $a, $0);
  67.857 ns (0 allocations: 0 bytes)   # fast, as expected
julia> @btime f($a, $a, $0);
  310.117 ns (0 allocations: 0 bytes)  # slow, as expected
julia> @btime ff($a, $0);
  62.322 ns (0 allocations: 0 bytes)   # inlined version is fast
julia> @noinline f(x, y, c) = x .= y .+ c  # force noinline
f (generic function with 1 method)
julia> @btime ff($a, $0);
  311.673 ns (0 allocations: 0 bytes)  # non-inlined version is slow

~~The above benchmark indicates that the inlined version can be vectorized correctly (`@code_llvm` also shows that the `vector.memcheck` block is omitted).~~ Edit: Further benchmarking shows that only an inlined 1-d self-inplace broadcast gets vectorized properly (possibly caused by the index transformation, since `@simd` separates the innermost loop). LLVM's auto-vectorization has a similar problem, so the best solution might be to relax the current runtime check on the LLVM side?

brenhinkeller commented 1 year ago

Some updated timings; the issue still appears to reproduce:

julia> using BenchmarkTools

julia> @btime $y .= f23.($x);
  935.345 ns (0 allocations: 0 bytes)

julia> @btime $x .= f23.($x);
  4.494 μs (0 allocations: 0 bytes)

julia> @btime f23.($x);
  1.012 μs (2 allocations: 39.11 KiB)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores

julia> @btime $y .= f23.($x);
  1.779 μs (0 allocations: 0 bytes)

julia> @btime $x .= f23.($x);
  6.442 μs (0 allocations: 0 bytes)

julia> @btime f23.($x);  # allocating; mean time 3.010 μs still faster than x .= case
  1.972 μs (2 allocations: 39.11 KiB)

julia> versioninfo()
Julia Version 1.9.0-alpha1
Commit 0540f9d7394 (2022-11-15 14:37 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.5.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, westmere)
  Threads: 1 on 10 virtual cores
miguelraz commented 1 year ago

Can also reproduce on v1.9.