Inferior performance of inplace broadcasting over `map!` for multiple wide matrices

jishnub commented 1 year ago

See https://discourse.julialang.org/t/why-is-a-multi-argument-inplace-map-much-faster-in-this-case-than-a-broadcast/91525/6, the following seems broadly reproducible across a range of platforms:

julia> A = rand(10, 1000); B = copy(A); C = zero(A);

julia> using BenchmarkTools

julia> @btime map!(+, $C, $A, $B);
  8.947 μs (0 allocations: 0 bytes)

julia> @btime $C .= $A .+ $B;
  11.281 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7742 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 1 on 64 virtual cores
Environment:
  LD_LIBRARY_PATH = /home/user/lib:/lib::/home/user/.local/lib
  JULIA_DEPOT_PATH = /scratch/user/julia
  JULIA_REVISE_POLL = 1
  JULIA_NUM_PRECOMPILE_TASKS = 1
  JULIA_EDITOR = vi

This difference goes away for nearly square matrices, and is minimal for tall matrices. On some platforms, broadcasting performs better for the tall and square cases. However, map! seems to consistently do better for the wide case.

N5N3 commented 1 year ago

I think we already have a issue to track this problem but I can't find it. Anyway, the dim1 in this example is too small for vectorization and our broadcast only tries to vectorize the inner most loop. map! is faster here only because the inputs have linear index. If they are cartesian-indexed, I guess map! would be slower as IIRC it's zip based.

jishnub commented 1 year ago

Such wide matrices are frequently encountered in BandedMatrices. It would be good if broadcasting had a fast path that used linear indexing if all terms are compatible with it.

N5N3 commented 1 year ago

We have some related (Edit: but for different purpose) trial in Base (#30973), but I can't say it would be landed in the near future. For Pkg dev I think Fastbroadcast.jl might be the best solution for now. (Although it might not be that light weight ...)

giordano commented 1 year ago

I think we already have a issue to track this problem but I can't find it.

Do you refer to #28126?

N5N3 commented 1 year ago

No the number of broadcasted args in MWE above is too small thus it won't hit https://github.com/JuliaLang/julia/issues/28126. The solution here is switching to linear-indexing if we can prove:

all args have the same axes (or they have 0 dimension.)
all args are IndexLinear.

JuliaLang / julia

Inferior performance of inplace broadcasting over `map!` for multiple wide matrices #47873