JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

x86 `mapreduce` performance anomaly #50827

Open chrstphrbrns opened 1 year ago

chrstphrbrns commented 1 year ago

`sum(f, A)` performs significantly worse than `sum(f.(A))` for integer inputs to certain transcendental functions on x86 (possibly specific to AMD?). In the timings below, `sin` shows the slowdown while `log` does not.

julia> versioninfo()
Julia Version 1.11.0-DEV.237
Commit 958da95647 (2023-08-07 21:48 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 3950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
  Threads: 47 on 32 virtual cores
Environment:
  LD_PRELOAD = /lib/x86_64-linux-gnu/libc_malloc_debug.so.0
  JULIA_NUM_THREADS = 32
  JULIA_EDITOR = vim

julia> a=collect(1:1000000);

julia> @btime sum(sin.(a))
  8.574 ms (4 allocations: 7.63 MiB)
-0.11710952409815278

julia> @btime sum(sin,a)
  12.908 ms (1 allocation: 16 bytes)
-0.11710952409817987

julia> @btime sum(log.(a))
  8.066 ms (4 allocations: 7.63 MiB)
1.2815518384658169e7

julia> @btime sum(log,a)
  6.302 ms (1 allocation: 16 bytes)
1.281551838465817e7
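The comparison above can be repeated without extra packages. The following is a minimal sketch, not the reporter's method: `best_time`, `fused`, and `materialized` are hypothetical helper names introduced here, and the `Float64` copy of the input is an added control to check whether the effect is tied to integer inputs, as the report suggests. Timings from `@elapsed` are coarser than `@btime`, so treat the numbers as indicative only.

```julia
# Base-only timing sketch. `best_time` is a hypothetical helper, not part of Base.
# It returns the minimum wall-clock time (seconds) over several runs.
function best_time(f, args...; runs = 5)
    f(args...)                        # warm-up call so compilation is not timed
    best = Inf
    for _ in 1:runs
        t = @elapsed f(args...)
        best = min(best, t)
    end
    return best
end

fused(f, a)        = sum(f, a)        # mapreduce form: no temporary array
materialized(f, a) = sum(f.(a))       # broadcast form: allocates f.(a) first

a_int   = collect(1:1_000_000)        # Vector{Int}, as in the report
a_float = Float64.(a_int)             # same values as Vector{Float64} (added control)

for (label, a) in (("Int", a_int), ("Float64", a_float)), f in (sin, log)
    t_fused = best_time(fused, f, a)
    t_mat   = best_time(materialized, f, a)
    println(rpad(label, 8), rpad(string(f), 4),
            " sum(f, a): ", round(t_fused * 1e3; digits = 3), " ms",
            "   sum(f.(a)): ", round(t_mat * 1e3; digits = 3), " ms")
end
```

Passing the array as a function argument avoids timing access to a non-constant global, which is the same concern that `$a` interpolation addresses with `@btime`.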

It's a different story on Apple Silicon, where `sum(sin, a)` is slightly faster than `sum(sin.(a))`:

julia> versioninfo()
Julia Version 1.11.0-DEV.237
Commit 958da95647 (2023-08-07 21:48 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin23.0.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
Environment:
  JULIA_EDITOR = vim

julia> a=collect(1:1000000);

julia> @btime sum(sin.(a))
  6.621 ms (4 allocations: 7.63 MiB)
-0.11710952409819408

julia> @btime sum(sin,a)
  5.972 ms (1 allocation: 16 bytes)
-0.11710952409817987

brenhinkeller commented 1 year ago

It looks like there's also quite a bit of performance left on the table in both cases:

julia> @btime sum(sin.($a))
  6.207 ms (2 allocations: 7.63 MiB)
-0.11710952409819408

julia> @btime sum(sin,$a)
  5.925 ms (0 allocations: 0 bytes)
-0.11710952409817987

julia> using LoopVectorization

julia> @btime vmapreduce(sin, +, $a)
  1.501 ms (0 allocations: 0 bytes)
-0.11710952409810094

(The `vmapreduce` timing is from LoopVectorization.jl, for comparison.)
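The results printed in this thread agree only to about 13 decimal places. That is consistent with floating-point additions being carried out in a different order by each method, though the thread itself does not confirm the cause. A small sketch of how one might check that the variants agree within a tolerance, using only Base functions; the `atol` value is an illustrative choice, not a derived error bound:

```julia
a = collect(1:1_000_000)

s_fused = sum(sin, a)          # mapreduce form
s_mat   = sum(sin.(a))         # broadcast-then-sum form
s_left  = foldl(+, sin.(a))    # strict left-to-right accumulation

# Compare with a tolerance rather than `==`, since the low-order digits
# may legitimately differ between summation orders.
println(isapprox(s_fused, s_mat;  atol = 1e-9))
println(isapprox(s_fused, s_left; atol = 1e-9))
```

An absolute tolerance is used here because the sum is close to zero relative to the number of terms, which makes a relative tolerance harder to interpret.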