JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.75k stars 5.49k forks source link

`reduce` does not re-associate operations #48129

Open mikmoore opened 1 year ago

mikmoore commented 1 year ago

reduce misses the opportunity to re-associate operations when given NTuple arguments. This has performance impacts on notable functions such as sum(::NTuple). One can use @fastmath to permit re-association in some cases, but I'm never a fan of suggesting @fastmath to people.

julia> using BenchmarkTools

julia> x = ntuple(_->randn(), 16);

julia> @btime sum($x)
  5.700 ns (0 allocations: 0 bytes)
-2.8324452894507908

julia> @btime @fastmath(reduce(+,$x))
  3.100 ns (0 allocations: 0 bytes)
-2.8324452894507903

You can see the lack of vectorization by comparing code_native(sum,Tuple{NTuple{16,Float64}};debuginfo=:none) to the SIMDing version code_native(x->@fastmath(reduce(+,x)),Tuple{NTuple{16,Float64}};debuginfo=:none).

Methods like sum(::Vector) already do re-associate (with performance and accuracy benefits).

mikmoore commented 1 year ago

I was shortsighted in limiting this discussion to tuples. Unsurprisingly, it also applies to Array with nonstandard reduction functions. Taken from this Discourse thread:

julia> x = rand(512);

julia> @btime reduce((x,y) -> x<y ? x : y, $x, init=Inf)
  395.203 ns (0 allocations: 0 bytes)
0.0004658392690449764

julia> @btime reduce((x,y) -> @fastmath(x < y) ? x : y, $x, init=Inf)
  33.046 ns (0 allocations: 0 bytes)
0.0004658392690449764
gbaraldi commented 1 year ago

@fastmath removes extra checks, so the comparison isn't perfectly fair, but we should allow it anyway.

mikmoore commented 1 year ago

The speedups in both examples are purely due to re-association. There aren't any extra checks involved and the core functionality is implemented using native x86-64 instructions (vaddsd/vaddpd for +, vminsd/vminpd for x<y ? x : y) . The only differences are SIMD and (in the Vector case) unrolling in the re-associated versions.

But yes, all the extra transformations that @fastmath could enable in other situations (e.g., used with min) are why this should be standalone functionality in reduce.