JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

`sum(f, itr; init)` slower than `sum(f, itr)` #47216

Open giordano opened 2 years ago

giordano commented 2 years ago

In https://github.com/niklas-heer/speed-comparison we noticed that `sum(f, itr; init)` is slower than `sum(f, itr)`:

julia> f(n) = sum(i -> 4/(4i-2n-3), 1:n)
f (generic function with 1 method)

julia> g(n) = sum(i -> 4/(4i-2n-3), 1:n; init=0.0)
g (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime f(100_000_000)
  217.560 ms (0 allocations: 0 bytes)
3.1415926435897923

julia> @btime g(100_000_000)
  398.015 ms (0 allocations: 0 bytes)
3.1415926435899633

julia> versioninfo()
Julia Version 1.9.0-DEV.1600
Commit 392bc97a3a (2022-10-16 17:39 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, haswell)
  Threads: 1 on 8 virtual cores

which is a pity because `sum(f, itr; init)` can be statically compiled, while `sum(f, itr)` cannot (because some internal function can throw).
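The static-compilation point comes down to empty-collection handling: without `init`, reducing an empty collection with a generic function has to throw, whereas `init` supplies the return value directly. A minimal illustration (`h` and `h_init` are hypothetical helper names):

```julia
# Hypothetical helpers: h has no init, h_init supplies one.
h(itr) = sum(x -> 2x, itr)
h_init(itr) = sum(x -> 2x, itr; init = 0)

h_init(Int[])  # returns 0: init is the result for an empty collection
# h(Int[])     # throws: reducing an empty collection with a generic f
#              # has no defined value (exact error varies by version)
```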

Moelf commented 2 years ago

the fast path:

f(n) = Base._mapreduce_dim(i->4/(4i-2n-3), Base.add_sum, Base._InitialValue(), 1:n, :)

## so totally different path
_mapreduce_dim(f, op, ::_InitialValue, A::AbstractArrayOrBroadcasted, ::Colon) =
    _mapreduce(f, op, IndexStyle(A), A)

For arrays longer than 16 elements, this eventually calls:

mapreduce_impl(f, op, A, first(inds), last(inds))

mapreduce_impl(f, op, A::AbstractArrayOrBroadcasted, ifirst::Integer, ilast::Integer, blksize=1024)

the slow path:

g(n) = Base.mapfoldl_impl(i->4/(4i-2n-3), Base.add_sum, Base._InitialValue(), 1:n)
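The shape of the fast path can be sketched as follows; this is a simplified stand-in in the spirit of `Base.mapreduce_impl`, NOT the actual Base code:

```julia
# Simplified sketch of a blocked pairwise reduction: ranges shorter than
# blksize are reduced with a plain @simd loop, longer ranges are split
# in half and combined pairwise. The pairwise splitting bounds rounding
# error; the @simd leaf loop is what lets LLVM vectorize.
function pairwise_sum(f, A, ifirst, ilast, blksize = 1024)
    if ilast - ifirst < blksize
        acc = f(A[ifirst])
        @simd for i in (ifirst + 1):ilast
            acc += f(A[i])
        end
        return acc
    else
        imid = (ifirst + ilast) >> 1
        return pairwise_sum(f, A, ifirst, imid, blksize) +
               pairwise_sum(f, A, imid + 1, ilast, blksize)
    end
end
```

The slow path with `init` is a plain left-to-right fold with none of this structure.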
N5N3 commented 2 years ago

Our fold routine has no `@simd`, so we need `@fastmath` here to make LLVM vectorize the sum reduction.

g_fast(n) = @fastmath mapfoldl(i -> 4/(4i-2n-3), +, 1:n)

In fact `g_fast` should be faster than `f`, as it doesn't use pairwise reduction.
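A quick sanity check that the `@fastmath` fold agrees with the default pairwise `sum` up to reassociation-level rounding error (same series as in the original report, smaller `n` for speed):

```julia
# f uses the default pairwise-sum path; g_fast is a left fold compiled
# with fast-math flags, so LLVM may reassociate and vectorize it.
f(n) = sum(i -> 4/(4i - 2n - 3), 1:n)
g_fast(n) = @fastmath mapfoldl(i -> 4/(4i - 2n - 3), +, 1:n)

# Both approximate pi; they differ only at rounding-error level because
# reassociation changes the order of the additions.
abs(f(10^6) - g_fast(10^6)) < 1e-8
```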

matthias314 commented 1 year ago

Let me repeat my suggestion from #49763: `sum` and `prod` use `add_sum` and `mul_prod`. What about making these functions use the fast versions of `+` and `*`? I think no fixed order of the terms is required for `sum` and `prod`. Moreover, the fast versions seem to respect NaN (at least on my machine). Then we would get:

julia> Base.add_sum(x::Real, y::Real) = @fastmath x+y
julia> v = rand(1000);
julia> @btime sum($v);
  72.881 ns (0 allocations: 0 bytes)   # before: 73.487 ns
julia> @btime sum($v; init = 0.0);
  71.168 ns (0 allocations: 0 bytes)   # before: 1.020 μs
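The same speedup can be tried without piratically redefining `Base.add_sum` for all `Real`s, by passing an explicit fast-math operator to `reduce`; `fastadd` and `fast_sum` are hypothetical names for this sketch:

```julia
# An explicit fast-math reduction operator avoids type piracy on
# Base.add_sum while still letting LLVM reassociate (and hence
# vectorize) the accumulation loop.
fastadd(x, y) = @fastmath x + y
fast_sum(v; init = zero(eltype(v))) = reduce(fastadd, v; init = init)
```

Unlike the `Base.add_sum` override, this only affects call sites that opt in.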

In order to extend this to `maximum` and `minimum`, there are two issues:

For example, the function

@generated function fastmax(x::Float64, y::Float64)
    quote
        $(Expr(:meta, :inline))
        Base.llvmcall("""
            %f = fcmp fast ugt double %0, %1
            %r = select i1 %f, double %0, double %1
            ret double %r
            """, Float64, Tuple{Float64,Float64}, x, y)
    end
end

vectorizes, but it doesn't always return NaN if one of the arguments is NaN. Maybe one could add a flag to `maximum` and `minimum` to indicate that such a function can be used. It would be much faster than the current implementation:

julia> v = rand(1000);
# currently:
julia> @btime maximum($v);
  1.019 μs (0 allocations: 0 bytes)
julia> @btime maximum($v; init = 0.0);
  1.988 μs (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(max, $v);
  1.024 μs (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(max, $v; init = 0.0);
  1.020 μs (0 allocations: 0 bytes)
# new
julia> @btime reduce(fastmax, $v);
  74.428 ns (0 allocations: 0 bytes)
julia> @btime reduce(fastmax, $v; init = 0.0);
  71.370 ns (0 allocations: 0 bytes)
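For NaN-free input the fast comparison agrees with the ordinary `maximum`, which is what would make an opt-in flag workable; a minimal check reusing the `fastmax` definition from above:

```julia
# fastmax from the comment above: `fcmp fast ugt` lets LLVM assume no
# NaNs, so this is only valid on NaN-free data.
@generated function fastmax(x::Float64, y::Float64)
    quote
        $(Expr(:meta, :inline))
        Base.llvmcall("""
            %f = fcmp fast ugt double %0, %1
            %r = select i1 %f, double %0, double %1
            ret double %r
            """, Float64, Tuple{Float64,Float64}, x, y)
    end
end

v = rand(1000)
reduce(fastmax, v) == maximum(v)   # true for NaN-free input
```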