[Open] giordano opened this issue 2 years ago
The fast path:

```julia
f(n) = Base._mapreduce_dim(i->4/(4i-2n-3), Base.add_sum, Base._InitialValue(), 1:n, :)
```

dispatches to

```julia
## so totally different path
_mapreduce_dim(f, op, ::_InitialValue, A::AbstractArrayOrBroadcasted, ::Colon) =
    _mapreduce(f, op, IndexStyle(A), A)
```

and, for arrays longer than 16 elements, eventually calls:

```julia
mapreduce_impl(f, op, A, first(inds), last(inds))
mapreduce_impl(f, op, A::AbstractArrayOrBroadcasted, ifirst::Integer, ilast::Integer, blksize=1024)
```
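A simplified model of what `mapreduce_impl` does (my own sketch, not the Base implementation): blocks of up to `blksize` elements are summed with a `@simd` loop, and longer ranges are split in half and combined, i.e. a pairwise reduction.

```julia
# Hypothetical model of the blocked pairwise reduction in mapreduce_impl.
function pairwise_sum(f, A, ifirst, ilast, blksize = 1024)
    if ilast - ifirst < blksize
        # Base case: a @simd loop over one block, which LLVM can vectorize.
        s = f(A[ifirst])
        @simd for i in (ifirst + 1):ilast
            s += f(A[i])
        end
        return s
    else
        # Recursive case: split the range in half and combine the halves.
        imid = (ifirst + ilast) >> 1
        return pairwise_sum(f, A, ifirst, imid, blksize) +
               pairwise_sum(f, A, imid + 1, ilast, blksize)
    end
end
```

The pairwise splitting keeps the accumulated rounding error small while the `@simd` base case gives LLVM permission to vectorize each block.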
The slow path:

```julia
g(n) = Base.mapfoldl_impl(i->4/(4i-2n-3), Base.add_sum, Base._InitialValue(), 1:n)
```

Our fold routine has no `@simd`, so we need `@fastmath` here to make LLVM vectorize the sum reduction:

```julia
g_fast(n) = @fastmath mapfoldl(i -> 4/(4i-2n-3), +, 1:n)
```

In fact, `g_fast` should be faster than `f`, as it doesn't use pairwise reduction.
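To illustrate why the fold needs the annotation, here is a minimal sketch (function names are mine): a strict left fold cannot be vectorized because LLVM must preserve the addition order, while `@simd` explicitly grants permission to reassociate.

```julia
# Strict left-to-right fold: LLVM must keep the addition order,
# so it cannot accumulate partial sums in SIMD lanes.
function foldl_sum(v)
    s = 0.0
    for i in eachindex(v)
        s += v[i]
    end
    return s
end

# @simd allows reassociating the floating-point additions,
# so this loop can be vectorized.
function simd_sum(v)
    s = 0.0
    @simd for i in eachindex(v)
        s += v[i]
    end
    return s
end
```

The two functions compute the same mathematical sum but may differ in the last bits, because the summation order differs.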
Let me repeat my suggestion from #49763: `sum` and `prod` use `add_sum` and `mul_prod`. What about making these functions use the fast versions of `+` and `*`? I think there is no fixed order of the terms required for `sum` and `prod`. Moreover, the fast versions seem to respect `NaN` (at least on my machine). Then we would get:
```julia
julia> Base.add_sum(x::Real, y::Real) = @fastmath x+y

julia> v = rand(1000);

julia> @btime sum($v);
  72.881 ns (0 allocations: 0 bytes)   # before: 73.487 ns

julia> @btime sum($v; init = 0.0);
  71.168 ns (0 allocations: 0 bytes)   # before: 1.020 μs
```
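As an aside on the "no fixed order" point (my own example, not from the issue): floating-point addition is not associative, so different summation orders can give different results, and that reordering freedom is exactly what `@fastmath` hands to the compiler.

```julia
# The exact answer is 1.0 in both cases, but the order matters:
# 1.0 is absorbed when added to 1e16 first, yet survives when the
# two large terms cancel first.
foldl(+, [1.0, 1e16, -1e16])  # 0.0
foldl(+, [1e16, -1e16, 1.0])  # 1.0
```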
In order to extend this to `maximum` and `minimum`, there are two issues:

- `max_fast` and `min_fast` do not vectorize.
- Versions that do vectorize do not handle `NaN` as it's done now.

For example, the function
```julia
@generated function fastmax(x::Float64, y::Float64)
    quote
        $(Expr(:meta, :inline));
        Base.llvmcall("""
            %f = fcmp fast ugt double %0, %1
            %r = select i1 %f, double %0, double %1
            ret double %r
            """, Float64, Tuple{Float64,Float64}, x, y)
    end
end
```
vectorizes, but it doesn't always return `NaN` if one of the arguments is `NaN`. Maybe one could add a flag to `maximum` and `minimum` to indicate that such a function can be used. It would be much faster than the current implementation:
```julia
julia> v = rand(1000);

# currently:
julia> @btime maximum($v);
  1.019 μs (0 allocations: 0 bytes)

julia> @btime maximum($v; init = 0.0);
  1.988 μs (0 allocations: 0 bytes)

julia> @btime @fastmath reduce(max, $v);
  1.024 μs (0 allocations: 0 bytes)

julia> @btime @fastmath reduce(max, $v; init = 0.0);
  1.020 μs (0 allocations: 0 bytes)

# new:
julia> @btime reduce(fastmax, $v);
  74.428 ns (0 allocations: 0 bytes)

julia> @btime reduce(fastmax, $v; init = 0.0);
  71.370 ns (0 allocations: 0 bytes)
```
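To see why `fastmax` can drop a `NaN`, here is a plain-Julia model of the `fcmp ugt` + `select` pair (my own sketch; the `fast` flag may additionally let LLVM assume `NaN` never occurs at all):

```julia
# `fcmp ugt` is true when the operands are unordered (either is NaN)
# or when x > y; `select` then picks x. A NaN in x is returned,
# but a NaN in y is silently discarded.
ugt_max(x, y) = (isnan(x) || isnan(y) || x > y) ? x : y

ugt_max(NaN, 1.0)  # NaN: the operands are unordered, so x is selected
ugt_max(1.0, NaN)  # 1.0: the comparison is still true, so x wins
```

So whether a `NaN` survives a `reduce(fastmax, v)` depends on where it sits in the accumulation, which is why this cannot silently replace the current `maximum`.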
In https://github.com/niklas-heer/speed-comparison we noticed that `sum(f, itr; init)` is slower than `sum(f, itr)`, which is a pity because `sum(f, itr; init)` can be statically compiled, while `sum(f, itr)` cannot (because of some internal function throwing).
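The throwing path in question is presumably the empty-collection error: without `init`, reducing an empty iterator whose element type is not known statically has to be able to throw, while supplying `init` gives the reduction a well-defined empty result. A small illustration (my own, under that assumption):

```julia
# Without init, an empty reduction over a generator must keep a
# runtime error branch:
#   sum(x for x in 1:0)  # ArgumentError: reducing over an empty collection...
# With init, the empty case is well-defined and no throwing path is needed:
sum((x for x in 1:0); init = 0)  # 0
```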