JuliaStats / Statistics.jl

The Statistics stdlib that ships with Julia.
https://juliastats.org/Statistics.jl/dev/
Other
71 stars 40 forks source link

Call user function only once in `mean` #151

Open nalimilan opened 12 months ago

nalimilan commented 12 months ago

Override the standard mapreduce machinery to promote accumulator type. This avoid calling the function twice, which can be confusing.

Fixes #49 but with a more general solution than https://github.com/JuliaStats/Statistics.jl/pull/80/.

Cc: @stevengj @kagalenko-m-b

Unfortunately, performance drops for a Float64 vector:

julia> x = rand(100_000_000);

julia> y = ifelse.(rand(Bool, length(x)) .> .25, missing, x);

julia> z = Vector{Union{Missing, Float64, Int}}(y);

julia> z .= ifelse.(rand(Bool, length(y)) .> .25, round.(Int, x), z);

# master
julia> @btime mean(x);
  68.582 ms (1 allocation: 16 bytes)

julia> @btime mean(y);
  472.870 ms (1 allocation: 16 bytes)

julia> @btime mean(z);
  533.646 ms (3 allocations: 48 bytes)

# PR
julia> @btime mean(x);
  128.785 ms (4 allocations: 64 bytes)

julia> @btime mean(y);
  506.305 ms (1 allocation: 16 bytes)

julia> @btime mean(z);
  598.425 ms (1 allocation: 16 bytes)

This is simply due to sum being slower when init is provided as it calls mapfoldl under the hood in that case:

julia> @btime sum(x);
  68.199 ms (1 allocation: 16 bytes)

julia> @btime sum(x, init=0.0);
  124.127 ms (1 allocation: 16 bytes)
nalimilan commented 12 months ago

The new commit implements a simpler approach which is fast for concrete eltypes but super slow for complex type unions as the compiler doesn't infer an element type more precise than Any


julia> @btime mean(x);
  68.529 ms (4 allocations: 64 bytes)

julia> @btime mean(y);
  550.489 ms (0 allocations: 0 bytes)

julia> @btime mean(z);
  2.752 s (25300984 allocations: 386.06 MiB)
kagalenko-m-b commented 12 months ago

It looks like you're reimplementing undocumented Base internals, doesn't that have potential to break in future Julia versions?

nalimilan commented 12 months ago

Yes, but given that Statistics is an stdlib any breakage would be noticed in Julia even before merging it. I'm more concerned about performance issues introduced by the PR...