JuliaStats / Statistics.jl

The Statistics stdlib that ships with Julia.
https://juliastats.org/Statistics.jl/dev/
Other
70 stars 40 forks source link

Improve `middle(::AbstractRange)` performance #116

Closed vyu closed 2 years ago

vyu commented 2 years ago

Currently, the behavior of middle is inconsistent between ranges and vectors with non-Real eltypes:

julia> using Statistics

julia> r = LinRange(1im, 2im, 2)
2-element LinRange{ComplexF64, Int64}:
 0.0+1.0im,0.0+2.0im

julia> middle(r)
0.0 + 1.5im

julia> middle(collect(r))
ERROR: MethodError: no method matching isless(::ComplexF64, ::ComplexF64)
[...]

This PR fixes this by restricting middle(::AbstractRange) to Real eltypes. Performance is improved by calling mean, which elides bounds checks and does not call length (which is slow for StepRange).

Also, the doc for middle(range) is incorrect since ranges are not always sorted (there might not be a canonical order on the eltype, as above), so I've removed it and moved its example to the middle(array) doc.

codecov-commenter commented 2 years ago

Codecov Report

Merging #116 (17bc7a7) into master (576db0f) will increase coverage by 0.01%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #116      +/-   ##
==========================================
+ Coverage   96.93%   96.94%   +0.01%     
==========================================
  Files           1        1              
  Lines         424      426       +2     
==========================================
+ Hits          411      413       +2     
  Misses         13       13              
Impacted Files Coverage Ξ”
src/Statistics.jl 96.94% <100.00%> (+0.01%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Ξ” = absolute <relative> (impact), ΓΈ = not affected, ? = missing data Powered by Codecov. Last update 576db0f...17bc7a7. Read the comment docs.

vyu commented 2 years ago

Thanks for the review @nalimilan! I've applied some suggestions.

Sorry for the confusion about the intent of this PR. This was originally intended to restrict the eltype of middle(::AbstractRange) as well, but @ararslan pointed out that's a breaking change, so I've removed that and this PR now has the following effects:

The performance improvement is from eliding bounds checks and avoiding calling length (from a[end], which calls lastindex(a)), which is slow for StepRange as it requires an integer division. Before:

julia> using BenchmarkTools

julia> using Statistics

julia> @btime middle($(Ref(1:1))[]);
  5.054 ns (0 allocations: 0 bytes)

julia> @btime middle($(Ref(1:1:1))[]);
  14.374 ns (0 allocations: 0 bytes)

After:

julia> @btime middle($(Ref(1:1))[]);
  3.627 ns (0 allocations: 0 bytes)

julia> @btime middle($(Ref(1:1:1))[]);
  4.341 ns (0 allocations: 0 bytes)

I can't really think of a test to add for performance. I wouldn't be adverse to adding a benchmark suite, but I've written two elsewhere (https://github.com/JuliaArrays/StaticArrays.jl/pull/952, https://github.com/JuliaCollections/DataStructures.jl/pull/641) and neither got merged, so I'm hesitant to write another one. πŸ˜…

ararslan commented 2 years ago

Fix middle for non-1-based ranges as @ararslan pointed out, but the AbstractRanges in Base are all 1-based so I don't have a simple test for this.

Who needs Base anyway? πŸ˜„

struct WeirdRange{T} <: AbstractRange{T}
    r::UnitRange{T}
end

Base.firstindex(wr::WeirdRange) = -2

Base.lastindex(wr::WeirdRange) = firstindex(wr) + length(wr) - 1

Base.length(wr::WeirdRange) = length(wr.r)

Base.axes(wr::WeirdRange) = (firstindex(wr):lastindex(wr),)

Base.step(wr::WeirdRange) = step(wr.r)

function Base.getindex(wr::WeirdRange, i::Integer)
    i in eachindex(wr) || throw(BoundsError(wr, i))
    return wr.r[i - firstindex(wr) + 1]
end

wr = WeirdRange(0:3)
middle(wr[1], wr[end])  # WRONG: returns 3.0
mean(wr)                # CORRECT: returns 1.5
nalimilan commented 2 years ago

OK. Not sure it's worth testing non-1 based ranges if they don't exist anywhere TBH... But middle(0:typemax(Int)) sounds easy enough to test.

vyu commented 2 years ago

Okay, I've added a test for middle(0:typemax(Int)). Also fixed a CI failure on 1.7 due to a change of exception type in 1.8 by adding a version check.