Performance regression in summing over StepRange and UnitRange (no longer O(1))

KristofferC commented 3 years ago

Code to repro:

using BenchmarkTools

A = 100000:-1:1
function perf_sumcartesian(A)
    s = zero(eltype(A))
    for I in CartesianIndices(size(A))
        val = Base.unsafe_getindex(A, I)
        s += val
    end
    return s
end

@btime perf_sumcartesian($A)

On master:

  52.115 μs (0 allocations: 0 bytes)

On 1.5:

  5.850 ns (0 allocations: 0 bytes

KristofferC commented 3 years ago

Bisected to 17a3c7702e2cb20171d1211606343fc50533a588 cc @johnnychen94

kimikage commented 3 years ago

Just FYI, the same workaround as #38086 is available. However, this issue is not the issue with SIMD ( simd_index). The @simd here only assists in loop unrolling (and/or LICM), not vectorization.

function perf_sumcartesian_simd(A)
    s = zero(eltype(A))
    @inbounds @simd for I in CartesianIndices(size(A))
        val = Base.unsafe_getindex(A, I)
        s += val
    end
    return s
end

julia> @btime perf_sumcartesian_simd($A)
  6.000 ns (0 allocations: 0 bytes)

kimikage commented 3 years ago

Changing the implementation of the iterator will change the optimization result. However, I cannot find any problem with the current implementation of iterate in terms of inlining cost, type stability or bounds checking.

using Base.IteratorsMD: __inc
@inline function Base.iterate(iter::CartesianIndices{1}, state) # FIXME: this is just for experimentation
    next = iterate(iter.indices[1], state.I[1])
    next === nothing && return nothing
    return CartesianIndex(next[1]), CartesianIndex(next[1])
end

Edit This is just for experimentation, but I think it might be a good idea to delegate all the calculations of range and step to the iterators. However, since there is no guarantee that the index and state will match, each must be managed separately. That can increase the overhead in high dimensional indices.

kimikage commented 3 years ago

It does not seem to be optimized when using __is_valid_range. However, it is better to have it for safety. Iterators such as UnitRange are not range-checked, though. https://github.com/JuliaLang/julia/blob/7853ddda3481dac54fe9f1b77d47721476afa295/base/multidimensional.jl#L412-L420

julia> iterate(1:10, -100)
(-99, -99)

julia> iterate(1:10, 100)
(101, 101)

Edit: The implementation of state is not public, so I don't think we need to take care of the cases where the user breaks the state.

kimikage commented 3 years ago

This is just for experimentation, but I think it might be a good idea to delegate all the calculations of range and step to the iterators.

The implementation is quite simple. However, it is unclear whether this breaking change could be widely accepted. (Edit: Of course, I won't be proposing this for v1.6.)

cc: @johnnychen94, @timholy

Edit2: Hold on a second. Isn't this a reinvention of ProductIterator? :joy:

my original intention

```julia import Base: iterate using Base: tail, heads using Base.Iterators: Reverse # in this style, # state = ((ind1, state1), (ind2, state2), ..., (indN, stateN)) @inline function iterate(iter::CartesianIndices) next = _iterate_nested_iters(iter.indices) next === nothing && return nothing return CartesianIndex(map(n -> n[1], next)), next end @inline function iterate(iter::CartesianIndices, state) next = _iterate_nested_iters(iter.indices, state) next === nothing && return nothing return CartesianIndex(map(n -> n[1], next)), next end # please use `reverse()` instead of `Reverse()` @inline function iterate(r::Reverse{<:CartesianIndices}) iterate(reverse(r.itr)) end @inline function iterate(r::Reverse{<:CartesianIndices}, state) iterate(reverse(r.itr), state) end function _iterate_nested_iters(indices) next = iterate(indices[1]) next === nothing && return nothing length(indices) == 1 && return (next, ) tnext = _iterate_nested_iters(tail(indices)) tnext === nothing && return nothing return (next, tnext...) end function _iterate_nested_iters(indices, state) next = iterate(indices[1], state[1][2]) if next === nothing length(indices) == 1 && return nothing tnext = _iterate_nested_iters(tail(indices), tail(state)) tnext === nothing && return nothing h = iterate(indices[1]) h === nothing && return nothing return (h, tnext...) end return (next, tail(state)...) end ```

:sweat_smile:

using Base.Iterators
using Base.Iterators: ProductIterator

@inline function iterate(iter::CartesianIndices, state=nothing)
    p = ProductIterator(iter.indices)
    next = state === nothing ? iterate(p) : iterate(p, state)
    next === nothing && return nothing
    return CartesianIndex(next[1]), next[2]
end

kimikage commented 3 years ago

xref: #35074, #35036, #36832

brenhinkeller commented 1 year ago

Still reproduces on 1.8.3 and 1.9.0-alpha1

adienes commented 10 months ago

I believe no longer reproduces

JuliaLang / julia

Performance regression in summing over StepRange and UnitRange (no longer O(1)) #39700