JuliaArrays / OffsetArrays.jl

Fortran-like arrays with arbitrary, zero or negative starting indices.

getindex overhead #166

Open johnnychen94 opened 4 years ago

johnnychen94 commented 4 years ago

I've recently been building some cache arrays with OffsetArrays and realized that the performance bottleneck is getindex(::OffsetArray, I).

The benchmark results look interesting; I'm unsure why arr_sum runs faster on the OffsetArray 🤔 Any ideas?

using OffsetArrays
using BenchmarkTools  # for @btime

X = rand(4, 4, 4, 4, 4, 4);
XO = OffsetArray(X, -1, -2, -3, 1, 2, 3);

function arr_sum(X)
    val = zero(eltype(X))
    R = CartesianIndices(X)
    for i in R
        @inbounds val += X[i]
    end
    val
end

@btime arr_sum($X) # 5.215 μs (0 allocations: 0 bytes)
@btime arr_sum($XO) # 3.730 μs (0 allocations: 0 bytes)
@btime getindex($X, 1, 1, 1, 1, 1, 1) # 1.983 ns (0 allocations: 0 bytes)
@btime getindex($XO, 3, 2, 1, 2, 3, 4) # 5.855 ns (0 allocations: 0 bytes)

getindex_inbounds(X, inds...) = @inbounds X[inds...]
@btime getindex_inbounds($X, 1, 1, 1, 1, 1, 1) # 1.430 ns (0 allocations: 0 bytes)
@btime getindex_inbounds($XO, 3, 2, 1, 2, 3, 4) # 2.323 ns (0 allocations: 0 bytes)

The default checkbounds implementation definitely takes too long here. I believe the additional time is spent constructing the IdOffsetRange axes and in their generic, and thus slower, getindex.

julia> @btime axes($X);
  1.431 ns (0 allocations: 0 bytes)

julia> @btime axes($XO);
  4.763 ns (0 allocations: 0 bytes)

These might be benchmark artifacts, though.
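To make the claim above concrete, here is a rough sketch (my reading of the current OffsetArrays internals, not the literal implementation) of the work a checked scalar getindex has to do:

```julia
using OffsetArrays

X  = rand(4, 4, 4, 4, 4, 4)
XO = OffsetArray(X, -1, -2, -3, 1, 2, 3)

# A checked getindex(XO, I...) roughly amounts to:
#   1. build the offset axes (a tuple of IdOffsetRange)
#   2. check each index against its axis
#   3. subtract the offsets and index the parent Array
ax = axes(XO)                      # step 1: tuple of IdOffsetRange
checkbounds(XO, 3, 2, 1, 2, 3, 4)  # step 2: throws if out of bounds
val = parent(XO)[3 + 1, 2 + 2, 1 + 3, 2 - 1, 3 - 2, 4 - 3]  # step 3
```

Steps 1 and 2 are the part that a plain Array skips almost entirely, which matches the axes timings above.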

jishnub commented 4 years ago

For the record, I obtain the same run-times:

julia> @btime arr_sum($X);
  6.528 μs (0 allocations: 0 bytes)

julia> @btime arr_sum($XO);
  6.535 μs (0 allocations: 0 bytes)

This is odd, as isolated indexing is definitely faster with Arrays:

julia> @btime getindex($X, 1, 1, 1, 1, 1, 1);
  3.768 ns (0 allocations: 0 bytes)

julia> @btime getindex($XO, 3, 2, 1, 2, 3, 4);
  14.857 ns (0 allocations: 0 bytes)

julia> getindex_inbounds(X, inds...) = @inbounds X[inds...];

julia> @btime getindex_inbounds($X, 1, 1, 1, 1, 1, 1);
  2.694 ns (0 allocations: 0 bytes)

julia> @btime getindex_inbounds($XO, 3, 2, 1, 2, 3, 4);
  4.960 ns (0 allocations: 0 bytes)
johnnychen94 commented 4 years ago

> For the record, I obtain the same run-times

Sorry, I wasn't aware there's a performance regression in Julia 1.6-dev; see also https://github.com/JuliaLang/julia/issues/38073.

timholy commented 4 years ago

Is the "raw" indexing time relevant? Any performance-sensitive operation is presumably in a loop, and in a loop won't the compiler hoist all the slow parts out of the loop?

As evidence, if I define arr_sum_bc identically to arr_sum but without the @inbounds, here's what I get:
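For reference, that definition would read as follows (reconstructed from the description above; the _bc suffix is presumably "bounds-checked"):

```julia
# Identical to arr_sum, but without @inbounds, so the compiler must
# keep bounds checks for every X[i] in the loop
function arr_sum_bc(X)
    val = zero(eltype(X))
    R = CartesianIndices(X)
    for i in R
        val += X[i]
    end
    val
end
```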

julia> @btime arr_sum($X)
  5.992 μs (0 allocations: 0 bytes)
2069.986834289989

julia> @btime arr_sum($XO)
  4.093 μs (0 allocations: 0 bytes)
2069.986834289989

julia> @btime arr_sum_bc($X)
  7.791 μs (0 allocations: 0 bytes)
2069.986834289989

julia> @btime arr_sum_bc($XO)
  5.022 μs (0 allocations: 0 bytes)
2069.986834289989

Despite the fact that an isolated getindex is slower, in the context of a loop the slow parts are done just once, so the performance does not suffer.

Bad inlining could nix this, but since we've annotated with @inline I doubt that will be a problem for most uses.
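One quick way to check for bad inlining in a given use (my suggestion, not something from the thread) is to inspect the code generated for a small wrapper and confirm the offset arithmetic is emitted inline rather than hidden behind a function call:

```julia
using OffsetArrays
using InteractiveUtils  # for @code_llvm

XO = OffsetArray(rand(4, 4), -1, -2)
f(A) = @inbounds A[0, 0]

# If getindex inlined properly, the printed IR shows plain address
# arithmetic and a load, with no residual call to a getindex function.
@code_llvm debuginfo=:none f(XO)
```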

I don't understand why OffsetArrays are faster than Arrays, though.