JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

performance regression on high-dimensional array iteration using CartesianIndices (no simd) #38073

Open johnnychen94 opened 3 years ago

johnnychen94 commented 3 years ago

It turns out that #37829 increased iteration performance for 2d arrays, while slowing down iteration for higher-dimensional (>= 4) arrays...

julia> using BenchmarkTools

julia> function arr_sum(X)
           val = zero(eltype(X))
           R = CartesianIndices(X)
           for i in R
               @inbounds val += X[i]
           end
           val
       end
arr_sum (generic function with 1 method)

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> @btime arr_sum($X)
  5.584 μs (0 allocations: 0 bytes) # 1.6.0-DEV.1262
  5.790 μs (0 allocations: 0 bytes) # 17a3c7702e2cb20171d1211606343fc50533a588
  3.575 μs (0 allocations: 0 bytes) # 9405bf51a726a6383e6911eeb4235ba21ab3daee
  3.572 μs (0 allocations: 0 bytes) # 1.5.2
  5.959 μs (0 allocations: 0 bytes) # 1.0.5

julia> X = rand(64, 64);

julia> @btime arr_sum($X)
  3.627 μs (0 allocations: 0 bytes) # 17a3c7702e2cb20171d1211606343fc50533a588
  3.734 μs (0 allocations: 0 bytes) # 1.5.2

SIMD and LinearIndices are not affected.

simd

```julia
julia> using BenchmarkTools

julia> function arr_sum_simd(X)
           val = zero(eltype(X))
           R = CartesianIndices(X)
           @simd for i in R
               @inbounds val += X[i]
           end
           val
       end
arr_sum_simd (generic function with 1 method)

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> @btime arr_sum_simd($X)
  3.593 μs (0 allocations: 0 bytes) # 1.6.0-DEV.1262
  3.827 μs (0 allocations: 0 bytes) # 1.5.2
  3.585 μs (0 allocations: 0 bytes) # 1.0.5
```
LinearIndices

```julia
julia> using BenchmarkTools

julia> function arr_sum_linear(X)
           val = zero(eltype(X))
           R = LinearIndices(X)
           for i in R
               @inbounds val += X[i]
           end
           val
       end
arr_sum_linear (generic function with 1 method)

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> @btime arr_sum_linear($X)
  3.707 μs (0 allocations: 0 bytes) # 1.6.0-DEV.1262
  3.626 μs (0 allocations: 0 bytes) # 1.5.2
  3.796 μs (0 allocations: 0 bytes) # 1.0.5
```
johnnychen94 commented 3 years ago

This is the benchmark result on arrays of 4096 elements at various dimensions.

OffsetArrays 1.3.1 is used as a comparison; it uses IdOffsetRange as its axes type. Interestingly, OffsetArray on Julia 1.6.0-DEV gets relatively the best performance.

(figure: benchmark plot of iteration time vs. number of dimensions for Array and OffsetArray on 1.5.2 and 1.6.0-DEV.1262)

dataframe

```text
12×4 DataFrame
│ Row │ 1.5.2      │ 1.5.2-OffsetArrays │ 1.6.0-DEV.1262 │ 1.6.0-DEV.1262-OffsetArrays │
│     │ Any        │ Any                │ Any            │ Any                         │
├─────┼────────────┼────────────────────┼────────────────┼─────────────────────────────┤
│ 1   │ 3.70512e-6 │ 3.70813e-6         │ 3.706e-6       │ 3.73338e-6                  │
│ 2   │ 3.99612e-6 │ 3.71112e-6         │ 3.70863e-6     │ 3.73338e-6                  │
│ 3   │ 3.7115e-6  │ 3.71537e-6         │ 3.86275e-6     │ 4.44312e-6                  │
│ 4   │ 3.71663e-6 │ 4.54988e-6         │ 4.746e-6       │ 3.83937e-6                  │
│ 5   │ 3.72325e-6 │ 3.73e-6            │ 4.20529e-6     │ 3.84325e-6                  │
│ 6   │ 3.72862e-6 │ 4.00812e-6         │ 5.94833e-6     │ 4.15257e-6                  │
│ 7   │ 3.73425e-6 │ 3.73938e-6         │ 6.735e-6       │ 3.7555e-6                   │
│ 8   │ 8.007e-6   │ 9.16033e-6         │ 7.81225e-6     │ 3.84125e-6                  │
│ 9   │ 9.262e-6   │ 1.1286e-5          │ 8.5115e-6      │ 3.8475e-6                   │
│ 10  │ 1.0439e-5  │ 1.191e-5           │ 1.1283e-5      │ 4.20012e-6                  │
│ 11  │ 1.1511e-5  │ 1.3294e-5          │ 1.2812e-5      │ 4.76257e-6                  │
│ 12  │ 1.3063e-5  │ 1.6122e-5          │ 1.2355e-5      │ 5.18617e-6                  │
```
test script

```julia
using Test
using BenchmarkTools
using Plots
using Random
using OffsetArrays # 1.3.1

function arr_sum(X)
    val = zero(eltype(X))
    R = CartesianIndices(X)
    for i in R
        @inbounds val += X[i]
    end
    val
end

sz_list = [
    (4096, ),
    (2048, 2),
    (1024, 2, 2),
    ( 512, 2, 2, 2),
    ( 256, 2, 2, 2, 2),
    ( 128, 2, 2, 2, 2, 2),
    (  64, 2, 2, 2, 2, 2, 2),
    (  32, 2, 2, 2, 2, 2, 2, 2),
    (  16, 2, 2, 2, 2, 2, 2, 2, 2),
    (   8, 2, 2, 2, 2, 2, 2, 2, 2, 2),
    (   4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
    (   2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
]
X = axes(sz_list, 1)

### run in Julia 1.5.2
sz = sz_list[2]
A = rand(sz...);
AO = OffsetArray(A, OffsetArrays.Origin(1));
@assert arr_sum(A) == arr_sum(AO)

T15O = []
for sz in sz_list
    t = @belapsed arr_sum(A) setup=(A=OffsetArray(rand($sz...), OffsetArrays.Origin(1)))
    @show sz t
    push!(T15O, t)
end
T15 = []
for sz in sz_list
    t = @belapsed arr_sum(A) setup=(A=rand($sz...))
    @show sz t
    push!(T15, t)
end

### run in Julia 1.6.0-DEV
sz = sz_list[2]
Random.seed!(0)
A = rand(sz...);
AO = OffsetArray(A, OffsetArrays.Origin(1));
@assert arr_sum(A) == arr_sum(AO)

T16O = []
for sz in sz_list
    t = @belapsed arr_sum(A) setup=(A=OffsetArray(rand($sz...), OffsetArrays.Origin(1)))
    @show sz t
    push!(T16O, t)
end
T16 = []
for sz in sz_list
    t = @belapsed arr_sum(A) setup=(A=rand($sz...))
    @show sz t
    push!(T16, t)
end

scatter(X, [T15, T16], labels="Array", markershape=:circle)
scatter!(X, [T15O, T16O], labels="OffsetArray", markershape=:rect)
plot!(X, [T15, T15O], labels="1.5.2", linestyle=:dash)
plot!(X, [T16, T16O], labels="1.6.0-DEV.1262", legend=:topleft)
```
timholy commented 3 years ago

Interesting. So basically everything in 1.6 is no worse than 1.5, with the minor exception of Array in 6 and 7 dimensions?

kimikage commented 3 years ago

I read https://github.com/JuliaLang/julia/issues/38086#issuecomment-715495220 and wondered if there was something going on with the bounds check (or inbounds propagation).

function arr_sum_both(X)
    val = zero(eltype(X))
    R = CartesianIndices(X)
    @inbounds for i in R
        @inbounds val += X[i]
    end
    val
end

function arr_sum_outeronly(X)
    val = zero(eltype(X))
    R = CartesianIndices(X)
    @inbounds for i in R
        val += X[i]
    end
    val
end
julia> VERSION
v"1.6.0-DEV.1322"

julia> @btime arr_sum($X);
  5.033 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_both($X);
  5.033 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_outeronly($X);
  5.267 μs (0 allocations: 0 bytes)

This may be a separate issue, but it is weird that arr_sum_outeronly is slower.

johnnychen94 commented 3 years ago

This is what I get now with two repeated benchmarks:

julia> VERSION
v"1.7.0-DEV.36"

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> @btime arr_sum($X);
  5.540 μs (0 allocations: 0 bytes)
  5.538 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_both($X);
  5.221 μs (0 allocations: 0 bytes)
  5.582 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_outeronly($X);
  5.110 μs (0 allocations: 0 bytes)
  5.223 μs (0 allocations: 0 bytes)

This difference might just be noise.
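One way to check whether such differences are real or just timing noise (a generic sketch using standard BenchmarkTools API, not something from this thread) is to compare the full trial distributions with `judge`, which classifies the change as a regression, improvement, or invariant relative to a tolerance:

```julia
using BenchmarkTools

# Assumes arr_sum_both and arr_sum_outeronly are defined as above.
X = rand(4, 4, 4, 4, 4, 4)

t_both  = @benchmark arr_sum_both($X)
t_outer = @benchmark arr_sum_outeronly($X)

# Compare medians; differences within 5% are reported as :invariant.
judge(median(t_outer), median(t_both); time_tolerance = 0.05)
```

Single `@btime` numbers can easily differ by a few percent between runs, so `judge` with a tolerance is a more defensible way to claim a regression.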

kimikage commented 3 years ago

I also checked again and found no difference in the native code depending on the position of @inbounds. However, it seems certain that the noise is not random. The internal state of the CPU, such as the caches, may affect it in this case.
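For reference, the generated code for the two `@inbounds` placements can be compared directly with standard reflection macros (this is just the usual tooling, not specific to this issue); if the emitted IR is identical, any remaining timing difference must come from the machine state rather than the code:

```julia
# Assumes arr_sum_both and arr_sum_outeronly are defined as above.
X = rand(4, 4, 4, 4, 4, 4)

# Print the optimized LLVM IR for each variant; diff the two outputs.
@code_llvm debuginfo=:none arr_sum_both(X)
@code_llvm debuginfo=:none arr_sum_outeronly(X)
```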

On the other hand, in the case of https://github.com/JuliaLang/julia/issues/38086#issuecomment-715495220, there is a clear difference.

KristofferC commented 3 years ago

It seems this caused a similar problem in https://discourse.julialang.org/t/drop-of-performances-with-julia-1-6-0-for-interpolationkernels/58085, which was fixed by https://github.com/JuliaLang/julia/pull/39333.

https://discourse.julialang.org/t/drop-of-performances-with-julia-1-6-0-for-interpolationkernels/58085/12 has an MWE.

Adding @inbounds @simd outside the Cartesian loop seems to work around it.
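A sketch of that workaround applied to the original reproducer (essentially the `arr_sum_simd` variant from earlier in this thread, with `@inbounds` hoisted onto the loop):

```julia
function arr_sum_workaround(X)
    val = zero(eltype(X))
    # @inbounds @simd on the CartesianIndices loop recovers the
    # vectorization that the plain loop loses on 1.6 for
    # high-dimensional arrays.
    @inbounds @simd for i in CartesianIndices(X)
        val += X[i]
    end
    val
end
```

Note that `@simd` can reorder the floating-point additions, so the result may differ from the plain loop at the level of rounding error.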

kimikage commented 3 years ago

IIRC, this might have something to do with https://github.com/JuliaLang/julia/issues/39700#issuecomment-781030776

However, I don't know what countermeasures are possible on the compiler side.