JuliaArrays / OffsetArrays.jl

Fortran-like arrays with arbitrary, zero or negative starting indices.
Other
195 stars 40 forks source link

Problem with sum over OffsetArray #329

Open weymouth opened 1 year ago

weymouth commented 1 year ago

It's me again. That guy who keeps finding bugs when using sum. In Polyester, in KernelAbstractions, and now in OffsetArrays. Sigh...

Here is a case which slows down a loop by 25x(!) when using a sum over an OffsetArray. Expanding the sum into a for loop restores the normal performance when using Array.

## Set up
# Cartesian step
@inline δ(i, I::CartesianIndex{m}) where{m} = CartesianIndex(ntuple(j -> j==i ? 1 : 0, m))

# Finite difference
@inline ∂(a,I::CartesianIndex{m},u::AbstractArray{T,n}) where {T,n,m} = @inbounds u[I+δ(a,I),a]-u[I,a]

# Divergence using sum function
@fastmath @inline div_sum(I::CartesianIndex{m},u) where {m} = sum(@inbounds ∂(i,I,u) for i ∈ 1:m)

# Divergence expanded using for loop
@fastmath @inline function div_expanded(I::CartesianIndex{m},u) where {m} 
    init=zero(eltype(u))
    for i in 1:m
     init += @inbounds ∂(i,I,u)
    end
    return init
end

# loop over the range R
loop!(div,u,f,R) = @inbounds @simd for I ∈ R
    div[I] = f(I,u)
end

## Test case
# Benchmark with Arrays
N = (2^10+2,2^10+2) # size of the array including halo
R = CartesianIndices(map(n->n-2,N)).+CartesianIndex(map(n->1,N)) # range inside the halo
div = zeros(Float32,N);
u = rand(Float32,N...,length(N));
using BenchmarkTools
@btime loop!($div,$u,$div_sum,$R) # 188.200 μs (0 allocations: 0 bytes)
@btime loop!($div,$u,$div_expanded,$R) # 193.400 μs (0 allocations: 0 bytes)

# Benchmark with OA
using OffsetArrays: Origin
divOA = Origin(0)(div);
uOA = Origin((zeros(Int,length(N))...,1))(u);
ROA = CartesianIndices(map(n->n-2,N))
@btime loop!($divOA,$uOA,$div_expanded,$ROA) # 200.000 μs (0 allocations: 0 bytes)
@btime loop!($divOA,$uOA,$div_sum,$ROA) # 4.960 ms (0 allocations: 0 bytes) WTH!?!
jishnub commented 1 year ago

Reduced to

julia> @btime mapfoldl(identity, +, (∂(1, I, $uOA) for I in CartesianIndices((1:10, 1:1))));
  54.481 ns (0 allocations: 0 bytes)

julia> @btime mapfoldl(identity, +, (∂(1, I, $u) for I in CartesianIndices((1:10, 1:1))));
  19.071 ns (0 allocations: 0 bytes)

It seems the iteration is slow

julia> @btime (u -> iterate(∂(1, I, u) for I in CartesianIndices((1:10, 1:1))))($uOA);
  12.965 ns (0 allocations: 0 bytes)

julia> @btime (u -> iterate(∂(1, I, u) for I in CartesianIndices((1:10, 1:1))))($u);
  3.365 ns (0 allocations: 0 bytes)

Oddly,

julia> @btime ∂(1, CartesianIndex(1,1), $uOA);
  3.381 ns (0 allocations: 0 bytes)

julia> @btime (I -> ∂(1, I, $uOA))(CartesianIndex(1,1));
  7.636 ns (0 allocations: 0 bytes)