JuliaLang / julia

Reduction on array view is slow #28848

Open · ZeppLu opened this issue 5 years ago

ZeppLu commented 5 years ago
julia> using BenchmarkTools; const X = rand(1000, 1000);

julia> sum(X); @benchmark sum(X)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     355.773 μs (0.00% GC)
  median time:      406.994 μs (0.00% GC)
  mean time:        453.882 μs (0.00% GC)
  maximum time:     2.171 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> sum(@view X[1:999, 1:999]); @benchmark sum(@view X[1:999, 1:999])
BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  1
  --------------
  minimum time:     1.782 ms (0.00% GC)
  median time:      1.795 ms (0.00% GC)
  mean time:        1.815 ms (0.00% GC)
  maximum time:     2.616 ms (0.00% GC)
  --------------
  samples:          2750
  evals/sample:     1

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

Compared to Python:

In [1]: import numpy as np; X = np.random.random([1000, 1000])

In [2]: %timeit np.sum(X)
377 µs ± 5.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %timeit np.sum(X[:999, :999])
752 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This issue holds for prod() as well.
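
For reference, the same benchmark pattern as above can be reused to check this; just the reproduction sketch, timings omitted:

# Same benchmark pattern, applied to prod(); results not shown here.
using BenchmarkTools
@benchmark prod(X)
@benchmark prod(@view X[1:999, 1:999])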

mbauman commented 5 years ago

Oh, that's interesting. Nice find, and thanks for the report. I believe the difference here is that NumPy pre-computes the derived strides, whereas we compute them dynamically; this general structure is what allows us to support all classes of indexing within view. That said, it might be possible for us to eke out a bit more performance in specific cases like this one.

(Edit: thanks, prof :) )
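
To illustrate the stride idea, here is a minimal sketch (not NumPy's or Base's actual code; it assumes the view's indices are plain integer ranges into a dense Array parent): compute the parent's column stride once and walk the data with integer arithmetic, which should recover most of the speed of the linear-indexing path.

function strided_sum(V::SubArray{<:Any,2})
    P = parent(V)
    r1, r2 = parentindices(V)     # assumes two range indices into the parent
    colstride = stride(P, 2)      # precomputed once, as NumPy would do
    s = zero(eltype(V))
    @inbounds for j in r2
        base = (j - 1) * colstride
        @simd for i in r1
            s += P[base + i]      # plain linear indexing into the parent
        end
    end
    return s
end

# strided_sum(@view X[1:999, 1:999]) should match sum(@view X[1:999, 1:999])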

stevengj commented 5 years ago

(s/eek/eke/)

ZeppLu commented 5 years ago

One more interesting result:

julia> function mysum(X)
           s = zero(eltype(X))
           @simd for i = eachindex(X)
               @inbounds s += X[i]
           end
           s
       end
mysum (generic function with 1 method)

julia> mysum(@view X[1:999,1:999]) ≈ sum(@view X[1:999,1:999])
true

julia> @benchmark mysum(@view X[1:999,1:999])
BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  1
  --------------
  minimum time:     358.658 μs (0.00% GC)
  median time:      397.377 μs (0.00% GC)
  mean time:        426.762 μs (0.00% GC)
  maximum time:     1.681 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> function mysum_nosimd(X)
           s = zero(eltype(X))
           for i = eachindex(X)
               @inbounds s += X[i]
           end
           s
       end
mysum_nosimd (generic function with 1 method)

julia> @benchmark mysum_nosimd(@view X[1:999, 1:999])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.531 ms (0.00% GC)
  median time:      1.543 ms (0.00% GC)
  mean time:        1.577 ms (0.00% GC)
  maximum time:     3.486 ms (0.00% GC)
  --------------
  samples:          3163
  evals/sample:     1

ZeppLu commented 5 years ago

I dug into how sum() works and found that sum() calls mapreduce(), which eventually turns into something like Base._mapreduce_dim(identity, +, NamedTuple(), view(X, 1:999, 1:999), :). If I had a debugger I'd be happy to keep digging, though.
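
For what it's worth, the chain can also be followed without a debugger using @which / @edit from InteractiveUtils (loaded by default in the REPL), e.g.:

using InteractiveUtils
Xv = @view X[1:999, 1:999]
@which sum(Xv)                      # which method handles the view
@which mapreduce(identity, +, Xv)   # the next hop toward Base._mapreduce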

mbauman commented 5 years ago

I was just doing the same — but with the new Rebugger.jl. I highly recommend it.

We basically have two different implementations of mapreduce: a highly optimized one for arrays that support fast linear indexing, and a fallback for all other iterables. We could add one or two intermediate optimizations between those two bookends.
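
As a rough illustration of what such an intermediate method might look like (invented names, not Base's actual code), one could dispatch on IndexStyle and give IndexCartesian matrices a nested loop whose inner dimension @simd can vectorize, instead of sending them to the generic iterator fallback:

mysum2(A::AbstractArray) = mysum2(IndexStyle(A), A)

mysum2(::IndexLinear, A) = sum(A)            # keep the existing fast path

function mysum2(::IndexCartesian, A::AbstractMatrix)
    s = zero(eltype(A))
    @inbounds for j in axes(A, 2)            # outer loop over columns
        @simd for i in axes(A, 1)            # contiguous inner loop
            s += A[i, j]
        end
    end
    return s
end

# mysum2(@view X[1:999, 1:999]) should match sum(@view X[1:999, 1:999])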

el-oso commented 8 months ago

I don't know whether this is related to this issue, but I also found an accuracy problem, which I think is more troublesome because it also affects DataFrames.jl.

Here is a minimal reproducible example:

using Distributions
using Random

# It doesn't always happen; here is a seed where it does.
rng = MersenneTwister(630);
v = rand(rng, Normal(zero(Float32), one(Float32)), 1000)
sa = @view v[collect(1:end)]

# View (as SubArray) vs Vector
sum(sa) ≈ sum(v) # false
# They differ, and the worse part is that the view's result is the less accurate one (judged against Kahan compensated summation).

IndexStyle(v) isa IndexLinear # true
IndexStyle(sa) isa IndexCartesian # true

# They are dispatched to different implementations of _mapreduce in base/reduce.jl
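
Since the accuracy comparison above is judged against Kahan compensated summation, here is a minimal reference implementation (not part of Base; included only so that check is reproducible):

function kahan_sum(xs)
    s = zero(eltype(xs))
    c = zero(eltype(xs))        # running compensation for lost low-order bits
    for x in xs
        y = x - c
        t = s + y
        c = (t - s) - y         # recover the part of y that didn't make it into s
        s = t
    end
    return s
end

# kahan_sum(v) can then be compared against sum(v) and sum(sa)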

Perhaps I should open a separate issue.