JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Mapreduce is 10x slower than loop. #38558

Open oscardssmith opened 3 years ago

oscardssmith commented 3 years ago

A slack conversation led me to the realization that currently the following two functions have very different speeds.

```julia
function f(x, y)
    return @inbounds mapreduce(==, +, x, y)
end

function f2(x, y)
    total = 0
    @inbounds for i in 1:length(x)
        total += x[i] == y[i]
    end
    return total
end
```

```julia
x = randn(10240); y = similar(x)

@btime f2(x, y)
  1.255 μs (0 allocations: 0 bytes)

@btime f(x, y)
  9.207 μs (1 allocation: 10.19 KiB)
```

This is unfortunate, since the mapreduce version is much clearer. Is there any way we can make this case of mapreduce avoid the O(n) allocation (or otherwise be faster)?
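The allocation comes from the multi-argument method of `mapreduce`, which (in the Julia versions discussed here) materializes the mapped array before reducing, roughly `reduce(op, map(f, x, y))`. A minimal sketch of that equivalence:

```julia
# Sketch (assumption): the multi-argument mapreduce behaves like
# reduce(op, map(f, x, y)), so a length-n temporary Vector{Bool} is built.
x = [1, 2, 3, 4]; y = [1, 0, 3, 0]
tmp = map(==, x, y)                  # intermediate Vector{Bool} of length 4
@assert reduce(+, tmp) == mapreduce(==, +, x, y) == 2
```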

jishnub commented 3 years ago

Perhaps this is because map allocates an intermediate array? The performance improves noticeably using MappedArrays:

```julia
julia> @btime f($x, $y);
  22.352 μs (1 allocation: 10.19 KiB)

julia> @btime f2($x, $y);
  2.178 μs (0 allocations: 0 bytes)

julia> using MappedArrays

julia> function f3(x, y)
           reduce(+, mappedarray(==, x, y))
       end
f3 (generic function with 1 method)

julia> @btime f3($x, $y);
  4.934 μs (0 allocations: 0 bytes)
```

Still not quite as good as the loop, but almost there.
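MappedArrays sidesteps the temporary by computing elements lazily on `getindex`. A minimal sketch of the idea (a simplified stand-in, not the package's actual implementation, and it hard-codes a `Bool` element type):

```julia
# Simplified sketch of a lazy elementwise-mapped vector. MappedArrays.jl is
# far more general (arbitrary element types, dimensions, invertible maps).
struct Mapped2{F,A<:AbstractVector,B<:AbstractVector} <: AbstractVector{Bool}
    f::F
    a::A
    b::B
end
Base.size(m::Mapped2) = size(m.a)
Base.getindex(m::Mapped2, i::Int) = m.f(m.a[i], m.b[i])  # computed on demand

lazy = Mapped2(==, [1, 2, 3], [1, 0, 3])
@assert reduce(+, lazy) == 2   # no length-n temporary is materialized
```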

timholy commented 3 years ago

If Generators were faster we could just delete the specialization for multiple arrays and use https://github.com/JuliaLang/julia/blob/19b3ec579e8da7b07ae667cd34ab9ede95d4a2a6/base/reduce.jl#L288

Generators are like generalized MappedArrays, but unfortunately they have never quite reached the same performance. Working on their performance might be the best way to tackle this issue.
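The Generator route would look something like this (a sketch, not the Base implementation):

```julia
# Sketch: feed mapreduce a Generator so no intermediate array is built.
f_gen(x, y) = mapreduce(identity, +, (xi == yi for (xi, yi) in zip(x, y)))
@assert f_gen([1, 2, 3], [1, 0, 3]) == 2
```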

antoine-levitt commented 3 years ago

> Generators are like generalized MappedArrays, but unfortunately they have never quite reached the same performance. Working on their performance might be the best way to tackle this issue.

Yes please. Generators are super useful and I use them all the time (much more than higher-order functions).

JeffBezanson commented 3 years ago

I get 1.3 μs for the loop, 10.4 μs for mapreduce, and 7.2 μs for the generator version (no allocations). So generators are already faster in this case, but yes, they need to be faster still.

timholy commented 3 years ago

As a reasonable goalpost, here's one more timing comparison:

For me the gap between the explicit loop and MappedArrays is very small.

stevengj commented 3 years ago

This also has no allocations and is faster, but it's still not as fast as the loop:

```julia
f4(x, y) = mapreduce(((x, y),) -> x == y, +, zip(x, y))
```
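The `((x, y),)` syntax here is a one-argument anonymous function whose single tuple argument is destructured; a minimal check of the same pattern outside `mapreduce` (the `unpack_eq` name is just for illustration):

```julia
# `((a, b),)` takes one argument (a 2-tuple) and destructures it.
unpack_eq = ((a, b),) -> a == b
@assert unpack_eq((3, 3)) == true
@assert map(unpack_eq, zip([1, 2, 3], [1, 0, 3])) == [true, false, true]
```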

oscardssmith commented 3 years ago

That would be pretty easy to do automatically, right?

oscardssmith commented 3 years ago

@timholy is there any good reason why Generators are slow? Is there low hanging fruit? If you have a place to point me, I'd be glad to take a look.

wheeheee commented 9 months ago

It's been a while, but I think I get reasonable performance from this:

```julia
f1(x, y) = mapreduce(==, +, x, y)

function f2(x, y)
    total = 0
    for i in eachindex(x, y)
        total += @inbounds x[i] == y[i]
    end
    return total
end

function mapreduce_same(f, op, A::Vararg{AbstractArray,N}; kw...) where {N}
    tup_f(i) = f(ntuple(j -> @inbounds(A[j][i]), Val(N))...)
    mapreduce(tup_f, op, eachindex(A...); kw...)
end

f3(x, y) = mapreduce_same(==, +, x, y; init=0)
```

My benchmarks give

```julia
julia> @btime f1($x, $y);
  2.489 μs (1 allocation: 10.19 KiB)

julia> @btime f2($x, $y);
  1.250 μs (0 allocations: 0 bytes)

julia> @btime f3($x, $y);
  1.380 μs (0 allocations: 0 bytes)
```

There are a number of problems with this, like breaking current behaviour when (at least) some of the arguments are Vectors whose lengths differ from the other arrays', and not working with arbitrary indices. Aside from that, if I try to use LinearIndices like this:

```julia
function mapreduce_cart(f, op, A::Vararg{AbstractArray,N}; kw...) where {N}
    tup_f(i) = f(ntuple(j -> @inbounds(A[j][i]), Val(N))...)
    inds = eachindex(A...)
    mapreduce(tup_f, op, LinearIndices(inds); kw...)
end

f4(x, y) = mapreduce_cart(==, +, x, y; init=0)
```

Performance is degraded and the function now allocates.

```julia
julia> @btime f4($x, $y);
  1.550 μs (3 allocations: 80 bytes)
```

Why does this happen?

mcabbott commented 6 months ago

@wheeheee that's not a bad idea, similar to MappedArrays but in a few lines.

Am not sure what LinearIndices is there for; I think that instead you want to reshape in cases where eachindex gives a UnitRange, so that inds always has the same size as the first array. Then reductions with dims will also work:

```julia
function mapreduce_same_reshape(f, op, A::Vararg{AbstractArray,N}; kw...) where {N}
    tup_f(i) = f(ntuple(j -> @inbounds(A[j][i]), Val(N))...)
    # Reshaping to axes(A[1]) allows reductions with dims=2 etc.; but when
    # eachindex gives CartesianIndices, reshaping makes it slow and isn't necessary.
    ei = eachindex(A...)
    inds = ei isa AbstractUnitRange ? reshape(ei, axes(A[1])) : ei
    mapreduce(tup_f, op, inds; kw...)
end

mapreduce_same_reshape(*, +, [1 2; 3 4], [5 6; 7 8]; dims=1) == [26 44]
```

Benchmarks for all the above functions here:

https://gist.github.com/mcabbott/4746e69f321909c3ba209518dc0447bb

wheeheee commented 6 months ago

For the life of me, I can't remember what I used the LinearIndices for... Anyway, omitting `@inbounds` in `mapreduce_same` makes it faster, for this function at least. I don't know what compiler magic does this, but auto-vectorization is still quite brittle. Also, I remember noticing that sometimes adding type parameters for `f` and `op` made the function faster for >= 3 arrays.
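For reference, the `@inbounds`-free variant described above would look like this (a sketch; `eachindex(A...)` only yields indices valid for every array, which is presumably what lets the compiler elide the bounds checks on its own):

```julia
# mapreduce_same without @inbounds: since eachindex(A...) guarantees the
# indices are in bounds for all arrays, the checks can often be proven away.
function mapreduce_same_nb(f, op, A::Vararg{AbstractArray,N}; kw...) where {N}
    tup_f(i) = f(ntuple(j -> A[j][i], Val(N))...)
    mapreduce(tup_f, op, eachindex(A...); kw...)
end

@assert mapreduce_same_nb(==, +, [1, 2, 3], [1, 0, 3]; init=0) == 2
```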