JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Improve Performance: findmin/findmax #320

Open Ellipse0934 opened 4 years ago

Ellipse0934 commented 4 years ago

The current findmin and findmax functions are slow because the algorithm works in two phases (here): it first computes the minimum via mapreduce, then calls findfirstval to find its index. This also affects other functions that internally use findmin, such as argmin.
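For reference, the two-phase shape looks roughly like this (a sketch of the approach, not the actual source):

```julia
using CUDA

a = CUDA.rand(1024)
m = minimum(a)                # phase 1: a full reduction pass for the value
i = findfirst(isequal(m), a)  # phase 2: a second full pass for its index
```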

It may be possible to have a single mapreduce call which returns both the minimum value and the corresponding index, but I was unable to come up with such an abstraction.
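For illustration, the rough shape such a call could take on the CPU (a minimal sketch; the helper name is hypothetical, and NaN handling is ignored):

```julia
# Pair each element with its index, then reduce with a comparator that
# keeps the pair holding the smaller value (hypothetical helper name).
function mapreduce_findmin(a::AbstractVector)
    mapreduce(tuple, (t1, t2) -> t2[1] < t1[1] ? t2 : t1,
              a, eachindex(a))
end

mapreduce_findmin([3, 1, 2])  # (1, 2)
```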

Alternatively, we could have a function findminmax!(A, dims) similar to Base (here), but with the intention of returning both the minimum and the maximum, so that even extrema, which has not been implemented in CUDA.jl yet, could be built on top of it. A sketch of the idea is below.
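A single-pass reduction returning both extremes could look something like this (a hedged sketch; the name findminmax and the tuple representation are illustrative only):

```julia
# Map each element to a (lo, hi) pair, then merge pairs with min/max.
function findminmax(a::AbstractArray)
    mapreduce(x -> (x, x),
              ((lo1, hi1), (lo2, hi2)) -> (min(lo1, lo2), max(hi1, hi2)),
              a)
end

findminmax([3, 1, 2])  # (1, 3)
```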

And yes, I would be happy to work on this.

Ellipse0934 commented 3 years ago

Based on @tkf's implementation of findmax using mapreduce, which performed better than our current version, I have written a new version.

There is also a bug in the current implementation where NaN is not handled properly: the reduction phase returns NaN, but the subsequent equality search cannot locate it because NaN == NaN is false, so the index comes back as nothing.

julia> x = [1f0, 2, NaN32, 3];
julia> @test findmax(CuArray(x)) == findmax(x)
Test Failed at REPL[12]:1
  Expression: findmax(CuArray(x)) == findmax(x)
   Evaluated: (NaN32, nothing) == (NaN32, 3)

New version:

function Base.findmax(a::AnyCuArray; dims=:)
    # Floating-point method: handles NaN explicitly.
    function f(t1::T, t2::T) where T <: Tuple{AbstractFloat, I} where I
        (x, i), (y, j) = t1, t2
        # Order the pairs by index so NaNs and ties resolve to the
        # first occurrence, matching Base.
        if i > j
            t1, t2 = t2, t1
            (x, i), (y, j) = t1, t2
        end

        # Check for NaN first because NaN == NaN is false
        isnan(x) && return t1
        isnan(y) && return t2
        max(x, y) == x && return t1
        return t2
    end

    # Generic fallback for non-float element types.
    function f(t1, t2)
        (x, i), (y, j) = t1, t2

        x < y && return t2
        x == y && return (x, min(i, j))
        return t1
    end

    # Pair each element with its (linear or Cartesian) index;
    # typemin(eltype(a)) is a neutral init value for the reduction.
    indx = ndims(a) == 1 ? (eachindex(a), 1) :
                           (CartesianIndices(a), CartesianIndex{ndims(a)}())
    if dims == Colon()
        mapreduce(tuple, f, a, indx[1]; init = (typemin(eltype(a)), indx[2]))
    else
        res = mapreduce(tuple, f, a, indx[1];
                        init = (typemin(eltype(a)), indx[2]), dims=dims)
        vals = map(x->x[1], res)
        inds = map(x->x[2], res)
        return (vals, inds)
    end
end
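With this definition, the failing example above should behave like Base (expected behaviour; output not verified here):

```julia
x = [1f0, 2, NaN32, 3]
findmax(CuArray(x))  # should now agree with findmax(x) == (NaN32, 3)
```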

However, the performance isn't always better than the current version; I suspect there are a few reasons for this.

I have created a branch with implementations for findmax, findmin and findfirst along with relevant tests here.

Some benchmarks are:

| Benchmark | Old: min / median / mean / max (memory, allocs, samples) | New: min / median / mean / max (memory, allocs, samples) |
| --- | --- | --- |
| `A = CUDA.rand(100, 100, 100, 10); @benchmark CUDA.@sync findmax(A)` | 2.145 / 2.214 / 2.293 / 3.920 ms (6.63 KiB, 186 allocs, 2180 samples) | 3.544 / 3.752 / 3.802 / 11.979 ms (2.48 KiB, 82 allocs, 1315 samples) |
| `A = CUDA.ones(Float32, 100, 100, 100, 10); @benchmark CUDA.@sync findmax(A)` (testing collisions) | 7.161 / 7.506 / 7.525 / 8.614 ms (6.63 KiB, 186 allocs, 665 samples) | 3.539 / 3.760 / 3.800 / 5.335 ms (2.48 KiB, 82 allocs, 1316 samples) |
| `A = CUDA.rand(100, 100, 100, 10); @benchmark CUDA.@sync findmax(A; dims=2)` | 10.910 / 11.427 / 11.534 / 13.086 ms (5.81 KiB, 153 allocs, 434 samples) | 17.079 / 17.911 / 18.140 / 20.230 ms (1.36 KiB, 47 allocs, 276 samples) |
| `A = CUDA.ones(Float32, 100, 100, 100, 10); @benchmark CUDA.@sync findmax(A; dims=2)` | 6.427 / 6.798 / 6.815 / 8.884 ms (4.09 KiB, 140 allocs, 734 samples) | 17.079 / 17.911 / 18.140 / 20.230 ms (1.36 KiB, 47 allocs, 276 samples) |
| `A = CUDA.ones(100_000_000); @benchmark CUDA.@sync findmax(A)` | 59.127 / 59.949 / 60.815 / 64.964 ms (4.08 KiB, 139 allocs, 83 samples) | 3.319 / 3.561 / 3.583 / 4.743 ms (2.11 KiB, 78 allocs, 1395 samples) |

All runs reported 0.00% GC and 1 eval/sample.

I am unsure whether the version I implemented is PR-worthy performance-wise. The route to the best performance is probably a custom kernel which uses warp reduction for primitive types, then conducts a ballot for the minimum index, uses @atomic on local and later global memory, and finally stores the result directly. That seems time-consuming at the moment because we also need to take care of multi-dimensional Julia arrays, so I may revisit this at a later point. A sketch of the warp-level building block follows.
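As a hedged illustration of the warp-reduction step (a sketch only: it carries the index through the shuffle rather than using a separate ballot, assumes a warp size of 32 and a primitive element type supported by shfl_down_sync, and omits NaN handling and the shared-memory/atomic stages):

```julia
# Device-side helper: each lane holds a (value, index) pair; every
# shuffle step halves the number of live pairs until the first lane
# holds the warp's maximum value and the smallest index attaining it.
function warp_findmax(val, idx)
    offset = 16
    while offset > 0
        other_val = CUDA.shfl_down_sync(0xffffffff, val, offset)
        other_idx = CUDA.shfl_down_sync(0xffffffff, idx, offset)
        # Prefer the larger value; on ties keep the smaller index.
        if other_val > val || (other_val == val && other_idx < idx)
            val, idx = other_val, other_idx
        end
        offset >>= 1
    end
    return val, idx
end
```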

Ellipse0934 commented 3 years ago

Don't close this issue yet. One way to significantly improve performance is to use warp shuffle for finding the min/max of supported primitive types. Outlining a rough sketch of the steps; a kernel skeleton along these lines is below.
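For illustration only (a hypothetical skeleton, not the actual outline; it assumes the warp_findmax helper sketched earlier and leaves the shared-memory and atomic stages as comments):

```julia
# out_val / out_idx are one-element device arrays, `a` a device vector.
function findmax_kernel(out_val, out_idx, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    val = i <= length(a) ? a[i] : typemin(eltype(a))
    # Step 1: warp-level shuffle reduction of (value, index) pairs.
    val, idx = warp_findmax(val, i)
    # Step 2: the first lane of each warp would write its result to
    # shared memory, and one more shuffle pass reduces the per-warp
    # results to a single block result.
    # Step 3: one thread per block publishes the block result to
    # out_val / out_idx (e.g. via atomics or a second kernel launch).
    return nothing
end
```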