Suboptimal performance in reduction

As discussed on Slack, I'm seeing poor performance from FLoops for a simple reduction (a multithreaded Monte Carlo estimate of π).

using BenchmarkTools, FLoops

function estimate_pi_floop_1(attempts)
    hits = 0
    @floop for i in 1:attempts
        x = rand()
        y = rand()
        if (x^2 + y^2) <= 1
            @reduce(hits += 1)
        end
    end
    return 4.0 * (hits / attempts)
end

function estimate_pi_floop_2(attempts)
    hits = 0
    @floop for i in 1:attempts
        x = rand()
        y = rand()
        if (x^2 + y^2) <= 1
            @reduce(hits = 0 + 1)
        end
    end
    return 4.0 * (hits / attempts)
end

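function estimate_pi_threads_partitioned(attempts)
    nt = Threads.nthreads()
    # Round up so that every chunk gets the same number of attempts.
    attempts_per_thread = cld(attempts, nt)
    hits = zeros(Int, nt)
    Threads.@threads for i in 1:nt
        h = 0
        for _ in 1:attempts_per_thread
            x = rand()
            y = rand()
            if (x^2 + y^2) <= 1
                h += 1
            end
        end
        # Index by iteration, not threadid(): two iterations may run on the same thread.
        hits[i] = h
    end
    # Normalize by the number of samples actually drawn.
    return 4.0 * (sum(hits) / (nt * attempts_per_thread))
end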

julia> @btime estimate_pi_floop_1(500_000_000)
  2.664 s (125000108 allocations: 1.86 GiB)

julia> @btime estimate_pi_floop_2(500_000_000)
  258.906 ms (64 allocations: 3.88 KiB)

julia> @btime estimate_pi_threads_partitioned(500_000_000)
  208.475 ms (42 allocations: 4.00 KiB)

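Using @reduce(hits = 0 + 1) instead of @reduce(hits += 1) makes a huge difference: roughly 10x faster (258.906 ms vs. 2.664 s), and the ~125 million allocations (1.86 GiB) disappear.
Even with @reduce(hits = 0 + 1), we don't reach the performance of estimate_pi_threads_partitioned (258.906 ms vs. 208.475 ms, about 24% slower), which, IIUC, should be similar to what FLoops produces under the hood.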
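For reference, here is a minimal Base-only sketch of the same partitioned reduction written with tasks instead of a shared hits array. I'm not claiming this is what FLoops generates; count_hits and estimate_pi_spawn are hypothetical names, and defaulting to one task per thread is an assumption. It might serve as another baseline to compare against.

```julia
# Hypothetical helper: count how many of `n` random points fall inside the unit circle.
function count_hits(n)
    h = 0
    for _ in 1:n
        x = rand()
        y = rand()
        h += (x^2 + y^2) <= 1   # Bool is promoted to Int
    end
    return h
end

# Hypothetical name: split `attempts` into `ntasks` chunks whose sizes sum to `attempts`,
# run each chunk in its own task, and combine the per-task counts.
function estimate_pi_spawn(attempts; ntasks = Threads.nthreads())
    base, extra = divrem(attempts, ntasks)
    tasks = map(1:ntasks) do i
        n = base + (i <= extra ? 1 : 0)   # the first `extra` chunks get one more attempt
        Threads.@spawn count_hits(n)
    end
    hits = sum(t -> fetch(t)::Int, tasks)
    return 4.0 * (hits / attempts)
end
```

Each task keeps its counter in a local variable and returns it, so there is no indexing by thread ID and no shared array to write to.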

Thanks for taking a look!

JuliaFolds / FLoops.jl

Suboptimal performance in reduction #126