As discussed on Slack, I get really bad (or at best suboptimal) performance with FLoops for a simple reduction (a multithreaded Monte Carlo estimate of π).
using FLoops
using BenchmarkTools

function estimate_pi_floop_1(attempts)
    hits = 0
    @floop for i in 1:attempts
        x = rand()
        y = rand()
        if (x^2 + y^2) <= 1
            # op-assign form: no explicit initial value is given
            @reduce(hits += 1)
        end
    end
    return 4.0 * (hits / attempts)
end
function estimate_pi_floop_2(attempts)
    hits = 0
    @floop for i in 1:attempts
        x = rand()
        y = rand()
        if (x^2 + y^2) <= 1
            # same reduction, but written with an explicit initial value of 0
            @reduce(hits = 0 + 1)
        end
    end
    return 4.0 * (hits / attempts)
end
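function estimate_pi_threads_partitioned(attempts)
    nt = Threads.nthreads()
    # Round up so that at least `attempts` samples are drawn in total.
    attempts_per_thread = cld(attempts, nt)
    hits = zeros(Int, nt)
    Threads.@threads for i in 1:nt
        h = 0
        for _ in 1:attempts_per_thread
            x = rand()
            y = rand()
            if (x^2 + y^2) <= 1
                h += 1
            end
        end
        # Index by the loop variable: threadid() is not a stable slot index
        # when tasks can migrate between threads.
        hits[i] = h
    end
    # Divide by the number of samples actually drawn.
    return 4.0 * (sum(hits) / (nt * attempts_per_thread))
end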
julia> @btime estimate_pi_floop_1(500_000_000)
2.664 s (125000108 allocations: 1.86 GiB)
julia> @btime estimate_pi_floop_2(500_000_000)
258.906 ms (64 allocations: 3.88 KiB)
julia> @btime estimate_pi_threads_partitioned(500_000_000)
208.475 ms (42 allocations: 4.00 KiB)
So:
- Using @reduce(hits = 0 + 1) instead of @reduce(hits += 1) makes a huge difference (about 10x in run time here, and the allocations drop from 125000108 to 64).
- Even when using the former, we don't get the performance of estimate_pi_threads_partitioned (which, IIUC, should be similar to what FLoops should produce under the hood).
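For reference, here is a minimal Base-only sketch of the kind of chunked reduction I have in mind when I say "under the hood". It is only an illustration of the pattern (one task per chunk, each task keeps a local counter, partial counts are combined at the end) and is not taken from the FLoops implementation. The names `count_hits` and `estimate_pi_spawn` and the `ntasks` keyword are made up for this example.

```julia
# Hypothetical helper: count how many of `n` random points fall inside
# the unit quarter circle, using the default task-local RNG.
function count_hits(n)
    h = 0
    for _ in 1:n
        x = rand()
        y = rand()
        h += (x^2 + y^2) <= 1   # Bool is promoted to Int when added
    end
    return h
end

# Hypothetical name: split `attempts` into `ntasks` chunks, run one task per
# chunk, and sum the partial counts returned by the tasks.
function estimate_pi_spawn(attempts; ntasks = Threads.nthreads())
    base, extra = divrem(attempts, ntasks)
    tasks = map(1:ntasks) do t
        n = base + (t <= extra)  # spread the remainder over the first tasks
        Threads.@spawn count_hits(n)
    end
    hits = sum(fetch, tasks)     # no shared array, no threadid() indexing
    return 4.0 * (hits / attempts)
end
```

Calling `estimate_pi_spawn(500_000_000)` should be directly comparable to the three functions above, since it draws exactly `attempts` samples and each task only touches its own local counter.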
Thanks for taking a look!