Closed baggepinnen closed 4 years ago
Are you not comparing CPU execution for n=10000 (47 seconds) with GPU execution of n=100_000 (19 seconds)? Comparing with the following:
```julia
function main(N=100_000, tspan=(0.0f0, 100f0))
    @time naive_mc(tspan, N)
    CuArrays.reclaim()
    CuArrays.@time sim((x,y)->MonteCarloMeasurements.CuParticles(fill(x,N) .+ y .* randn.()), tspan)
    return
end
```
I get:
```julia
julia> main(100_000)
 31.384719 seconds (36.60 M allocations: 2.816 GiB, 1.06% gc time)
  5.914546 seconds (18.54 M CPU allocations: 16.603 GiB, 12.21% gc time) (376.35 k GPU allocations: 273.562 GiB, 7.35% gc time of which 17.34% spent allocating)

julia> main(1_000_000)
319.073272 seconds (366.00 M allocations: 28.163 GiB, 0.97% gc time)
 48.553030 seconds (19.81 M CPU allocations: 160.754 GiB, 9.64% gc time) (376.59 k GPU allocations: 2.673 TiB, 1.00% gc time of which 7.64% spent allocating)
```
Of course, this is on a RTX 5000 with 16GB of memory while your GPU only has 2GB. Trying to replicate that:
```julia
$ CUARRAYS_MEMORY_LIMIT="2000000000" julia

julia> main(100_000)
  5.675215 seconds (18.43 M CPU allocations: 16.617 GiB, 9.03% gc time) (376.59 k GPU allocations: 273.741 GiB, 18.33% gc time of which 7.87% spent allocating)

julia> main(1_000_000)
 51.856641 seconds (29.95 M CPU allocations: 161.215 GiB, 11.74% gc time) (376.59 k GPU allocations: 2.673 TiB, 6.12% gc time of which 1.17% spent allocating)
```
This still looks OK, 6x improvement across the board... Which version of Julia/CuArrays are you using?
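For reference, the Julia and package versions in use can be reported straight from the REPL using only the stdlib `Pkg`; a minimal sketch:

```julia
# Report the running Julia version and the versions of installed packages
# (Pkg.status lists the active project's dependencies, e.g. CuArrays).
using Pkg
println("Julia ", VERSION)
Pkg.status()
```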
You can also try the split allocator, but it doesn't seem to perform better on this workload:
```julia
$ CUARRAYS_MEMORY_LIMIT="2000000000" CUARRAYS_MEMORY_POOL=split julia

julia> main(100_000)
  6.613166 seconds (21.05 M CPU allocations: 16.758 GiB, 9.77% gc time) (376.59 k GPU allocations: 273.741 GiB, 10.65% gc time of which 11.62% spent allocating)

julia> main(1_000_000)
 65.764215 seconds (32.53 M CPU allocations: 161.358 GiB, 29.77% gc time) (376.59 k GPU allocations: 2.673 TiB, 28.68% gc time of which 6.69% spent allocating)
```
Finally, also try CuArrays 1.6.1 (release incoming) with Julia 1.4 or master, the GC interactions there are much less costly.
Wow, that's quite a bit better. My timings were actually correct and fair, but the code I posted had an incorrect number for `naive_mc`.
Trying CuArrays#master on Julia nightly, I actually get slightly slower performance:
```julia
main(10_000)
println()
main(100_000)
println()
main(1_000_000)
```

```julia
  9.387509 seconds (27.63 M CPU allocations: 4.926 GiB, 4.65% gc time) (367.88 k GPU allocations: 27.409 GiB, 9.10% gc time of which 16.30% spent allocating)
  4.415120 seconds (3.71 M allocations: 289.459 MiB, 0.80% gc time)

 17.178087 seconds (29.71 M CPU allocations: 33.135 GiB, 7.82% gc time) (376.35 k GPU allocations: 273.562 GiB, 5.09% gc time of which 12.56% spent allocating)
 43.949634 seconds (37.10 M allocations: 2.827 GiB, 0.85% gc time)

405.047600 seconds (29.89 M CPU allocations: 321.338 GiB, 69.88% gc time) (376.59 k GPU allocations: 2.673 TiB, 81.28% gc time of which 0.17% spent allocating)
```
I should note that I'm running a 4K monitor on the same GPU, so it actually only has about 1 GiB free for compute. In either case, a beefier GPU seems to provide an OK speedup, and I have some updating to do in my paper :)
> 9.387509 seconds (27.63 M CPU allocations: 4.926 GiB, 4.65% gc time) (367.88 k GPU allocations: 27.409 GiB, 9.10% gc time of which 16.30% spent allocating)
> 4.415120 seconds (3.71 M allocations: 289.459 MiB, 0.80% gc time)
Is that after warm-up? But yeah, 1GB is not much so it's expected to put a lot of pressure on the GC.
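Warm-up matters here because the first call to a Julia function also pays its JIT-compilation cost, which can dominate a one-shot timing. A minimal pattern (the `benchmark` helper and `work` workload are hypothetical, not from the code above) is to run the workload once before measuring:

```julia
# First call compiles; the second call measures steady-state time.
function benchmark(f, args...)
    f(args...)                  # warm-up: triggers compilation
    stats = @timed f(args...)   # @timed returns time, allocations, gc time
    return stats.time
end

# Trivial stand-in workload for demonstration:
work(n) = sum(sqrt.(abs.(randn(n))))
t = benchmark(work, 10_000)
```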
Closing this assuming it isn't a problem. Please reopen if you expect this benchmark to perform well on this limited amount of memory.
As discussed on Slack, here is a benchmark that solves an ODE where each operation is carried out on a `CuArray`.
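To illustrate the structure of such a benchmark (the actual code is not shown here), the sketch below does an explicit-Euler ODE solve where every operation is a whole-array broadcast; with CuArrays loaded, constructing `u0` as a GPU array would run the same fused broadcasts on the device. All names here are hypothetical stand-ins, not the benchmark's code:

```julia
# Explicit Euler where each step is one fused elementwise broadcast;
# on a CuArray the broadcast becomes a single GPU kernel launch.
function euler_solve(du, u0, tspan; dt=0.01f0)
    u = copy(u0)
    t = first(tspan)
    while t < last(tspan)
        u .+= dt .* du(u, t)   # in-place, elementwise update
        t += dt
    end
    return u
end

u0 = fill(1.0f0, 1000)         # replace with a GPU array to run on device
decay(u, t) = -u               # simple linear-decay right-hand side
u = euler_solve(decay, u0, (0.0f0, 1.0f0))
```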