GC dominates the computation time of adjoint NFFT with large number of samples

JakobAsslaender commented 2 years ago

Hi, in v0.12, the adjoint NFFT (using mul!) is dominated by garbage collection when the number of samples is large. Consider the following MWE:

using NFFT
using BenchmarkTools
using FFTW

##
Nsamples = 1_000
p = NFFT.NFFTPlan(rand(3,Nsamples) .- 1/2, (128,128,128); fftflags = FFTW.MEASURE)

x = zeros(ComplexF64,128,128,128,20)
d = randn(ComplexF64, Nsamples, 20)

pa = adjoint(p)
function test(x,p,d)
    img_idx = CartesianIndices((128,128,128))
    for i = 1:20
        @views mul!(x[img_idx,i], p, d[:,i])
    end
end
@benchmark test($x,$pa,$d)

has the following timing on our server:

BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 7.928 s (0.00% GC) to evaluate,
 with a memory estimate of 19.61 MiB, over 328162 allocations.

However, when I increase Nsamples:

Nsamples = 1_000_000
p = NFFT.NFFTPlan(rand(3,Nsamples) .- 1/2, (128,128,128); fftflags = FFTW.MEASURE)

x = zeros(ComplexF64,128,128,128,20)
d = randn(ComplexF64, Nsamples, 20)

pa = adjoint(p)
function test(x,p,d)
    img_idx = CartesianIndices((128,128,128))
    for i = 1:20
        @views mul!(x[img_idx,i], p, d[:,i])
    end
end
@benchmark test($x,$pa,$d)

GC takes up a substantial portion of the computation:

BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 37.509 s (24.52% GC) to evaluate,
 with a memory estimate of 6.01 GiB, over 61335769 allocations.

This becomes even more dramatic when multi-threading the individual NFFTs:

pv = [copy(p) for _ = 1:20]
function test(x,p,d)
    img_idx = CartesianIndices((128,128,128))
    Threads.@threads for i = 1:20
        @views mul!(x[img_idx,i], adjoint(p[i]), d[:,i])
    end
end

@benchmark test($x,$pv,$d)

BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 10.445 s (71.74% GC) to evaluate,
 with a memory estimate of 5.97 GiB, over 60074760 allocations.

This is on a 40-core machine. Any idea what is going on here?

JakobAsslaender commented 2 years ago

To add, this problem did not occur in v0.7:

using NFFT
using BenchmarkTools
using FFTW

## no issues here
Nsamples = 1_000
p = NFFT.NFFTPlan(rand(3,Nsamples) .- 1/2, (128,128,128); fftflags = FFTW.MEASURE)

x = zeros(ComplexF64,128,128,128,20)
d = randn(ComplexF64, Nsamples, 20)

function test(x,p,d)
    img_idx = CartesianIndices((128,128,128))
    for i = 1:20
        @views nfft_adjoint!(p, d[:,i], x[img_idx,i])
    end
end
@benchmark test($x,$p,$d)

BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 8.201 s (0.00% GC) to evaluate,
 with a memory estimate of 1.69 MiB, over 55095 allocations.

Nsamples = 1_000_000
p = NFFT.NFFTPlan(rand(3,Nsamples) .- 1/2, (128,128,128); fftflags = FFTW.MEASURE)

x = zeros(ComplexF64,128,128,128,20)
d = randn(ComplexF64, Nsamples, 20)

function test(x,p,d)
    img_idx = CartesianIndices((128,128,128))
    for i = 1:20
        @views nfft_adjoint!(p, d[:,i], x[img_idx,i])
    end
end
@benchmark test($x,$p,$d)

BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 112.958 s (0.36% GC) to evaluate,
 with a memory estimate of 1.49 GiB, over 59993041 allocations.

pv = [copy(p) for _ = 1:20]
function test(x,p,d)
    img_idx = CartesianIndices((128,128,128))
    Threads.@threads for i = 1:20
        @views nfft_adjoint!(p[i], d[:,i], x[img_idx,i])
    end
end

@benchmark test($x,$pv,$d)

BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 7.988 s (0.00% GC) to evaluate,
 with a memory estimate of 1.49 GiB, over 59990225 allocations.

tknopp commented 2 years ago

The multi-threading thing is hard to judge. In 0.7 our adjoint was not multi-threaded. Now it is. So effectively you are multi-threading twice. In theory Julia should handle this just fine but apparently there still is some GC issue.

Serial performance as significantly improved or am I missing something from your numbers?

tknopp commented 2 years ago

Ok, I figured out what the reason of the allocations might be. Could you please rerun with these parameters:

p = NFFT.NFFTPlan(rand(3,Nsamples) .- 1/2, (128,128,128); m=4, σ=2.0, fftflags = FFTW.MEASURE)

The reason is this line: https://github.com/JuliaMath/NFFT.jl/blob/master/src/multidimensional.jl#L386 I figure out that for large tuples of complex numbers there is some small performance hit and the compiler generates better code when treating all as real numbers. But for small m unfortunately the complex version is faster. Without that micro optimization we were slightly slower than FINUFFT in this plot:

If you can confirm that this solves the GC issue on your side I will think how to tweak the default numbers better. Multi-threading should always use the non-allocating complex path for instance.

JakobAsslaender commented 2 years ago

Not sure I understood your explanation, but m=4, σ=2.0 solved the issue. In fact, the particular adjoint NFFT I was worried about takes 7s, which took 350s with the default settings and 135s with v0.7! 🚀

tknopp commented 2 years ago

Awesome. That's a factor of 20. I am really happy to hear this. It shows that the performance work was really worth it. Hopefully this attracts more MRI researchers.

tknopp commented 2 years ago

I changed the default parameters such that the special handling is not done in multi-threading scenarios

JakobAsslaender commented 2 years ago

Great, thanks!

JuliaMath / NFFT.jl

GC dominates the computation time of adjoint NFFT with large number of samples #92