To add, this problem did not occur in v0.7:
using NFFT
using BenchmarkTools
using FFTW
## no issues here
Nsamples = 1_000
p = NFFT.NFFTPlan(rand(3,Nsamples) .- 1/2, (128,128,128); fftflags = FFTW.MEASURE)
x = zeros(ComplexF64,128,128,128,20)
d = randn(ComplexF64, Nsamples, 20)
function test(x,p,d)
    img_idx = CartesianIndices((128,128,128))
    for i = 1:20
        @views nfft_adjoint!(p, d[:,i], x[img_idx,i])
    end
end
@benchmark test($x,$p,$d)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 8.201 s (0.00% GC) to evaluate,
with a memory estimate of 1.69 MiB, over 55095 allocations.
Nsamples = 1_000_000
p = NFFT.NFFTPlan(rand(3,Nsamples) .- 1/2, (128,128,128); fftflags = FFTW.MEASURE)
x = zeros(ComplexF64,128,128,128,20)
d = randn(ComplexF64, Nsamples, 20)
function test(x,p,d)
    img_idx = CartesianIndices((128,128,128))
    for i = 1:20
        @views nfft_adjoint!(p, d[:,i], x[img_idx,i])
    end
end
@benchmark test($x,$p,$d)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 112.958 s (0.36% GC) to evaluate,
with a memory estimate of 1.49 GiB, over 59993041 allocations.
pv = [copy(p) for _ = 1:20]
function test(x,p,d)
    img_idx = CartesianIndices((128,128,128))
    Threads.@threads for i = 1:20
        @views nfft_adjoint!(p[i], d[:,i], x[img_idx,i])
    end
end
@benchmark test($x,$pv,$d)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 7.988 s (0.00% GC) to evaluate,
with a memory estimate of 1.49 GiB, over 59990225 allocations.
The multi-threading case is hard to judge. In 0.7 our adjoint was not multi-threaded; now it is, so effectively you are multi-threading twice. In theory Julia should handle this just fine, but apparently there is still some GC issue.
Serial performance has improved significantly, though, or am I missing something in your numbers?
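To illustrate the "multi-threading twice" point, here is a minimal, self-contained sketch (not NFFT.jl code): an outer @threads loop over frames whose body itself runs a threaded kernel. Julia's scheduler composes the nested tasks, but any per-iteration allocations inside the inner kernel become GC pressure shared across all threads.
using Base.Threads
## inner kernel: itself threaded, like the new threaded adjoint
function inner_kernel!(out)
    @threads for i in eachindex(out)
        out[i] = sin(i) + cos(i)
    end
end
## outer loop: threaded over frames, like the per-frame loop above
function outer!(buffers)
    @threads for j in eachindex(buffers)
        inner_kernel!(buffers[j])
    end
end
buffers = [zeros(1_000) for _ in 1:20]
outer!(buffers)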
OK, I think I figured out what the reason for the allocations might be. Could you please rerun with these parameters:
p = NFFT.NFFTPlan(rand(3,Nsamples) .- 1/2, (128,128,128); m=4, σ=2.0, fftflags = FFTW.MEASURE)
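For completeness, a full rerun of the 1-million-sample case above with these keyword arguments would look roughly like this (a sketch, reusing the serial test function defined earlier):
Nsamples = 1_000_000
p = NFFT.NFFTPlan(rand(3,Nsamples) .- 1/2, (128,128,128); m=4, σ=2.0, fftflags = FFTW.MEASURE)
x = zeros(ComplexF64,128,128,128,20)
d = randn(ComplexF64, Nsamples, 20)
@benchmark test($x,$p,$d)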
The reason is this line:
https://github.com/JuliaMath/NFFT.jl/blob/master/src/multidimensional.jl#L386
I found that for large tuples of complex numbers there is a small performance hit, and the compiler generates better code when treating everything as real numbers. Unfortunately, for small m the complex version is faster. Without that micro-optimization we were slightly slower than FINUFFT in the benchmark plot.
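To sketch the real-vs-complex code-generation point (an illustration only, not the actual NFFT.jl kernel): the same weighted accumulation can be written over ComplexF64 values or over the reinterpreted real/imaginary parts, and which one the compiler vectorizes better can depend on the window size m.
## illustration only: complex accumulation vs. accumulation over the
## reinterpreted real view; both give the same result
function accum_complex(c::Vector{ComplexF64}, w::Vector{Float64})
    s = zero(ComplexF64)
    @inbounds @simd for i in eachindex(w)
        s += w[i] * c[i]
    end
    return s
end
function accum_real(c::Vector{ComplexF64}, w::Vector{Float64})
    r = reinterpret(Float64, c)   ## interleaved re/im view
    sre = 0.0; sim = 0.0
    @inbounds @simd for i in eachindex(w)
        sre += w[i] * r[2i-1]
        sim += w[i] * r[2i]
    end
    return complex(sre, sim)
end
c = randn(ComplexF64, 8); w = randn(8)
accum_complex(c, w) ≈ accum_real(c, w)  ## true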
If you can confirm that this solves the GC issue on your side, I will think about how to tweak the default parameters. For instance, multi-threading should always use the non-allocating complex path.
Not sure I understood your explanation, but m=4, σ=2.0 solved the issue. In fact, the particular adjoint NFFT I was worried about now takes 7 s, compared to 350 s with the default settings and 135 s with v0.7! 🚀
Awesome. That's a factor of 20. I am really happy to hear this. It shows that the performance work was really worth it. Hopefully this attracts more MRI researchers.
I changed the default parameters so that the special handling is not applied in multi-threading scenarios.
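As a rough sketch of that idea (hypothetical names and threshold, not the actual NFFT.jl implementation or its defaults): the parameter selection could simply branch on the number of threads, so that threaded runs always land on the non-allocating complex path.
## hypothetical sketch of the idea, not the actual NFFT.jl code or its defaults
function use_real_micro_optimization(m)
    ## only take the allocating real-valued fast path when running single-threaded
    return Threads.nthreads() == 1 && m ≤ 3   ## threshold is made up for illustration
end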
Great, thanks!
Hi, in v0.12 the adjoint NFFT (using mul!) is dominated by garbage collection when the number of samples is large. Consider the following MWE, which has the following timing on our server:
However, when I increase Nsamples, GC takes up a substantial portion of the computation:
This becomes even more dramatic when multi-threading the individual NFFTs:
This is on a 40-core machine. Any idea what is going on here?
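For context, a minimal sketch of the kind of call in question (not the original MWE from this report; names and sizes are illustrative):
using NFFT, LinearAlgebra
Nsamples = 1_000
p = plan_nfft(rand(3, Nsamples) .- 1/2, (128, 128, 128))
d = randn(ComplexF64, Nsamples)
x = zeros(ComplexF64, 128, 128, 128)
mul!(x, adjoint(p), d)   ## adjoint NFFT, writes the gridded result into x in place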