Closed — robertdj closed this issue 2 years ago
Nice, it's funny seeing this stuff landing in this package. I worked on the original multithreading code in Julia: https://github.com/JuliaLang/julia/pull/6741
Regarding FFTW: you just have to call `FFTW.set_num_threads`.
I didn't realize that multithreading in Julia was so old!
Would it make sense to add

```julia
FFTW.set_num_threads(Threads.nthreads())
```

in NFFT?
Yes, this sounds good. Although this probably needs a more general solution, since it's always a little bit bad if we change global state in a user library. Maybe

```julia
if Threads.nthreads() > 1
    FFTW.set_num_threads(Threads.nthreads())
end
```
is a little safer, since we would not set it to 1 if a user runs Julia in serial mode but has set FFTW to be parallel.
Good point with the global state.
There's not much documentation for `FFTW.set_num_threads`, but do you know if it can be set only for the NFFT module?
No, that's not possible. It's a global value in libfftw. There currently seems to be no getter either, so we cannot know what it has been set to.
But I would not care too much. More interesting is how good your speed-ups are. A parallel nfft is quite nice and not that often seen in the wild... ;-)
I'm currently on an old computer with only 2 threads and here it's about 30% faster for a 2D example I'm testing.
More timing suggests that the threading constructs take quite some time, e.g. for the 2D case in `nfft_performance`:
Hey, that's not that bad. With multithreading it's not that simple to achieve full speedups, so this is not necessarily an issue with the Julia implementation of threads.
The issue in multithreading is usually memory accesses to similar regions from different threads, which leads to threads fighting over CPU caches.
Do we already sort the k-space nodes? Might be a good idea...
Are the k-nodes part of the `NFFTPlan`?
I meant `x`: https://github.com/tknopp/NFFT.jl/blob/master/src/NFFT.jl#L56
I usually use NFFT for MRI reconstruction, where `x` is `k` (Fourier space is named k-space).
We don't currently sort `x`. I haven't thought about this before, but indexing the input and window arrays would be more cache efficient with a suitable sorting of `x`.
Let me check how sorting affects the timing.
One problem with sorting in the plan creation is that `fHat` must be sorted similarly...
Yes, indeed. One might work with an indirect index. Might make sense to look at how they solved this in the NFFT C library:
https://github.com/NFFT/nfft/blob/9fbc5ab42fe6c690d2b82b971b3566be4e6df165/kernel/nfft/nfft.c#L111
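The indirect-index idea can be sketched in a few lines of Julia. The names here (`x`, `fHat`, `p`) are illustrative and not the NFFT.jl API:

```julia
# Sketch: sort the nodes once via an indirect index (sortperm) so that
# neighboring iterations touch neighboring memory, keeping fHat aligned.
x = [0.3, -0.4, 0.1, -0.2]          # nodes in [-0.5, 0.5)
fHat = ComplexF64[1, 2, 3, 4]       # samples associated with the nodes

p = sortperm(x)                     # indirect index: x[p] is sorted
xSorted = x[p]
fHatSorted = fHat[p]                # samples stay aligned with their nodes

# Any per-node result computed in sorted order is scattered back afterwards:
resultSorted = 2 .* fHatSorted      # stand-in for the real computation
result = similar(resultSorted)
result[p] = resultSorted            # back in the original node ordering
```

The permutation `p` is computed once in the plan creation and reused for every transform.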
I have merged your multithreaded convolve into master.
@tknopp It seems to be more difficult to utilize multithreading with `convolve_adjoint`, since we cannot work independently with each `k` iteration. Do you have an idea how to get around this?
Yes, this is a little trickier and I don't have a direct solution. However, look at this from the manual:
```julia
Threads.@threads for i = 1:10
    a[i] = Threads.threadid()
end
```
It means that you have access to the thread id. We might need some temporary array within the loop where intermediate results from the sum are stored, and then a final sequential step to combine the results. An alternative would be a lock within the loop, but I don't know if locks are exposed within the threading interface.
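The temporary-array idea can be sketched as follows; `threaded_accumulate` and its arguments are hypothetical names for illustration, not NFFT.jl code:

```julia
# Sketch: adjoint-style accumulation where several iterations may hit the
# same output bin. Each thread writes into its own scratch copy of the
# output, and a final serial pass combines them.
function threaded_accumulate(vals::Vector{Float64}, bins::Vector{Int}, n::Int)
    gtmp = [zeros(n) for _ in 1:Threads.nthreads()]  # one buffer per thread
    Threads.@threads for k in eachindex(vals)
        gtmp[Threads.threadid()][bins[k]] += vals[k] # race-free: thread-local
    end
    g = zeros(n)
    for t in 1:Threads.nthreads()                    # sequential combine step
        g .+= gtmp[t]
    end
    return g
end

g = threaded_accumulate([1.0, 2.0, 3.0], [1, 1, 2], 2)
```

The combine step is O(nthreads · n), which is usually negligible compared to the spreading loop itself.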
There is `Threads.atomic_add!` in Base. I have not played around with that, and I'm also not sure if it is actually feasible to have an array of atomics...
A possibility to speed up calculations is to use FFTW planning flags: `plan_fft!(tmpVec, flags = FFTW.MEASURE)`. This will spend more time in the planning phase though (but I think this usually isn't a problem).
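For illustration, planning with `FFTW.MEASURE` looks roughly like this. One caveat worth knowing: MEASURE overwrites the array during planning, so the data must be loaded afterwards.

```julia
using FFTW

tmpVec = zeros(ComplexF64, 256, 256)
# Planning with MEASURE is slow, but the resulting plan executes faster.
# MEASURE scribbles over tmpVec while planning, so fill in the data after.
p = plan_fft!(tmpVec; flags = FFTW.MEASURE)

data = randn(ComplexF64, 256, 256)
tmpVec .= data
p * tmpVec               # executes the in-place transform
```

Since the plan depends only on the array size and type, it can be created once in the plan-creation phase and applied many times.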
As a side note, I wouldn't force the number of FFTW threads to the number of Julia threads; the user should set them explicitly.
I agree that it would be nice to expose the features that FFTW exposes. We may do this by simply forwarding the `kargs` in the `NFFT` constructor.
Why would one use a different number of threads in the Julia multi-threading part in contrast to the FFTW multithreading part?
> Why would one use a different number of threads in the Julia multi-threading part in contrast to the FFTW multithreading part?
Well, they're different things. I guess this isn't used in NFFT, but nesting a threaded `fft` in a threaded loop (with `Threads.@threads`) leads to a segfault. Try running this code with `JULIA_NUM_THREADS=2`:
```julia
Threads.@threads for i in 1:16
    fft(rand(10))
end
```
One can start Julia with just one thread and set the FFTW threads afterwards; this is what I usually do.
Yes. This is indeed a shortcoming of the current threading system. Hopefully we get some load/work balancing at some point.
In this specific case the convolution part of the NFFT is executed via `@threads`, and the FFTW multithreading is not within a threaded loop but subsequent to it.
> I agree that it would be nice to expose the features that FFTW exposes. We may do this by simply forwarding the `kargs` in the `NFFT` constructor.
I think I can try and do this next weekend. I already did something similar in LombScargle.jl.
In Julia 1.2 I was able to use a threaded fft within threads on an old Mac with 8 threads just fine:
```julia
Threads.@threads for i in 1:16
    fft(rand(10))
end
```
So I hope to revisit a threaded version of NFFT when time permits...
Watch out for Julia 1.3. It features a very powerful new threading architecture where it will basically be possible to nest parallel loops without needing to care whether an inner loop might be parallelized. If I understand correctly, this is even integrated with C libraries such as FFTW: https://github.com/JuliaMath/FFTW.jl/pull/105.
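A toy example of the composable model (assuming Julia ≥ 1.3; `tree_sum` is just an illustration, not NFFT code):

```julia
# A task spawned with Threads.@spawn may itself spawn tasks, and the
# depth-first scheduler balances the work, so nesting "just works".
function tree_sum(a::AbstractVector{Float64})
    length(a) <= 1024 && return sum(a)
    mid = length(a) ÷ 2
    t = Threads.@spawn tree_sum(view(a, 1:mid))   # nested parallelism is fine
    right = tree_sum(view(a, mid+1:length(a)))
    return fetch(t) + right
end

s = tree_sum(ones(10_000))
```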
My most performant NFFT implementation (`precompute = FULL`) uses an ordinary sparse matrix for the "convolution" matrix. This requires the matrix-vector operation to be multi-threaded, which is discussed here: https://github.com/JuliaLang/julia/pull/29525
Hi Tobias, may I revive this thread (pun intended)? My understanding is that the FFT itself can be parallelized by (globally) setting the nthreads for the FFTW package. When doing so in my use case (using LUT), the computation time is dominated by the `convolve_adjoint!` and `apodization_adjoint!` functions. As I don't really understand the metaprogramming code, I was wondering if you could provide some insight into how difficult it would be to multithread this code.
Thanks for your efforts, much appreciated!
Are you using `precomputeB`?
If yes, it basically boils down to multithreading sparse matrix operations: https://github.com/tknopp/NFFT.jl/blob/master/src/multidimensional.jl#L112 https://github.com/tknopp/NFFT.jl/blob/master/src/multidimensional.jl#L71
A sparse matrix-vector multiplication should be straightforward. There was also a PR to Julia base for this: https://github.com/JuliaLang/julia/pull/29525
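As a sketch of what such a multithreaded sparse operation could look like, here is a threaded adjoint product `A' * x`, which is the race-free direction for CSC storage since column `j` only ever writes `y[j]`. The function name is hypothetical:

```julia
using SparseArrays

# Multithreaded y = A' * x for a SparseMatrixCSC: threading over the
# columns of A is race-free because column j only writes y[j].
function threaded_At_mul_x(A::SparseMatrixCSC, x::Vector)
    y = zeros(promote_type(eltype(A), eltype(x)), size(A, 2))
    rows = rowvals(A); vals = nonzeros(A)
    Threads.@threads for j in 1:size(A, 2)
        s = zero(eltype(y))
        for idx in nzrange(A, j)          # nonzeros of column j
            s += conj(vals[idx]) * x[rows[idx]]
        end
        y[j] = s
    end
    return y
end

A = sparse([1.0 2.0; 3.0 4.0])
y = threaded_At_mul_x(A, [1.0, 1.0])
```

The forward product `A * x` is the harder direction, since different columns scatter into the same rows of the output.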
I do not. I use each NFFT only once to calculate a Toeplitz kernel (cf. PR #57 ), so I don't think it is worth computing a sparse matrix, or is it? Yet, I create about 1000 Toeplitz kernels (each 512^3 incl. oversampling), so we are still talking ~2h computation. The creation of the LUT is quick, as is the FFT (on a 40-core machine), and the biggest bottlenecks are the convolution and the apodization, which are single-threaded:
```
[ Info: Timing: conv=1.438624615 fft=0.516005343 apod=1.054821915
```
So I would looovvvvveee a multi-threaded version of these functions. At least the apodization should be rather trivially parallel, shouldn't it?
Hey Jacob, could you provide me the exact parameters with which you ended up with those numbers? I cannot get an apodization step that long for 512^3. In contrast, `conv` is much longer in my settings.
In any case: I created the branch `tk/multithreading`, in which I set up some infrastructure to support multithreading in a way that the serial performance is not affected. Basically you can now use `@cthreads` wherever you want and it will switch to serial mode when `nthreads() == 1`. You can change it at runtime using `NFFT._use_threads[] = true/false`.
I also added the multithreaded `nfft!`, and it directly gave me a factor-of-2 speedup on my 2-core CPU. The adjoint will be trickier, however, so there we still need an efficient implementation.
I cleaned up the performance stats code a little bit, which will be useful for pushing this forward. There is also a test `nfft_performance_2` that tracks the single- vs multi-threading performance. Right now I get on my 2-core system:
```
##### nfft_performance_2 - multithreading #####
[ Info: * precomputation = LUT threading = true
Timing: pre = 0.0020 s apod = 0.0020 / 0.0044 s fft = 0.0671 / 0.0670 s conv = 0.7846 / 1.4492 s
        apod = 0.0024 / 0.0029 % fft = 0.0786 / 0.0441 % conv = 0.9191 / 0.9530 %
[ Info: * precomputation = LUT threading = false
Timing: pre = 0.0886 s apod = 0.0020 / 0.0045 s fft = 0.0659 / 0.0677 s conv = 1.5835 / 1.4377 s
        apod = 0.0012 / 0.0030 % fft = 0.0399 / 0.0448 % conv = 0.9589 / 0.9521 %
```
Oh, ok. I realize that mine is a somewhat exotic, i.e. highly undersampled, case. With an oversampled matrix size of 512^3, I have only 46k data points. I reconstruct, in the widest sense, a time series where I use correlations between the different time frames; hence the strong undersampling. But I guess this is pretty common in dynamic imaging, so a speedup at the points relevant to me could be of interest for a broader community.
I took a look at your `tk/multithreading` branch. Great that you get a good speed-up in the forward convolution! For calculating the Toeplitz kernel, however, I only use the adjoint nfft (as the forward nfft of a centered Kronecker delta is just one everywhere). So the functions for which I would like multithreading are `convolve_adjoint_LUT!` and `apodization_adjoint!`.
Yes, highly under sampled data is of course a use case!
So we need to speed up the apodization as well. We can do this with multi-threading, but probably there are some other low-hanging fruits too. I never investigated that operation because it was never slow enough.
And yes, I know that you want the adjoint to be fast. I just wanted to get started with something that is easy :-)
I changed the second performance test so that we can directly test your setup. I now get
```
[ Info: * precomputation = LUT threading = false
Timing: pre = 0.3447 s apod = 0.1277 / 0.3640 s fft = 12.2369 / 7.3867 s conv = 0.3526 / 1.3175 s
        apod = 0.0100 / 0.0401 % fft = 0.9622 / 0.8146 % conv = 0.0277 / 0.1453 %
```
for N=256 (×2 oversampling) and M = 46000. This is the single-threaded performance, so I am completely FFT dominated. And in particular, my apodization is much faster than in your setup (I am on a 7-year-old MacBook).
FYI: I incorporated multithreading in the pre-computation of the LUT in #57 .
So the FFT becomes really fast when a) having many cores (e.g. 40) and b) using the MEASURE flag. The latter allows, in my use case, a 100× speed-up compared to the ESTIMATE flag on the same cores. For this reason, I also created an in-place constructor of the NFFTPlan in #57, which allows me to measure the FFT plan once and then replace the LUT for each time frame.
Ok, I see, MEASURE was what I was missing. That indeed makes things drastically faster.
I will have a look at the multithreaded adjoint conv, and we can also multi-thread the apodizations. It would be great if you could give me performance test code similar to `nfft_performance_2` such that we can specifically test those edge cases.
I will grant you commit rights such that you can also work directly on branches here.
I added an example to the branch, but my use case with an oversampled matrix size of 512^3 is hard to emulate: a) it needs more memory than available on some systems (19 GB), and b) the FFT time is dominant unless you have many CPUs.
Ok, thanks, we need to tackle this step by step.
In the meantime I implemented a "straightforward" multi-threading implementation here: https://github.com/tknopp/NFFT.jl/blob/tk/multithreading/src/multidimensional.jl#L85
Obviously the allocating part needs to be moved into the planning phase, so this is just experimenting with it. My conclusions are:
So I fear that this is not completely trivial. 1. might be fixable by sorting the nodes. 2. depends a little bit on your use case. Are you using the same plan several times and just exchanging nodes? Then my implementation could already work. We just need to pull `g_temp` into the plan.
Another idea would be to do some research on how others have tackled this. There are several GPU implementations out there and I think also the NFFT C library is multi-threaded.
I should have written that before trying to tackle this...
We need the algorithm sketched on pages 13/14 but it doesn't look so easy that one implements that before breakfast ;-)
Yeah, those zero fills can be nasty. That is exactly what I fixed with my first tiny test-balloon PR in the constructor.
Uff, yeah, have a big breakfast first! I see that there is a Julia interface to the NFFT3 package and that you are named as a contributor. Do you know the advantages / shortcomings of this C library for MR image reconstruction? I have never used it...
There is no real downside to NFFT3, and there is even a Julia wrapper for it: https://github.com/NFFT/NFFT3.jl
I started NFFT.jl as an experiment to test the language and its claim to be as fast as C. Apparently the experiment was somewhat successful. Both libraries don't have the same features but IMHO it is a huge advantage that the Julia community has its own pure NFFT implementation that remains hackable to people not knowing the NFFT3 code base by heart.
But back to topic: I implemented on the `tk/multithreading` branch two new implementations for `apodization!` and `apodization_adjoint!`. I switched to Float32, which I would recommend you do as well, and the runtime seems to be reduced by a factor of 2-4. This is still without multi-threading. Multi-threading these operations is feasible but not so simple using the Cartesian macros. In my updated version I already peel off the inner for loop; for proper multi-threading I would need to peel off the outer for loop as well. Not sure how to do that though.
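Without the Cartesian macros, the loop-peeling idea for a 2D apodization can be sketched like this. `windHatInv` stands in for a precomputed 1/ϕ̂ deapodization correction; none of these are the NFFT.jl field names, and the real code places the data corner-wise with wrap-around rather than centered:

```julia
# Threaded 2D apodization sketch: scale the N×N input by a separable
# correction and copy it into the center of the oversampled n×n grid.
function apodize!(g::Matrix{ComplexF64}, f::Matrix{ComplexF64},
                  windHatInv::Vector{Float64})
    N = size(f, 1); n = size(g, 1); off = (n - N) ÷ 2
    fill!(g, 0)
    Threads.@threads for j in 1:N      # outer loop peeled off and threaded
        for i in 1:N                   # inner loop stays serial per thread
            g[off+i, off+j] = f[i, j] * windHatInv[i] * windHatInv[j]
        end
    end
    return g
end

f = ones(ComplexF64, 2, 2)
g = apodize!(zeros(ComplexF64, 4, 4), f, [1.0, 1.0])
```

Each thread owns a disjoint set of columns of `g`, so no synchronization is needed.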
Update: I finally managed to also multithread the apodization function, which will give you an additional speedup. I only have two cores right now, so I cannot test it at larger scale. But I hope that with this the apodization is no longer the bottleneck in your benchmarks.
Note to myself: We will use the multi-threading strategy for the adjoint convolve described in this paper: https://arxiv.org/abs/1808.06736
Done! We now use a similar blocking strategy as outlined in FINUFFT, which is actually quite similar to NFFT3. I did not go into the complete details, but if I understand it correctly, the difference is just that NFFT3 does blocking on the linearized memory, while FINUFFT blocks in both directions.
The nice thing is that blocking even helps in the non-threaded case, so we can use it as the default path. Here are some preliminary benchmarks. They still have some smaller problems, but I don't expect larger deviations anymore. For NFFT3 I use PRE_PSI instead of PRE_LIN_PSI, which gives a faster nfft at the cost of slightly longer precomputation.
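As a rough 1D illustration of the blocking idea (a box window stands in for the real window function, and all names are hypothetical, not the actual NFFT.jl implementation):

```julia
# Bin the nodes by grid block, let each block spread into a private buffer
# (block width + window overlap m on each side), then add the buffers into
# the global grid serially so the overlaps never race.
function blocked_spread(x::Vector{Float64}, fHat::Vector{ComplexF64},
                        n::Int, m::Int, nblocks::Int)
    bsize = cld(n, nblocks)
    nearest(xk) = clamp(floor(Int, (xk + 0.5) * n) + 1, 1, n)
    bins = [Int[] for _ in 1:nblocks]          # node indices per block
    for k in eachindex(x)
        push!(bins[clamp(cld(nearest(x[k]), bsize), 1, nblocks)], k)
    end
    buffers = [zeros(ComplexF64, bsize + 2m) for _ in 1:nblocks]
    Threads.@threads for b in 1:nblocks
        lo = (b - 1) * bsize                   # global offset of this block
        for k in bins[b], l in max(nearest(x[k]) - m, 1):min(nearest(x[k]) + m, n)
            buffers[b][l - lo + m] += fHat[k]  # writes stay in the private buffer
        end
    end
    g = zeros(ComplexF64, n)
    for b in 1:nblocks, i in 1:(bsize + 2m)    # serial combine of the overlaps
        l = (b - 1) * bsize + i - m
        1 <= l <= n && (g[l] += buffers[b][i])
    end
    return g
end

g = blocked_spread([0.0], ComplexF64[1.0], 8, 1, 2)
```

The per-block buffers are small (block width plus the window support), so memory stays proportional to the grid, not to the number of threads times the grid.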
@tknopp: I have finally gotten around to parallelizing parts of NFFT. With the new `Threads.@threads`, the `convolve!` function uses multithreading -- check my branch. Compare a profiling of the old code (the purple part on the same level as the highlighted `convolve!` is `fft`) with the new code (run with `export JULIA_NUM_THREADS=2`): the (regular) FFT call now consumes a significant proportion of the time.
Recalling stevengj's comments in https://github.com/tknopp/NFFT.jl/issues/17, FFTW 3.3.5 is needed to multithread `fft`, and this version is used in Julia v0.6. But I can't find info on how to use multithreading for `fft`.