JuliaMath / NFFT.jl

Julia implementation of the Non-equidistant Fast Fourier Transform (NFFT)

Multithreading #19

Closed: robertdj closed this issue 2 years ago

robertdj commented 7 years ago

@tknopp: I have finally gotten around to parallelizing parts of NFFT. With the new Threads.@threads the convolve! function uses multithreading -- check my branch.

Compare a profile of the old code (the purple part on the same level as the highlighted convolve! is fft): [profile screenshot: no_threads]

With the new code (with export JULIA_NUM_THREADS=2): [profile screenshot: with_threads]

The (regular) FFT call now consumes a significant proportion of the time.

Recalling stevengj's comments in https://github.com/tknopp/NFFT.jl/issues/17: FFTW 3.3.5 is needed to multithread fft, and that version is used in Julia v0.6. But I can't find information on how to enable multithreading for fft.

tknopp commented 7 years ago

Nice, it's funny seeing this stuff landing in this package. I worked on the original multithreading code in Julia: https://github.com/JuliaLang/julia/pull/6741

Regarding FFTW: You just have to call FFTW.set_num_threads.
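For example (the thread count of 4 is just illustrative):

using FFTW
FFTW.set_num_threads(4)  # subsequent FFTW plans may use up to 4 threads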

robertdj commented 7 years ago

I didn't realize that multithreading in Julia was so old!

Would it make sense to add

FFTW.set_num_threads(Threads.nthreads())

in NFFT?

tknopp commented 7 years ago

Yes, this sounds good. Although this probably needs a more general solution, since it's always a little bit bad if we change global state in a user library. Maybe

if Threads.nthreads() > 1
  FFTW.set_num_threads(Threads.nthreads())
end

is a little safer, since we would not reset it to 1 if a user runs Julia in serial mode but has configured FFTW to be parallel.

robertdj commented 7 years ago

Good point with the global state.

There's not much documentation for FFTW.set_num_threads, but do you know if it can be set only for the NFFT module?

tknopp commented 7 years ago

No, that's not possible. It's a global value in libfftw. There currently seems to be no getter either, so we cannot know what it has been set to.

But I would not care too much. More interesting is how good your speedups are. A parallel nfft is quite nice and not often seen in the wild... ;-)

robertdj commented 7 years ago

I'm currently on an old computer with only 2 threads and here it's about 30% faster for a 2D example I'm testing.

robertdj commented 7 years ago

More timing suggests that the threading constructs themselves take quite some time, e.g. for the 2D case in nfft_performance:

tknopp commented 7 years ago

Hey, that's not that bad. In multithreading it's not that simple to achieve full speedups, so this is not necessarily an issue with the Julia implementation of threads.

The issue in multithreading is usually memory accesses to similar regions from different threads. This leads to threads fighting over CPU caches.

Do we already sort the k-space nodes? Might be a good idea...

robertdj commented 7 years ago

Are the k-nodes part of the NFFTPlan?

tknopp commented 7 years ago

I meant x

https://github.com/tknopp/NFFT.jl/blob/master/src/NFFT.jl#L56

I usually use the NFFT for MRI reconstruction, where x is k (Fourier space is called k-space).

robertdj commented 7 years ago

We don't currently sort x. I haven't thought about this before, but indexing the input and window arrays would be more cache efficient with a suitable sorting of x. Let me check how sorting affects the timing.

robertdj commented 7 years ago

One problem with sorting in the plan creation is that fHat must be sorted similarly...

tknopp commented 7 years ago

Yes indeed. One might go with an indirect index. It might make sense to look at how they solved this in the NFFT C library:

https://github.com/NFFT/nfft/blob/9fbc5ab42fe6c690d2b82b971b3566be4e6df165/kernel/nfft/nfft.c#L111
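For illustration, the indirect-index idea could look roughly like this (a hypothetical sketch, assuming x is a d×M node matrix and fHat the length-M sample vector):

d, M = 2, 1000
x = rand(d, M) .- 0.5              # stand-in for the sampling nodes
fHat = rand(ComplexF64, M)         # stand-in for the samples

perm = sortperm(x[1, :])           # indirect index: permutation that sorts the nodes
x_sorted = x[:, perm]              # nodes reordered for better cache locality
fHat_sorted = fHat[perm]           # samples reordered with the same permutation
# ... run the NFFT on x_sorted / fHat_sorted; undo with invperm(perm) if needed ...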

tknopp commented 7 years ago

I have merged your multithreaded convolve into master.

robertdj commented 7 years ago

@tknopp It seems to be more difficult to utilize multithreading with convolve_adjoint since we cannot work independently with each k iteration. Do you have an idea how to get around this?

tknopp commented 7 years ago

Yes, this is a little trickier and I don't have a direct solution. However, look at this from the manual:

a = zeros(Int, 10)  # preallocated result, one slot per iteration
Threads.@threads for i = 1:10
    a[i] = Threads.threadid()
end

It means that you have access to the thread id. We might need some temporary array within the loop where intermediate results of the sum are stored, and then a final sequential step to combine the results. An alternative would be a lock within the loop, but I don't know if locks are exposed in the threading interface.
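A toy model of that idea (all names are hypothetical stand-ins, not actual NFFT code):

using Base.Threads

M, n = 10_000, 64
fHat = rand(ComplexF64, M)
targets = rand(1:n, M)                        # stand-in for the window indices

# one accumulation buffer per thread, so no two threads write the same memory;
# :static pins iterations to threads so threadid() is stable per iteration
partial = [zeros(ComplexF64, n) for _ in 1:nthreads()]
@threads :static for k in 1:M
    partial[threadid()][targets[k]] += fHat[k]
end
g = sum(partial)                              # final sequential combine step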

tknopp commented 7 years ago

There is Threads.atomic_add! in Base. I have not played around with that. Also not sure if it is actually feasible to have an array of atomics...

giordano commented 7 years ago

A possibility to speed up calculations is to use FFTW planning flags: plan_fft!(tmpVec, flags = FFTW.MEASURE). This will spend more time in the planning phase, though (but I think this usually isn't a problem).
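For example (the buffer size is just illustrative):

using FFTW

tmpVec = zeros(ComplexF64, 256, 256)
p = plan_fft!(tmpVec; flags = FFTW.MEASURE)  # slower planning, faster execution
p * tmpVec                                   # in-place FFT using the measured plan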

As a side note, I wouldn't force the number of FFTW threads to the number of Julia threads; the user should set them explicitly.

tknopp commented 7 years ago

I agree that it would be nice to expose the features that FFTW offers. We could do this by simply forwarding the kwargs in the NFFT constructor.

Why would one use a different number of threads in the Julia multi-threading part in contrast to the FFTW multithreading part?

giordano commented 7 years ago

Why would one use a different number of threads in the Julia multi-threading part in contrast to the FFTW multithreading part?

Well, they're different things. I guess this isn't used in NFFT, but nesting a threaded fft in a threaded loop (with Threads.@threads) leads to a segfault. Try running this code with JULIA_NUM_THREADS=2

Threads.@threads for i in 1:16
    fft(rand(10))
end

One can start Julia with just one thread and set the FFTW threads afterwards; this is what I usually do.

tknopp commented 7 years ago

Yes. This is indeed a shortcoming of the current threading system. Hopefully we get some load/work balancing at some point. In this specific case the convolution part of the NFFT is executed via @threads, and the FFTW multithreading happens not within that threaded loop but subsequent to it.

giordano commented 7 years ago

I agree that it would be nice to expose the features that FFTW offers. We could do this by simply forwarding the kwargs in the NFFT constructor.

I think I can try and do this next weekend. I already did something similar in LombScargle.jl.

JeffFessler commented 4 years ago

In Julia 1.2 I was able to use threaded fft within threads on an old Mac with 8 threads just fine:

Threads.@threads for i in 1:16
    fft(rand(10))
end

So I hope to revisit a threaded version of NFFT when time permits...

tknopp commented 4 years ago

Watch out for Julia 1.3. It features a very powerful new threading architecture where it will basically be possible to nest parallel loops without needing to care whether an inner loop might be parallelized. If I understand correctly, this is even integrated with C libraries such as FFTW: https://github.com/JuliaMath/FFTW.jl/pull/105.

My most performant NFFT implementation (precompute = FULL) uses an ordinary sparse matrix for the "convolution" matrix. This requires the matrix-vector operation to be multi-threaded, which is discussed here: https://github.com/JuliaLang/julia/pull/29525
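For the adjoint direction, such a threaded sparse product could be sketched as follows (not the package's actual code; it exploits that in y = transpose(B) * x each column j of a SparseMatrixCSC contributes only to y[j]):

using SparseArrays, Base.Threads

function threaded_Bt_mul!(y::Vector, B::SparseMatrixCSC, x::Vector)
    rows, vals = rowvals(B), nonzeros(B)
    @threads for j in 1:size(B, 2)   # columns are independent output entries
        s = zero(eltype(y))
        for idx in nzrange(B, j)
            s += vals[idx] * x[rows[idx]]
        end
        y[j] = s
    end
    return y
end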

JakobAsslaender commented 2 years ago

Hi Tobias, may I revive this thread (pun intended)? My understanding is that the FFT itself can be parallelized by (globally) setting the number of threads for the FFTW package. When doing so in my use case (using LUT), the computation time is dominated by the convolve_adjoint! and apodization_adjoint! functions. As I don't really understand the metaprogramming code, I was wondering if you could provide some insight into how difficult it would be to multithread this code.

Thanks for your efforts, much appreciated!

tknopp commented 2 years ago

Are you using precomputeB?

If so, it basically boils down to multithreading sparse matrix operations: https://github.com/tknopp/NFFT.jl/blob/master/src/multidimensional.jl#L112 https://github.com/tknopp/NFFT.jl/blob/master/src/multidimensional.jl#L71

A sparse matrix-vector multiplication should be straightforward. There was also a PR to Julia Base for this: https://github.com/JuliaLang/julia/pull/29525

JakobAsslaender commented 2 years ago

I do not. I use each NFFT only once to calculate a Toeplitz kernel (cf. PR #57), so I don't think it is worth computing a sparse matrix, or is it? Yet, I create about 1000 Toeplitz kernels (each 512^3 incl. oversampling), so we are still talking ~2h of computation. The creation of the LUT is quick, as is the FFT (on a 40-core machine), and the biggest bottlenecks are the convolution and the apodization, which are single threaded:

[ Info: Timing: conv=1.438624615 fft=0.516005343 apod=1.054821915

So I would looovvvvveee a multi-threaded version of these functions. At least the apodization should be rather trivially parallel, shouldn't it?

tknopp commented 2 years ago

Hey Jakob, could you provide the exact parameters with which you ended up with those numbers? I cannot get an apodization step that is so long for 512^3. In contrast, conv is much longer in my settings.

In any case: I created the branch tk/multithreading, in which I added some infrastructure to support multithreading in a way that does not affect serial performance. Basically, you can now use @cthreads wherever you want, and it will switch to serial mode when nthreads() == 1. You can change it at runtime using NFFT._use_threads[] = true/false.
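For illustration, a simplified sketch of how such a switch could be built (hypothetical; the actual macro on the branch may differ):

const _use_threads = Ref(true)

macro cthreads(loop::Expr)
    quote
        if Threads.nthreads() > 1 && _use_threads[]
            $(esc(:(Threads.@threads $loop)))   # threaded path
        else
            $(esc(loop))                        # plain serial loop
        end
    end
end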

I also added the multithreaded nfft!, and it directly gave me a factor-of-2 speedup on my 2-core CPU. The adjoint, however, will be trickier, so there we still need an efficient implementation.

tknopp commented 2 years ago

I cleaned up the performance stats code a little bit, which will be useful for pushing this forward. There is also a test nfft_performance_2 that tracks single- vs. multi-threaded performance. Right now I get on my 2-core system:

 ##### nfft_performance_2 - multithreading ##### 

[ Info: * precomputation = LUT threading = true
Timing: pre = 0.0020 s apod = 0.0020 / 0.0044 s fft = 0.0671 / 0.0670 s conv = 0.7846 / 1.4492 s
                       apod = 0.0024 / 0.0029 % fft = 0.0786 / 0.0441 % conv = 0.9191 / 0.9530 %
[ Info: * precomputation = LUT threading = false
Timing: pre = 0.0886 s apod = 0.0020 / 0.0045 s fft = 0.0659 / 0.0677 s conv = 1.5835 / 1.4377 s
                       apod = 0.0012 / 0.0030 % fft = 0.0399 / 0.0448 % conv = 0.9589 / 0.9521 %

JakobAsslaender commented 2 years ago

Oh, ok. I realize that I am using it in a somewhat exotic, i.e. highly undersampled, case. With an oversampled matrix size of 512^3, I have only 46k data points. I reconstruct, in the widest sense, a time series where I use correlations between the different time frames; hence the strong undersampling. But I guess this is pretty common in dynamic imaging, so a speedup at the points relevant for me could be of interest to a broader community.

I took a look at your multithreading branch. Great that you get a good speedup in the forward convolution! For calculating the Toeplitz kernel, however, I only use the adjoint nfft (as the forward nfft of a centered Kronecker delta is just one everywhere). So the functions for which I would like multithreading are convolve_adjoint_LUT! and apodization_adjoint!.

tknopp commented 2 years ago

Yes, highly undersampled data is of course a use case!

So we need to speed up the apodization as well. We can do this with multi-threading, but probably there is some other low-hanging fruit as well. I never investigated that operation because it was never slow enough.

And yes, I know that you want the adjoint to be fast. I just wanted to get started with something that is easy :-)

tknopp commented 2 years ago

I changed the second performance test so that we can directly test your setup. I now get

[ Info: * precomputation = LUT threading = false
Timing: pre = 0.3447 s apod = 0.1277 / 0.3640 s fft = 12.2369 / 7.3867 s conv = 0.3526 / 1.3175 s
                       apod = 0.0100 / 0.0401 % fft = 0.9622 / 0.8146 % conv = 0.0277 / 0.1453 %

for N = 256 (2x oversampling) and M = 46000. This is the single-threaded performance, so I am completely FFT dominated. And in particular, my apodizations are much faster than in your setup (I am on a 7-year-old MacBook).

JakobAsslaender commented 2 years ago

FYI: I incorporated multithreading in the pre-computation of the LUT in #57.

So the FFT becomes really fast when a) having many cores (e.g. 40) and b) using the MEASURE flag. The latter gives, in my use case, a 100x speedup compared to the ESTIMATE flag on the same cores. For this reason, I also created an in-place constructor of the NFFTPlan in #57, which allows me to measure the FFT plan once and then replace the LUT for each time frame.

tknopp commented 2 years ago

Ok, I see, the MEASURE flag was what I was missing. That indeed makes things drastically faster.

I will have a look at the multithreaded adjoint conv, and we can also multi-thread the apodizations. It would be great if you could give me performance test code similar to nfft_performance_2 so that we can specifically test those edge cases.

I will grant you commit rights so that you can also work directly on branches here.

JakobAsslaender commented 2 years ago

I added an example to the branch, but it is hard to emulate, as my use case with an oversampled matrix size of 512^3 a) needs more memory than is available on some systems (19 GB) and b) is dominated by the FFT time unless you have many CPUs.

tknopp commented 2 years ago

Ok, thanks, we need to tackle this step by step.

In the meantime I implemented a "straightforward" multi-threading implementation here: https://github.com/tknopp/NFFT.jl/blob/tk/multithreading/src/multidimensional.jl#L85

Obviously the allocating part needs to be moved into the planning phase, so this is just experimenting with it. My conclusions are:

  1. For M = N I get a nice speedup, but not as good as for the regular convolve.
  2. In your case the zero filling dominates the time.

So I fear that this is not completely trivial. 1. might be fixable by sorting the nodes; 2. depends a little bit on your use case. Are you using the same plan several times and just exchanging the nodes? Then my implementation could already work. We just need to pull g_temp into the plan.

Another idea would be to do some research on how others have tackled this. There are several GPU implementations out there, and I think the NFFT C library is also multi-threaded.

tknopp commented 2 years ago

https://www-user.tu-chemnitz.de/~potts/paper/openmpNFFT.pdf

tknopp commented 2 years ago

I should have read that before trying to tackle this...

We need the algorithm sketched on pages 13/14, but it doesn't look so easy that one could implement it before breakfast ;-)

JakobAsslaender commented 2 years ago

Yeah, those zero fills can be nasty. That is exactly what I fixed with my first tiny test-balloon PR in the constructor.

Uff, yeah, have a big breakfast first! I see that there is a Julia interface to the NFFT3 package and that you are named as a contributor. Do you know the advantages/shortcomings of this C library for MR image reconstruction? I have never used it...

tknopp commented 2 years ago

There is no real downside to NFFT3, and there is even a Julia wrapper for it: https://github.com/NFFT/NFFT3.jl

I started NFFT.jl as an experiment to test the language and its claim to be as fast as C. Apparently the experiment was somewhat successful. The two libraries don't have the same features, but IMHO it is a huge advantage that the Julia community has its own pure NFFT implementation that remains hackable for people who don't know the NFFT3 code base by heart.

But back to the topic: on the tk/multithreading branch I implemented two new versions of apodization! and apodization_adjoint!. I switched to Float32, which I would recommend you do as well, and the runtime seems to be reduced by a factor of 2-4. This is still without multi-threading. Multi-threading these operations is feasible, but not so simple using the Cartesian macros. In my updated version I already peel off the inner for loop; for proper multi-threading I would need to peel off the outer for loop as well. Not sure how to do that, though.
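For a plain (non-metaprogramming) kernel, what peeling off and threading the outer loop means could be sketched like this (hypothetical, not the branch's actual code; a separable apodization-style scaling in 3D):

using Base.Threads

function apodize_sketch!(out::Array{T,3}, g::Array{T,3}, w1, w2, w3) where T
    @threads for k in axes(g, 3)               # outer loop peeled off and threaded
        for j in axes(g, 2), i in axes(g, 1)   # inner loops stay serial
            out[i, j, k] = g[i, j, k] * w1[i] * w2[j] * w3[k]
        end
    end
    return out
end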

tknopp commented 2 years ago

Update: I finally managed to also multithread the apodization function, which will give you an additional speedup. I only have two cores right now, so I cannot test it at larger scale. But I hope that with this the apodization is no longer the bottleneck in your benchmarks.

tknopp commented 2 years ago

Note to myself: We will use the multi-threading strategy for the adjoint convolve described in this paper: https://arxiv.org/abs/1808.06736

tknopp commented 2 years ago

Done! We now use a blocking strategy similar to the one outlined in FINUFFT, which is actually quite similar to NFFT3. I did not go into the complete details, but if I understand it correctly, the difference is just that NFFT3 blocks on the linearized memory, while FINUFFT blocks in both directions.

The nice thing is that blocking even helps in the non-threaded case, so we can use this as the default path. Here are some preliminary benchmarks. They still have some smaller problems, but I don't expect larger deviations anymore. For NFFT3 I use PRE_PSI instead of PRE_LIN_PSI, which gives a faster nfft at the cost of slightly longer precomputation.

[benchmark plot: performance_mt_LUT]