This PR adds multi-threading to the radix-2 dft. It's slower for small vectors and faster for large ones, so I've hardcoded a threshold: multi-threading is used implicitly only when the vector length is greater than 2^9.
The dft consists of three nested loops. The outer loop is serial, does log2(n) iterations, and is not parallelized at all. The two inner loops do 2^a and 2^b iterations respectively, for each combination of a and b where a+b+1 = log2(n), so each outer iteration performs 2^(a+b) = n/2 butterflies. The two inner loops can be parallelized jointly to spread that work evenly among threads. I used the threaded matrix multiplication code as the example of how to do multi-threading with the pthreads library.
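
To make the loop structure concrete, here is a minimal sketch of the idea, not the code in this PR. It assumes plain C99 `double complex` instead of `acb_t`, roots of unity `w[r] = exp(-2*pi*i*r/n)` for `r < n/2` precomputed by the caller, and one create/join round of threads per outer iteration. All names (`stage_job`, `stage_worker`, `dft_rad2_threaded_sketch`, `SKETCH_MAX_THREADS`) are hypothetical. The two inner loops are flattened into a single index `t` in `[0, n/2)`, and that range is cut into contiguous chunks, one per thread.

```c
#include <assert.h>
#include <complex.h>
#include <pthread.h>
#include <stddef.h>

#define SKETCH_MAX_THREADS 16  /* hypothetical cap so the job arrays fit on the stack */

/* Hypothetical work descriptor: one thread's share of one outer iteration. */
typedef struct {
    double complex *v;        /* data, already in bit-reversed order */
    const double complex *w;  /* assumed precomputed: w[r] = exp(-2*pi*i*r/n), r < n/2 */
    size_t n, stage;          /* stage b: blocks of 2^b butterflies */
    size_t t0, t1;            /* half-open range of flattened butterfly indices */
} stage_job;

static void *stage_worker(void *arg)
{
    stage_job *job = arg;
    size_t k = (size_t)1 << job->stage;  /* 2^b butterflies per block */
    size_t l = job->n / (2 * k);         /* 2^a blocks; also the twiddle stride */
    for (size_t t = job->t0; t < job->t1; t++) {
        size_t block = t >> job->stage;  /* index of the middle loop */
        size_t j = t & (k - 1);          /* index of the innermost loop */
        double complex *p = job->v + 2 * k * block + j;
        double complex tmp = p[k] * job->w[j * l];
        p[k] = p[0] - tmp;               /* each t touches its own pair, so no locking */
        p[0] = p[0] + tmp;
    }
    return NULL;
}

/* Standard in-place bit-reversal permutation for n a power of two. */
static void bit_reverse_permute(double complex *v, size_t n)
{
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) {
            double complex tmp = v[i];
            v[i] = v[j];
            v[j] = tmp;
        }
    }
}

void dft_rad2_threaded_sketch(double complex *v, const double complex *w,
                              size_t n, size_t num_threads)
{
    pthread_t threads[SKETCH_MAX_THREADS];
    stage_job jobs[SKETCH_MAX_THREADS];
    size_t half = n / 2;

    assert(n >= 2 && (n & (n - 1)) == 0);
    assert(num_threads >= 1 && num_threads <= SKETCH_MAX_THREADS);

    bit_reverse_permute(v, n);
    /* Outer loop: serial, log2(n) iterations. */
    for (size_t stage = 0; ((size_t)1 << stage) < n; stage++) {
        /* Inner loops: n/2 butterflies split into contiguous chunks. */
        for (size_t i = 0; i < num_threads; i++) {
            jobs[i] = (stage_job){ v, w, n, stage,
                                   half * i / num_threads,
                                   half * (i + 1) / num_threads };
            int rc = pthread_create(&threads[i], NULL, stage_worker, &jobs[i]);
            assert(rc == 0);
            (void) rc;
        }
        /* Joining acts as the barrier between outer iterations. */
        for (size_t i = 0; i < num_threads; i++)
            pthread_join(threads[i], NULL);
    }
}
```

Creating and joining threads on every outer iteration is the simplest way to get the barrier between iterations. It also illustrates why threading loses on short vectors, since the thread overhead is paid log2(n) times no matter how little work each iteration holds.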
I see that @pascalmolin wrote a lot of the dft stuff in arb, so if I did something dumb in this PR maybe he would notice.
Here's some output from a modified version of https://github.com/fredrik-johansson/arb/blob/master/acb_dft/profile/p-dft.c. 'default' is single-threaded rad2, 'thread4' is rad2 with four threads, and 'precomp' is single-threaded rad2 with setup/teardown (including calculation of the roots of unity) pulled out of the timing repetition loop.