flintlib / arb

Arb has been merged into FLINT -- use https://github.com/flintlib/flint/ instead
http://arblib.org/
GNU Lesser General Public License v2.1
455 stars 137 forks source link

add radix2 dft multi-threading #340

Closed p15-git-acc closed 3 years ago

p15-git-acc commented 3 years ago

This PR adds radix 2 dft multi-threading. It's slower for small vectors and faster for large vectors, so I've hardcoded a threshold where multi-threading will only be used implicitly when the vector length is greater than 2^9.

The dft consists of three nested loops. The outer serial loop does log2(n) iterations and is not parallelized at all. The two nested inner loops consist of 2^a and 2^b iterations respectively for each combination of a and b where a+b+1= log2(n), and they can be jointly parallelized to spread the work evenly among threads. I used the threaded matrix multiplication code as the example of how to do multi-threading with the pthreads library.

I see that @pascalmolin wrote a lot of the dft stuff in arb, maybe if I did something dumb in this PR he would notice.

Here's some output of a modified version of https://github.com/fredrik-johansson/arb/blob/master/acb_dft/profile/p-dft.c, 'default' is single threaded rad2, 'thread4' is rad2 with four threads, and 'precomp' is single threaded rad2 with setup/teardown (including calculation of roots of unity) pulled out of the timing repetition loop.

1280 * DFT(256 = 2^8), prec 64....
default        ...  cpu/wall(s): 0.367 0.367

thread4        ...  cpu/wall(s): 1.037 0.499

precomp        ...  cpu/wall(s): 0.355 0.354

640 * DFT(512 = 2^9), prec 64....
default        ...  cpu/wall(s): 0.413 0.413

thread4        ...  cpu/wall(s): 0.876 0.372

precomp        ...  cpu/wall(s): 0.405 0.404

320 * DFT(1024 = 2^10), prec 64....
default        ...  cpu/wall(s): 0.457 0.456

thread4        ...  cpu/wall(s): 0.738 0.277

precomp        ...  cpu/wall(s): 0.456 0.457

160 * DFT(2048 = 2^11), prec 64....
default        ...  cpu/wall(s): 0.527 0.528

thread4        ...  cpu/wall(s): 0.705 0.249

precomp        ...  cpu/wall(s): 0.514 0.513

80 * DFT(4096 = 2^12), prec 64....
default        ...  cpu/wall(s): 0.586 0.586

thread4        ...  cpu/wall(s): 0.7 0.218

precomp        ...  cpu/wall(s): 0.563 0.564

40 * DFT(8192 = 2^13), prec 64....
default        ...  cpu/wall(s): 0.628 0.631

thread4        ...  cpu/wall(s): 0.702 0.205

precomp        ...  cpu/wall(s): 0.607 0.607

640 * DFT(256 = 2^8), prec 128....
default        ...  cpu/wall(s): 0.233 0.233

thread4        ...  cpu/wall(s): 0.604 0.29

precomp        ...  cpu/wall(s): 0.225 0.226

320 * DFT(512 = 2^9), prec 128....
default        ...  cpu/wall(s): 0.278 0.278

thread4        ...  cpu/wall(s): 0.518 0.219

precomp        ...  cpu/wall(s): 0.272 0.272

160 * DFT(1024 = 2^10), prec 128....
default        ...  cpu/wall(s): 0.301 0.302

thread4        ...  cpu/wall(s): 0.46 0.178

precomp        ...  cpu/wall(s): 0.293 0.294

80 * DFT(2048 = 2^11), prec 128....
default        ...  cpu/wall(s): 0.334 0.334

thread4        ...  cpu/wall(s): 0.431 0.142

precomp        ...  cpu/wall(s): 0.322 0.322

40 * DFT(4096 = 2^12), prec 128....
default        ...  cpu/wall(s): 0.363 0.363

thread4        ...  cpu/wall(s): 0.444 0.144

precomp        ...  cpu/wall(s): 0.363 0.363

20 * DFT(8192 = 2^13), prec 128....
default        ...  cpu/wall(s): 0.398 0.398

thread4        ...  cpu/wall(s): 0.48 0.152

precomp        ...  cpu/wall(s): 0.427 0.426

160 * DFT(256 = 2^8), prec 512....
default        ...  cpu/wall(s): 0.112 0.112

thread4        ...  cpu/wall(s): 0.232 0.101

precomp        ...  cpu/wall(s): 0.098 0.098

80 * DFT(512 = 2^9), prec 512....
default        ...  cpu/wall(s): 0.122 0.122

thread4        ...  cpu/wall(s): 0.208 0.099

precomp        ...  cpu/wall(s): 0.116 0.116

40 * DFT(1024 = 2^10), prec 512....
default        ...  cpu/wall(s): 0.134 0.134

thread4        ...  cpu/wall(s): 0.182 0.072

precomp        ...  cpu/wall(s): 0.122 0.122

20 * DFT(2048 = 2^11), prec 512....
default        ...  cpu/wall(s): 0.143 0.143

thread4        ...  cpu/wall(s): 0.205 0.068

precomp        ...  cpu/wall(s): 0.139 0.139

10 * DFT(4096 = 2^12), prec 512....
default        ...  cpu/wall(s): 0.157 0.157

thread4        ...  cpu/wall(s): 0.214 0.076

precomp        ...  cpu/wall(s): 0.154 0.155

5 * DFT(8192 = 2^13), prec 512....
default        ...  cpu/wall(s): 0.175 0.175

thread4        ...  cpu/wall(s): 0.207 0.065

precomp        ...  cpu/wall(s): 0.163 0.163

64 * DFT(256 = 2^8), prec 1024....
default        ...  cpu/wall(s): 0.066 0.066

thread4        ...  cpu/wall(s): 0.128 0.051

precomp        ...  cpu/wall(s): 0.064 0.065

32 * DFT(512 = 2^9), prec 1024....
default        ...  cpu/wall(s): 0.075 0.074

thread4        ...  cpu/wall(s): 0.12 0.044

precomp        ...  cpu/wall(s): 0.072 0.071

16 * DFT(1024 = 2^10), prec 1024....
default        ...  cpu/wall(s): 0.086 0.086

thread4        ...  cpu/wall(s): 0.122 0.04

precomp        ...  cpu/wall(s): 0.082 0.082

8 * DFT(2048 = 2^11), prec 1024....
default        ...  cpu/wall(s): 0.092 0.092

thread4        ...  cpu/wall(s): 0.134 0.046

precomp        ...  cpu/wall(s): 0.092 0.092

4 * DFT(4096 = 2^12), prec 1024....
default        ...  cpu/wall(s): 0.103 0.103

thread4        ...  cpu/wall(s): 0.145 0.048

precomp        ...  cpu/wall(s): 0.103 0.104

2 * DFT(8192 = 2^13), prec 1024....
default        ...  cpu/wall(s): 0.119 0.119

thread4        ...  cpu/wall(s): 0.151 0.048

precomp        ...  cpu/wall(s): 0.116 0.116