flintlib / arb

Arb has been merged into FLINT -- use https://github.com/flintlib/flint/ instead
http://arblib.org/
GNU Lesser General Public License v2.1

Adjust the height at which multi-threading begins in Platt's zeta zeros algorithm. #335

Closed. p15-git-acc closed this issue 3 years ago.

p15-git-acc commented 3 years ago

See https://github.com/fredrik-johansson/arb/issues/320#issuecomment-696254301.

rudolph-git-acc commented 3 years ago

Did some further testing on this, and the data below confirms that multi-threading doesn't seem to yield any benefits below 10^16. Surprisingly, at these lower heights it doesn't seem to have any adverse effects on CPU time or memory use either.

For the CPU time, I think the explanation is that building the S_table at these lower heights is just a very small fraction of the overall CPU time required (J and A, B, K are all relatively small).

Even at 10^15, the memory claims for a single thread and for 64 threads are the same (the 64-threaded run is only 7 secs faster). The only explanation I can think of is that processes other than building the S_table drive the memory statistics below. Only when the A=16 heuristic kicks in (currently at 10^15) does the total memory consumption of the S_table exceed the claims from those other processes, at which point its memory needs start to dominate.

Since there don't seem to be any benefits or adverse effects for a user accidentally running lower heights with too many threads, I wonder whether a threshold actually needs to be hard-coded.

This leaves open my suggestion in #320 to let the A=16 heuristic kick in at n=10^16 instead of at n=10^15. Any thoughts on that?

threads height  prec    #req        cpu(s)  wall(s)      virt  virt-peak  res  res-peak   (memory in MB)
1   10^4    256 1000      83.819 83.820       246.66 672.71 226.54 652.48
64  10^4    256 1000      83.699 83.700       246.66 672.71 226.62 652.60

1   10^5    256 1000      82.501 82.501       280.37 672.71 260.42 652.48
64  10^5    256 1000      84.449 84.450       280.37 672.71 260.37 652.60

1   10^6    256 1000      82.860 82.862       280.37 672.71 260.42 652.48
64  10^6    256 1000      82.109 82.110       280.37 672.71 260.37 652.60

1   10^7    256 1000      82.658 82.659       280.37 672.71 260.42 652.48
64  10^7    256 1000      82.702 82.704       280.37 672.71 260.37 652.60

1   10^8    256 1000      81.022 81.023       280.37 672.71 260.42 652.48
64  10^8    256 1000      80.902 80.903       280.37 672.71 260.37 652.60

1   10^9    256 1000      80.018 80.019       267.88 672.71 247.95 652.48
64  10^9    256 1000      79.399 79.400       267.88 672.71 247.90 652.60

1   10^10   256 1000      76.946 76.947       267.88 672.71 247.95 652.48
64  10^10   256 1000      76.024 76.026       267.88 672.71 247.90 652.60

1   10^11   256 1000      75.951 75.952       267.88 672.71 247.95 652.48
64  10^11   256 1000      75.115 75.116       267.88 672.71 247.90 652.60

1   10^12   256 1000      74.821 74.821       251.23 672.71 231.30 652.48
64  10^12   256 1000      73.749 73.751       251.23 672.71 231.25 652.60

1   10^13   256 1000      78.731 78.732       251.23 672.71 231.30 652.48
64  10^13   256 1000      78.050 78.050       251.23 672.71 231.25 652.60

1   10^14   256 1000      98.647 98.648       251.23 672.71 231.30 652.48
64  10^14   256 1000      98.285 98.287       251.23 672.71 231.25 652.60

1   10^15   256 1000    729.304 729.312       1077.56 3045.38 1057.51 3025.27 (A=16, K=76)
64  10^15   256 1000    722.491 722.502       1077.56 3045.38 1057.46 3025.40

1   10^16   256 1000    1066.24 1066.21       1060.68  3004.51 1040.27  2984.18 (A=16, K=74)
64  10^16   256 1000    2038.12 605.065       5188.67 62425.66 1137.45 57888.95

1   10^17   256 1000    2058.28 2058.34       1067.74  2963.58 1047.46  2943.36 (A=16, K=72)
64  10^17   256 1000    4490.28 606.213       5259.71 60865.66 1151.52 56343.93

1   10^18   256 1000    5127.71 5127.76       1074.13  2922.18 1053.92  2901.79 (A=16, K=70)
64  10^18   256 1000    13246.2 724.053       5434.19 60865.66 1352.93 56343.93
p15-git-acc commented 3 years ago

I haven't done anything for this issue yet, but I thought I'd mention a couple of things related to creating/interpreting these tables.

One is related to the number of requested zeros. The local zeros algorithm roughly progresses through three stages:

  1. make the table that requires J iterations
  2. dft
  3. Turing's method

I haven't profiled the code in detail, but I'm pretty sure that (2) dominates the run time at low heights and (1) dominates at greater heights. With the current heuristics system, the number of requested zeros only comes into play at step (3), if I remember correctly. Anyway, my point is that if you request many more zeros (like A*B zeros) and note the number of zeros returned, that will give more information for the table, and I don't think it should require much more run time. With this information the per-isolated-zero amortized cpu usage could be calculated.

The second thing I thought I should mention is that when you made the table, I think you were running a version of the code that already has a hard-coded multi-threading cutoff. This is why it looks like multi-threading isn't doing anything at low heights -- it simply wasn't being used.

rudolph-git-acc commented 3 years ago

(...)you were running a version of the code that already has a hard coded multithreading cutoff.(...)

Yes, that explains it! I hadn't realised that this restriction was hard coded in the latest version of multieval.c:

if (flint_get_num_threads() > 1 && J > 10000000)
{
    acb_dirichlet_platt_multieval_threaded(
            out, T, A, B, h, J, K, sigma, prec);
}
/* ... otherwise the single-threaded evaluation is used */

Running the table again without this constraint yields the results below. At lower heights, multi-threading actually works adversely (more memory needed and slower). The first benefits from threading become visible around 10^13, but come at the cost of considerable memory claims. Since the CPU wall times are quite low at these heights, the constraint J > 10 million actually does look reasonable (unless a user wants to compute many batches of 1000 zeros at 10^14, in which case a ~30% speed gain remains out of reach).

I will try to do some profiling on the three stages you mentioned in your post.

#zeros requested = 1000, prec = 256:

threads height  A       B       K       J           #found   cpu(s) wall(s)   virt   virt-peak  res   res-peak  (memory in MB)
1   10^4    A: 8    B: 4096 K: 64   J: 66       1000     84.094 84.096  246.66   672.70  226.56   652.48
64  10^4    A: 8    B: 4096 K: 64   J: 66       1000     89.747 89.731 1655.99 14232.47 1092.89 13189.96

1   10^5    A: 8    B: 4096 K: 61   J: 173      1000     82.209 82.209  280.24   672.70  260.20   652.48
64  10^5    A: 8    B: 4096 K: 61   J: 173      1000     88.192 88.172 1783.99 14232.47 1093.23 13189.96

1   10^6    A: 8    B: 4096 K: 59   J: 477      1000     84.895 84.897  280.24   672.70  260.20   652.48
64  10^6    A: 8    B: 4096 K: 59   J: 477      1000     88.160 88.120 2231.99 14232.47 1095.13 13189.96

1   10^7    A: 8    B: 4096 K: 57   J: 1331     1000     82.813 82.814  280.24   672.70  260.20   652.48
64  10^7    A: 8    B: 4096 K: 57   J: 1331     1000     87.733 87.641 2540.19 14232.47 1022.57 13189.96

1   10^8    A: 8    B: 4096 K: 55   J: 3842     1000     79.647 79.647  280.24   672.70  260.20   652.48
64  10^8    A: 8    B: 4096 K: 55   J: 3842     1000     85.641 85.428 3756.19 14466.67 1026.29 13189.96

1   10^9    A: 8    B: 4096 K: 53   J: 11196    1000     79.361 79.361  267.65   672.70  247.62   652.48
64  10^9    A: 8    B: 4096 K: 53   J: 11196    1000     84.960 84.399 4908.19 15127.65 1032.13 13189.96

1   10^10   A: 8    B: 4096 K: 50   J: 32887    1000     76.459 76.458  261.28   672.70  241.24   652.48
64  10^10   A: 8    B: 4096 K: 50   J: 32887    1000     82.844 81.202 5164.19 15127.65 1038.04 13189.96

1   10^11   A: 8    B: 4096 K: 48   J: 97207    1000     75.185 75.183  261.28   672.70  241.24   652.48
64  10^11   A: 8    B: 4096 K: 48   J: 97207    1000     82.755 78.383 5164.19 15127.65 1046.16 13189.96

1   10^12   A: 8    B: 4096 K: 45   J: 288793   1000     74.382 74.382  261.28   672.70  241.24   652.48
64  10^12   A: 8    B: 4096 K: 45   J: 288793   1000     87.244 75.156 5164.19 15127.65 1056.92 13189.96

1   10^13   A: 8    B: 4096 K: 42   J: 861606   1000       78.452   78.451  249.12   672.70  229.09   652.48
64  10^13   A: 8    B: 4096 K: 42   J: 861606   1000      103.912   70.913 5164.19 15127.65 1061.01 13189.96

1   10^14   A: 8    B: 4096 K: 40   J: 2579682  1000       99.054   99.053  249.12   672.70  229.09   652.48
64  10^14   A: 8    B: 4096 K: 40   J: 2579682  1000      162.040   69.185 5164.19 15127.65 1075.24 13189.96

1   10^15   A: 16   B: 8192 K: 76   J: 8047580  1000      739.403  739.402 1065.14  3044.97 1044.98  3024.94
64  10^15   A: 16   B: 8192 K: 76   J: 8047580  1000     1101.740  596.905 5349.18 64924.68 1331.13 60427.45

1   10^16   A: 16   B: 8192 K: 74   J: 24435419 1000     1073.620 1073.620 1228.68  3044.97 1208.52  3024.94
64  10^16   A: 16   B: 8192 K: 74   J: 24435419 1000     2071.190  597.879 5339.47 64924.68 1342.18 60427.45

1   10^17   A: 16   B: 8192 K: 72   J: 74366208 1000     2132.860 2132.850 1111.97  3046.44 1091.94  3026.09
64  10^17   A: 16   B: 8192 K: 72   J: 74366208 1000     4904.930  628.124 5311.86 64924.68 1332.42 60427.45
p15-git-acc commented 3 years ago

Besides requesting at least A*B zeros (say 150000 of them instead of 1000) and showing the number of returned zeros in the table so that amortized cost can be calculated, the median mag_rad or rel_accuracy of the isolated zeros would also be useful. These would be especially helpful in the future if different parameter settings are compared for the same height.

I think it should also be possible to tile the J loop slightly irregularly, so that each of n threads does close to J/n iterations and each thread writes only to its own section of the table. This would mean the per-thread temporary tables would no longer be required, so multi-threading wouldn't carry any memory premium. Less importantly, it also occurs to me that the larger j iterations might take longer than the smaller ones, so giving each thread approximately J/n iterations might not be ideal for balancing the work among threads. Another thing: I didn't try to micro-optimize the j loop itself, because I profiled mainly at lower heights where that loop is not the limiting factor.
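A rough sketch of the tiling idea (an illustration only, not the arb code; the j -> table-section mapping below is a toy stand-in and its formula is a guess -- only its slowly growing, non-decreasing step shape matters):

#include <math.h>
#include <stdio.h>

/* Toy stand-in for the mapping from j to the table section it writes to.
 * The formula is a guess used purely for illustration. */
static long toy_index(long j, long B)
{
    const double pi = 3.14159265358979323846;
    return (long) floor(B * log(j * sqrt(pi)) / (2 * pi));
}

/* Split j = 1..J into nthreads contiguous ranges of roughly equal length,
 * then push each internal boundary forward until the section index changes,
 * so that no two ranges ever write to the same section of the table. */
static void tile_j_loop(long J, long B, int nthreads, long *start)
{
    int i;
    start[0] = 1;
    for (i = 1; i < nthreads; i++)
    {
        long b = 1 + (i * J) / nthreads;
        while (b <= J && toy_index(b, B) == toy_index(b - 1, B))
            b++;
        start[i] = b;
    }
    start[nthreads] = J + 1;  /* one-past-the-end sentinel */
}

int main(void)
{
    long start[5];
    int i, nthreads = 4;
    tile_j_loop(24435419, 8192, nthreads, start);  /* J and B as used near 10^16 */
    for (i = 0; i < nthreads; i++)
        printf("thread %d: j in [%ld, %ld)\n", i, start[i], start[i + 1]);
    return 0;
}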

Edit: https://github.com/fredrik-johansson/arb/pull/339 removes the multi-threading memory overhead

rudolph-git-acc commented 3 years ago

#339 removes the multi-threading memory overhead

Wow! That is a step change forward, since a larger number of threads will no longer induce a memory constraint. Very nice :-) I'll test it later today.

Besides requesting at least A*B zeros (say 150000 of them instead of 1000) and showing the number of returned zeros in the table so that amortized cost can be calculated, the median mag_rad or rel_accuracy of the isolated zeros would also be useful.

Below are the first results. Let me know if other data is required. I used arb_rel_accuracy_bits as the criterion to sort the full list returned by acb_dirichlet_platt_local_hardy_z_zeros, and then computed the lowest, median and highest relative accuracy of the zeros actually found.
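A minimal sketch of that kind of measurement (not the exact script used here, and with a smaller request for brevity; it assumes the documented acb_dirichlet_platt_local_hardy_z_zeros interface):

#include <stdlib.h>
#include "acb_dirichlet.h"

static int cmp_slong(const void *a, const void *b)
{
    slong x = *(const slong *) a, y = *(const slong *) b;
    return (x > y) - (x < y);
}

int main(void)
{
    slong prec = 256, len = 1000, found, i;
    fmpz_t n;
    arb_ptr zeros = _arb_vec_init(len);
    slong *acc = flint_malloc(len * sizeof(slong));

    fmpz_init(n);
    fmpz_set_ui(n, 10000);  /* start at the 10^4-th zero */

    found = acb_dirichlet_platt_local_hardy_z_zeros(zeros, n, len, prec);

    /* relative accuracy in bits of each isolated zero */
    for (i = 0; i < found; i++)
        acc[i] = arb_rel_accuracy_bits(zeros + i);
    qsort(acc, found, sizeof(slong), cmp_slong);

    if (found > 0)
        flint_printf("#found %wd  lowest %wd  median %wd  highest %wd\n",
                found, acc[0], acc[found / 2], acc[found - 1]);

    _arb_vec_clear(zeros, len);
    flint_free(acc);
    fmpz_clear(n);
    flint_cleanup();
    return 0;
}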

150000 zeros requested at prec: 128 using 1 thread(s)
                                                                               relative acc (bits)
height  A       B       K       J              #found   cpu/wall(s)           lowest  median  highest
10^4    8   4096    64  66      1010    54.33 54.345        19  72  93  
10^5    8   4096    61  173     1184    55.265 55.266       19  74  94  
10^6    8   4096    59  477     1475    57.677 57.678       24  75  94  
10^7    8   4096    57  1331        1867    59.884 59.871       27  76  95  
10^8    8   4096    55  3842        2158    61.377 61.369       31  76  95  
10^9    8   4096    53  11196       2464    62.9 62.895     34  77  94  
10^10   8   4096    50  32887       2731    63.718 63.715       37  78  95  
10^11   8   4096    48  97207       2963    65.524 65.522       43  79  95  
10^12   8   4096    45  288793      3089    66.374 66.373       50  81  95  
10^13   8   4096    42  861606      3654    78.093 78.098       47  81  96  
10^14   8   4096    40  2579682     3811    101.978 101.982     52  81  95  
10^15   16  8192    76  8047580     4670    503.222 503.228     57  79  90  
10^16   16  8192    74  24435419    4978    827.078 827.078     58  79  91  
10^17   16  8192    72  74366208    4926    1814.42 1814.41     59  81  89  
10^18   16  8192    70  226784107   4393    4581.09 4581.15     66  82  87  
10^19   16  8192    69  692836557   4322    12665.5 12665.7     69  82  87

150000 zeros requested at prec: 256 using 1 thread(s)

10^4    8   4096    64  66      1707    103.801 103.804     18  168 221 
10^5    8   4096    61  173     1998    105.361 105.364     20  169 222 
10^6    8   4096    59  477     2566    111.899 111.902     22  170 222 
10^7    8   4096    57  1331        3299    120.927 120.929     27  171 223 
10^8    8   4096    55  3842        3863    127.581 127.584     32  173 223 
10^9    8   4096    53  11196       4511    138.705 138.708     33  172 222 
10^10   8   4096    50  32887       5181    140.131 140.134     36  173 223 
10^11   8   4096    48  97207       5908    153.058 153.061     39  174 223 
10^12   8   4096    45  288793      6447    153.306 153.283     54  177 223 
10^13   8   4096    42  861606      7360    177.858 177.849     51  176 223 
10^14   8   4096    40  2579682     7914    204.473 204.469     51  164 206 
10^15   16  8192    76  8047580     11486   1170.54 1170.56     53  174 218 
10^16   16  8192    74  24435419    12417   1652.78 1652.77     56  175 219 
10^17   16  8192    72  74366208    13258   2998.9 2998.91      61  177 219 
10^18   16  8192    70  226784107   14182   5690.92 5691.11     64  177 220 
10^19   16  8192    69  692836557   15034   15764.1 15764.3     68  178 219

150000 zeros requested at prec: 512 using 1 thread(s)

10^4    8   4096    64  66      2263    161.528 161.532     17  279 370 
10^5    8   4096    61  173     2644    159.274 159.275     21  281 371 
10^6    8   4096    59  477     3384    174.738 174.716     22  278 366 
10^7    8   4096    57  1331        4034    187.525 187.516     32  246 322 
10^8    8   4096    55  3842        4663    199.424 199.418     31  234 305 
10^9    8   4096    53  11196       5216    211.912 211.92      37  223 288 
10^10   8   4096    50  32887       5842    223.743 223.748     37  211 272 
10^11   8   4096    48  97207       6381    235.218 235.22      40  199 256 
10^12   8   4096    45  288793      6929    245.364 245.364     45  187 239 
10^13   8   4096    42  861606      7470    259.342 259.341     47  176 223 
10^14   8   4096    40  2579682     7914    299.272 299.271     51  164 206 
10^15   16  8192    76  8047580     18663   1612.79 1612.78     53  366 474 
10^16   16  8192    74  24435419    19966   2302.38 2302.39     61  363 468 
10^17   16  8192    72  74366208    21290   4118.32 4118.46     63  353 454 
10^18   16  8192    70  226784107   22510   9689.17 9689.25     63  345 443
10^19   16  8192    69  692836557   23784   28028.2 28029.5     66  336 430
rudolph-git-acc commented 3 years ago

One is related to the number of requested zeros. The local zeros algorithm roughly progresses through three stages:

  1. make the table that requires J iterations
  2. dft
  3. turing's method

I haven't profiled the code in detail, but I'm pretty sure that (2) dominates run time at low height and (1) dominates at greater heights. With the current heuristics system, the number of requested zeros only comes into play at step (3) if I remember correctly

Your intuition appears to be strongly reinforced by the data (updated with improved timing).

I time-profiled the stages and only report the detailed processes that have some impact on timing. Requesting 150000 zeros actually makes the isolate_zeros stage dominant from 10^9 to 10^14. The do_convolution process dominates from 10^4 to 10^9 and again at 10^15; from there on, the S_table strongly and increasingly dominates (single thread).

The do_convolution step clearly consumes most of the time within the DFT stage. It seems quite stable at the larger n, though, and doesn't exceed 8-9 minutes (i.e. not a huge overhead compared to the ever-growing S_table). Behaviour could change at greater heights, so I am checking 10^20-10^22 as well (now using 64 threads to make generating the S_table tractable).

EDIT: added the 10^20-10^22 data. Surprisingly, the timings for do_convolution are declining, and the running time for isolate_zeros even exceeds it at 10^22.

p15-git-acc commented 3 years ago

Thanks for generating that profiling data. I was surprised that increasing the number of requested zeros makes the S_table take longer to make.

I've submitted https://github.com/fredrik-johansson/arb/pull/340 which adds multi-threading to the dft. I don't know how it would affect the hardy z zeros function but I assume it would speed it up at least a little bit in the ranges where the do_convolution step dominates.

rudolph-git-acc commented 3 years ago

I was surprised that increasing the number of requested zeros makes the S_table take longer to make.

This is very strange indeed; comparing the separate runs in detail, I spotted similar runtime differences for e.g. do_convolution. After rerunning a couple of n and comparing 1,000 with 150,000 requested zeros, the timings match up perfectly.

A possible cause of the differences between the two runs is that, in parallel with the 150,000-zero run, I launched another 3 threads to produce the relative accuracy data. On a 32-core machine these are not supposed to interfere with each other, but you never know exactly how the virtual threading/swapping is managed.

I'll rerun the time profiling overnight as the only task on the machine and report back. EDIT: yes, this solves the problem. Timing data looks consistent now. Will also rerun the relative accuracy data above in single batches.

p15-git-acc commented 3 years ago

in parallel to the 150,000 zero run, I launched another 3 threads to produce the relative accuracy data in parallel. On a 32 core machine, these are not supposed to interfere with each other

Yeah, I don't know that much about computer architecture, but I think that if a lot of memory is accessed by the threads outside of the per-core memory cache, it can cause a slowdown even if the threads are not accessing the same data.

rudolph-git-acc commented 3 years ago

In the spirit of squeezing more seconds out of the dominant S_table, I did some time measurements within the _platt_smk routine. The k-loop consumes quite a bit of time, but I don't see any further improvements there.

Maybe there is something to be gained from get_smk_index. This routine only takes B and j as inputs (I'll discuss prec later) and is called for each j. However, its output m stays unchanged for increasingly long runs of consecutive j as j increases. Here is a short table to illustrate this phenomenon.

          # consecutive j with m unchanged
j       B=4096      B=8192
10^3        1       1
10^4        15      7
10^5        153     77
10^6        1531        766
10^7        15305       7664
10^8        153271      76636
10^9        1532587     766294

The benefit unfortunately halves when B doubles, but it still seems there is a possible gain to be reaped here. For instance, when j runs from 10^10 to 10^11, get_smk_index only needs to be called about 9x10^10/7660000 ~ 11750 times instead of 9x10^10 times. And this benefit would increase with height.

A thought about prec. When I reduce _platt_smk to:

void
_platt_smk(acb_ptr table, const arb_t t0, slong A, slong B,
        slong Jstart, slong Jstop, slong K, slong prec)
{
    slong j, m;
    for (j = Jstart; j <= Jstop; j++)
    {
        get_smk_index(&m, B, j, prec);
    }
}

and run j from 1 .. 10^k (=J) with B=8192, I get (time in secs):

k              prec=64         prec=128        prec=256
6       0.58        0.73        1.31
7       5.62        7.28        13.31
8       56.7        72.7        130.9
9       573.1       730.4       1313.9

Since get_smk_index has an internal 'prec augmenter', starting with a lower prec = 64 when calling it removes the possible 'overkill' of using the general prec setting (which is at least 256+ at larger heights). Or is the generic prec required to compute logjsqrtpi(x, j, prec) for higher j?

Extrapolating the prec=64 data towards k=11 (n ~ 10^22), assuming the time increases by a factor of 10 per order, at that level I get a potential saving of 130,000 (prec=256) - 57,000 (prec=64) = 73,000 secs.

A test run of acb_dirichlet_platt_local_hardy_z_zeros at n=10^16, J = 10^8, 150,000 zeros, prec=256 yields:

get_smk_index(&m, B, j, 64)  = 922 secs
get_smk_index(&m, B, j, prec)    = 968 secs

with exactly the same output.

p15-git-acc commented 3 years ago

It should be possible to save the whole 130,000 seconds, but I'll wait for the memory related PR to be merged first so there are no editing conflicts in that part of the code.

Or is the generic prec required

No, a generic prec is (over)used throughout the code because of a combination of laziness, simplicity, and not wanting to prematurely optimize the precs. For optimal performance, specialized precs would be used everywhere. The prec adjustments with the highest priority for me would be 1) adding about log2(height) to the working precision of the many calculations that involve variables of the same order of magnitude as the height (or preferably avoiding those calculations altogether), and 2) reducing the prec of the calculations of many error terms so that the error of the error has a similar magnitude as the error itself.
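A rough worked example of that first point (my own numbers, not taken from the code): a quantity of the same order as a height t0 ~ 10^18 already occupies about log2(10^18) ~ 60 bits, so at a working precision of 256 bits only roughly

    256 - log2(10^18) ~ 256 - 60 = 196

bits of relative accuracy are left for whatever is computed from it; adding ~log2(height) bits to the working precision for those calculations compensates for that loss.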

p15-git-acc commented 3 years ago

Maybe there is something to be gained from get_smk_index.

In https://github.com/fredrik-johansson/arb/pull/343 these calls are removed.

The k-loop consumes quite a bit of time, but I don't see any further improvements here.

The memory layout of the large table has maximally bad locality properties for building the table and maximally good properties for doing the convolutions. I was thinking that the time spent doing calculations with arb numbers dwarfs the potential savings from memory layout micromanagement but I could be wrong especially when the calculation is a single addmul and the table is large.

Edit: I've updated that PR to speed up the inner k loop by improving locality properties. It might look counterintuitive because it increases memory allocations and arb operation counts. The speedup is less than that obtained by removing get_smk_index from the j loop but I think it's more than just profiling measurement noise.

rudolph-git-acc commented 3 years ago

Great! I will test these updates later as well.

In the meantime I have tested the multi-threaded version that was merged yesterday. It all works very well: it is clearly faster and memory stays flat even with 64 threads. I did observe that while the S_table is being created all 64 threads are 100% occupied, whereas during the DFT stage the threads are loaded at 20-25% max.

Here is some timing data that could serve as a benchmark to assess the impact of the next improvements:

150,000 zeros requested at prec: 256 using 64 threads/32 cores:

                                                                                        relative acc (bits)
n      #found      cpu/wall(s)         cpu/wall(s)(#343)       cpu/wall(s)(#344)      lowest  median  highest
10^4    1707    206.106   50.029    205.179 50.17       204.431 49.880      18  168 221
10^5    1998    207.804   53.328    207.607 53.117      207.836 52.695      20  169 222
10^6    2566    216.147   60.307    216.017 60.259      215.637 59.875      22  170 222
10^7    3299    227.215   71.117    225.565 69.343      224.807 69.171      27  171 223
10^8    3863    228.710   78.292    227.257 76.537      226.183 75.418      32  173 223
10^9    4511    236.044   90.195    233.574 87.890      233.553 87.785      33  172 222
10^10   5181    236.351   97.658    236.013 97.460      235.535 97.702      36  173 223
10^11   5908    241.658  108.020    240.396 106.817     240.111 106.696     39  174 223
10^12   6447    242.591  116.951    242.026 115.873     239.780 114.607     54  177 223
10^13   7360    254.234  135.499    250.901 131.814     247.840 129.872     51  176 223
10^14   7914    274.114  161.178    269.177 155.745     260.182 147.612     51  164 206
10^15   11486   1690.26  500.028    1644.49 476.452     1610.49 435.145     53  174 218
10^16   12417   2462.61  365.168    2399.89 361.915     2142.77 357.011     56  175 219
10^17   13258   4468.32  413.738    4224.52 406.065     3426.49 391.563     61  177 219
10^18   14182   10722.5  543.366    10011.9 530.312     7408.85 486.227     64  177 220
10^19   15034   31307.3  877.214    29522.6 850.245     21082.4 718.366     68  178 219
10^20   15985   95085.4 1900.850    87409.8 1784.41     62989.7 1399.54     71  178 218
10^21   16866   285268  4902.740    259813  4494        189136  3401.75     74  178 217
10^22   17804   850591  13805       774321  12627.2     567905  9405.04     76  179 217
10^23   18565               2344050 37574.3     1728240 27775.1     83  179 217

The memory layout of the large table has maximally bad locality properties for building the table and maximally good properties for doing the convolutions.

With each incremental power of 10, we currently see a rapidly increasing time for the S_table (x3 per order of magnitude), a slightly decreasing time for the do_convolution stage (-10-20 secs per order of magnitude) and a slowly increasing time for the isolate_zeros step (+10-20 secs per order of magnitude). The trade-offs indeed seem to be between the first two stages.

Two further thoughts on the S_table (a small illustration of the first one follows below):

  1. Probably useless for timings, but one could be very puristic and use acb_mul_arb(accum + k, ...) when acb_is_zero(accum + k), and otherwise use acb_addmul_arb(accum + k, ...). It could be that this additional if even slows things down, though.
  2. Since J can't be lowered (it has to be ~sqrt(n)), is there any way to reduce K at the expense of some other parameters?
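That first thought in code (just a sketch; src and w are placeholder names for whatever is being accumulated into the table):

#include "acb.h"

/* accum[k] += src * w, but use a plain mul while accum[k] is still zero */
static void
accum_addmul(acb_ptr accum, slong k, const acb_t src, const arb_t w, slong prec)
{
    if (acb_is_zero(accum + k))
        acb_mul_arb(accum + k, src, w, prec);
    else
        acb_addmul_arb(accum + k, src, w, prec);
}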

fredrik-johansson commented 3 years ago

Is it possible to rephrase the inner loops in terms of arb_dot/acb_dot? That could give a significant speedup.

p15-git-acc commented 3 years ago

Is it possible to rephrase the inner loops in terms of arb_dot/acb_dot?

Yes, I think this could be done by breaking the j loop into blocks of max length b where m is constant within the block, so that each thread stores an extra (K, b) real matrix and an extra (b, 1) complex matrix with the idea of multiplying them together. A function like acb_dot_arb would be nice.

Edit: implemented in https://github.com/fredrik-johansson/arb/pull/344
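A sketch of how such a block could be folded into dot products (my own illustration of the idea, not the #344 code; M is the (K, b) real block stored row-major and z is the length-b complex block):

#include "acb.h"

/* table[k] += sum_j M[k*b + j] * z[j] for k = 0..K-1, using one acb_dot
 * call per k after promoting the real row to an acb vector. */
static void
block_update(acb_ptr table, arb_srcptr M, acb_srcptr z,
        slong K, slong b, slong prec)
{
    slong k, j;
    acb_ptr row = _acb_vec_init(b);
    for (k = 0; k < K; k++)
    {
        for (j = 0; j < b; j++)
            acb_set_arb(row + j, M + k * b + j);
        acb_dot(table + k, table + k, 0, row, 1, z, 1, b, prec);
    }
    _acb_vec_clear(row, b);
}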

p15-git-acc commented 3 years ago

whereas during the DFT-stage threads are loaded 20-25% max.

That's pretty bad. And I tried to be so careful that each thread gets exactly the same number of arb calculations in the dft multithreading :(

p15-git-acc commented 3 years ago

is there any way to reduce K at the expense of some other parameters?

Yes. The parameter tuning is still extremely rough, and the tuning script has not accounted for execution speed at all, but rather only for equalizing estimated error terms. In particular the way K is tuned right now, the tuning script tries to find K such that the estimated K truncation error matches the estimated Shannon-Whittaker interpolation error which in turn is determined partly by the A and B parameters that determine the shape of the grid. If the script has to double K to decrease the error to match, it will do it even if it only yields a few or even no extra isolated zeros and doubles the execution time. I would not be surprised if changing the combination of parameter values gives 10x or 100x more isolated zeros per cpu usage in the range of n that we have been checking.

rudolph-git-acc commented 3 years ago

Updated the table here with the results from testing #343. Now also included height n=10^23 and the relative accuracies. Will carry out a similar test for #344 after its merge is complete. EDIT: the test of #344 is complete; the benefit grows to >25% with height.

(...) if changing the combination of parameter values gives 10x or 100x more isolated zeros per cpu usage in the range of n that we have been checking.

That would be truly amazing!

With J depending quite strictly on the height n and the J-loop code nearing its optimal shape, it looks like parameter K is the main lever left to pull to further reduce run time (at greater heights). A quick test (using the #343 code) indicates that halving K would improve the run time of the J-loop by another ~40% (but probably at the cost of increasing the time of the other stages and/or reducing the number of zeros isolated). A few run times (in secs) for _platt_smk only, prec=256, 1 thread:

   n       secs (K)    secs (K/2)     parameters
10^14     23 (40)     14 (20)     A=8,  B=4096, J=2579682,   t0=22514484222485
10^16    364 (74)    212 (37)     A=16, B=8192, J=24435419,  t0=1941393531395154
10^18   3216 (70)   1844 (35)     A=16, B=8192, J=226784107, t0=170553583898990072
p15-git-acc commented 3 years ago

A quick test (using the #343 code) indicates that halving K would improve the run time of the J-loop by another ~40% (but probably at the cost of increasing the time of the other stages and/or reducing the number of zeros isolated).

This is interesting because it means that by running the j loop twice, each time with a different k range, the peak memory use of the program could be cut in half at the cost of increasing the cpu usage by about 20% (halving K cuts the loop time by ~40%, so two passes cost about 2 x 60% = 120% of the original). It could be a useful trade-off because, for example, in Platt's thesis: "The value of N was the largest power of 2 that allowed the process to run in < 1 Gbyte of memory as this allowed us to make best use of the multi-core CPUs we had available to us." Well, now that I think about it, the second range of k might not be as efficient as the first range, so maybe it would increase the cpu usage by more than 20%.

fredrik-johansson commented 3 years ago

It would be great if you could add a zeta zeros program to the examples directory so that others can play with this easily.

p15-git-acc commented 3 years ago

This leaves open my suggestion in #320 to let the A=16-heuristics kick in at n=10^16 instead of at n=10^15. Any thoughts on that?

In https://github.com/fredrik-johansson/arb/pull/346 I've made this change. Both the A=8 and A=16 models are now interpolated in regions that overlap between n=1e7 and n=5e22. This has changed which parameter values are used for each height according to the tuning script, but I haven't actually run the local hardy z zeros function for heights that aren't covered by the unit tests. I'm curious how the local hardy z function behaves for each model in this interval, and I wonder where the best place to switch between models is, in terms of the number of zeros per cpu cost.

It would be great if you could add a zeta zeros program to the examples directory so that others can play with this easily.

That PR also introduces tested wrapper functions which could perhaps be used by an example program in the future.

rudolph-git-acc commented 3 years ago

I'm curious how the local hardy z function behaves for each model in this interval, and I wonder where the best place to switch between models is, in terms of the number of zeros per cpu cost.

I will run 10^7-10^22 for both models (A=8, B=4096 and A=16, B=8192), requesting 40,000 and 150,000 zeros respectively, and using prec=256. I will also include the median relative accuracy of the zeros isolated.

In #346 I've made this change.

A small comment on the pull requests: in the documentation of acb_dirichlet_platt_zeta_zeros it could be helpful to state explicitly that the user has the option of multi-threading through flint_set_num_threads(numthreads);. Maybe this line could also be included in the wrappers (once they land in the example directory).
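A minimal usage sketch of that (the zero index and count are arbitrary placeholders; the local zeros function name is taken from earlier in this thread):

#include "acb_dirichlet.h"

int main(void)
{
    fmpz_t n;
    arb_ptr res = _arb_vec_init(100);
    flint_set_num_threads(8);        /* opt in to multi-threading */
    fmpz_init_set_ui(n, 10000);      /* start at the 10^4-th zero */
    acb_dirichlet_platt_local_hardy_z_zeros(res, n, 100, 256);
    _arb_vec_clear(res, 100);
    fmpz_clear(n);
    flint_cleanup();
    return 0;
}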

And I tried to be so careful that each thread gets exactly the same number of arb calculations in the dft multithreading

The DFT multi-threading code must have been done very elegantly -- just watch how beautifully the threads get intertwined :-)

[Screenshot from 2020-09-27 23-13-01: thread activity during the run]

p15-git-acc commented 3 years ago

A small comment on the pull requests: in the documentation of acb_dirichlet_platt_zeta_zeros it could be helpful to state explicitly that the user has the option of multi-threading through flint_set_num_threads(numthreads);. Maybe this line could also be included in the wrappers (once they land in the example directory).

The PR is now updated to mention multi-threading in the zeta zeros docs.

rudolph-git-acc commented 3 years ago

Here is the data for the A=8 and A=16 models across the range 1e7..1e22. I also included 1e4..1e6.

Initial observations:

  1. 1e4 yields dramatically fewer zeros.
  2. There is a sudden run time improvement at 1e16, which seems to be caused by the multi-threading for the S_table kicking in at J > 10 million. Height 1e15 has J=8,047,580, hence is excluded from multi-threading. Maybe relaxing the constraint on J a bit would result in more steadily increasing timings?
  3. Purely considering CPU-time / #zeros_found, the A=8 model seems to beat A=16 up to 1e20, however when taking the median relative accuracy into account as well, the A=16 model starts to win from 1e18 upwards.
  4. The run times remain more or less the same when compared to version #344, however for model A=16 at heights >1e19 they become increasingly worse.
p15-git-acc commented 3 years ago

Thanks for the data!

1e4 yields dramatically fewer zeros.

Yes, my reasoning is that the multieval strategy is probably not practical at such low n (compared to direct evaluation of zeta), but if it's fast, then at least it can be used for unit tests. So if it finds only 1% as many zeros but runs in 2% of the time, thereby bringing the run time down from a minute to a second, then I think it's worth it.

Maybe relaxing the constraint on J a bit would result in more steadily increasing timings?

In the PR I've now completely removed the constraint. The main reason for the limit was to prevent unnecessary memory usage, but that reason no longer applies now that the memory overhead of threads has been reduced.

Purely considering CPU-time / #zeros_found, the A=8 model seems to beat A=16 up to 1e20, however when taking the median relative accuracy into account as well, the A=16 model starts to win from 1e18 upwards.

In the PR I've moved the threshold from 2e15 up to 2e17.

The run times remain more or less the same when compared to version #344, however for model A=16 at heights >1e19 they become increasingly worse.

I've left this alone for now.

rudolph-git-acc commented 3 years ago

In the PR I've now completely removed the constraint. (..) In the PR I've moved the threshold from 2e15 up to 2e17

Great, after these changes it looks like this (prec=256, 64 threads/32 cores):

                                        relative acc (bits)
10^x    #found   cpu(s)  wall(s)      lowest  median  highest
4   24    0.335    0.296    24  58  69  
5   1998    203.272   53.112    20  169 222 
6   2566    211.309   59.422    22  170 222 
7   3331    221.839   69.719    27  171 223 
8   3851    222.739   74.851    33  173 223 
9   4495    225.625   86.052    32  172 222 
10  5153    229.959   95.071    37  173 223 
11  5924    236.096  104.810    38  174 223 
12  6665    240.604  113.533    43  175 223 
13  7360    250.961  122.550    51  175 221 
14  7862    277.876  126.116    50  163 204 
15  8251    350.273  129.422    56  152 189 
16  8494    575.500  137.822    60  141 173 
17  8746    1220.52  148.359    63  130 157 
18  14182   7601.58  492.587    64  177 220 
19  15034   21885.4  733.784    66  177 219 
20  15985   65780.2  1446.51    71  178 218 
21  16866   195114   3510.46    74  178 217 
22  17804   582717   9641.12    76  179 217

The unconstrained multi-threading yields some additional benefits, and the kick-in point of the A=16 model now looks right.

rudolph-git-acc commented 3 years ago

(...) In particular the way K is tuned right now, the tuning script tries to find K such that the estimated K truncation error matches the estimated Shannon-Whittaker interpolation error which in turn is determined partly by the A and B parameters that determine the shape of the grid. If the script has to double K to decrease the error to match, it will do it even if it only yields a few or even no extra isolated zeros and doubles the execution time. (...)

Your conjecture appears to be strongly supported by the data. I ran the latest version (#346) with K=K/2. This doesn't yield any significant benefits for the A=8 model, but when applied to the A=16 model (>1e17) it induces a significant improvement. There is a relatively small drop in the number of zeros isolated as well as in their accuracies, but this is compensated by a step-change improvement in run times (up to 40%). I also tried K=K/4, but that is clearly a step too far and would halve the number of zeros isolated.

Not sure how far you'd like to stretch the parameter tuning, but this one feels like 'low hanging fruit' for the A=16 model.

[Kdiv2stats: table comparing the K and K/2 runs]

*) I took the top line of the 1e23 data from here (PR #344). Since there is a slight increase in timings at these heights after #346 and I didn't rerun 1e23 with the #346 code, I augmented these particular run times by 3% (in the same ballpark as the increase observed at 1e22).

p15-git-acc commented 3 years ago

At these large heights, where A=16 is better than A=8 and the J-loop cpu usage dominates the convolution cpu usage, I think that further increasing (doubling) B would probably also greatly increase the number of zeros per cpu cost, but it would probably require re-tuning the other parameters and would increase the peak memory usage. I haven't checked this though.

p15-git-acc commented 3 years ago

I'm closing this because the issue in the title has been solved.