Did some further testing on this, and the data below confirms that multi-threading doesn't seem to yield any benefits below 10^16. Surprisingly, at these lower heights it doesn't seem to have any adverse implications either, on CPU time or memory needs.
For the CPU time, I think the explanation is that building the S_table at these lower heights is just a very small fraction of the overall CPU time required (J and A, B, K are all relatively small).
Even at 10^15, memory claims for a single thread or for 64 threads are the same (64-threaded is only 7 secs faster). The only explanation I can think of is that processes other than building the S_table drive the memory statistics below. Only when the A=16 heuristics kick in (currently at 10^15) does the total memory consumption of the S_table exceed the claims from the other processes, and its memory needs start to dominate.
Since there don't seem to be any benefits nor adverse implications for a user accidentally running lower heights with too many threads, I wonder whether a threshold actually needs to be hard coded.
This leaves open my suggestion in #320 to let the A=16 heuristics kick in at n=10^16 instead of at n=10^15. Any thoughts on that?
threads height prec #req cpu(s) wall(s) virt virt-peak res res-peak (MB)
1 10^4 256 1000 83.819 83.820 246.66 672.71 226.54 652.48
64 10^4 256 1000 83.699 83.700 246.66 672.71 226.62 652.60
1 10^5 256 1000 82.501 82.501 280.37 672.71 260.42 652.48
64 10^5 256 1000 84.449 84.450 280.37 672.71 260.37 652.60
1 10^6 256 1000 82.860 82.862 280.37 672.71 260.42 652.48
64 10^6 256 1000 82.109 82.110 280.37 672.71 260.37 652.60
1 10^7 256 1000 82.658 82.659 280.37 672.71 260.42 652.48
64 10^7 256 1000 82.702 82.704 280.37 672.71 260.37 652.60
1 10^8 256 1000 81.022 81.023 280.37 672.71 260.42 652.48
64 10^8 256 1000 80.902 80.903 280.37 672.71 260.37 652.60
1 10^9 256 1000 80.018 80.019 267.88 672.71 247.95 652.48
64 10^9 256 1000 79.399 79.400 267.88 672.71 247.90 652.60
1 10^10 256 1000 76.946 76.947 267.88 672.71 247.95 652.48
64 10^10 256 1000 76.024 76.026 267.88 672.71 247.90 652.60
1 10^11 256 1000 75.951 75.952 267.88 672.71 247.95 652.48
64 10^11 256 1000 75.115 75.116 267.88 672.71 247.90 652.60
1 10^12 256 1000 74.821 74.821 251.23 672.71 231.30 652.48
64 10^12 256 1000 73.749 73.751 251.23 672.71 231.25 652.60
1 10^13 256 1000 78.731 78.732 251.23 672.71 231.30 652.48
64 10^13 256 1000 78.050 78.050 251.23 672.71 231.25 652.60
1 10^14 256 1000 98.647 98.648 251.23 672.71 231.30 652.48
64 10^14 256 1000 98.285 98.287 251.23 672.71 231.25 652.60
1 10^15 256 1000 729.304 729.312 1077.56 3045.38 1057.51 3025.27 (A=16, K=76)
64 10^15 256 1000 722.491 722.502 1077.56 3045.38 1057.46 3025.40
1 10^16 256 1000 1066.24 1066.21 1060.68 3004.51 1040.27 2984.18 (A=16, K=74)
64 10^16 256 1000 2038.12 605.065 5188.67 62425.66 1137.45 57888.95
1 10^17 256 1000 2058.28 2058.34 1067.74 2963.58 1047.46 2943.36 (A=16, K=72)
64 10^17 256 1000 4490.28 606.213 5259.71 60865.66 1151.52 56343.93
1 10^18 256 1000 5127.71 5127.76 1074.13 2922.18 1053.92 2901.79 (A=16, K=70)
64 10^18 256 1000 13246.2 724.053 5434.19 60865.66 1352.93 56343.93
I haven't done anything for this issue yet, but I thought I'd mention a couple of things related to creating/interpreting these tables.
One is related to the number of requested zeros. The local zeros algorithm roughly progresses through three stages:
1) make the table that requires J iterations
2) dft
3) turing's method
I haven't profiled the code in detail, but I'm pretty sure that (2) dominates run time at low height and (1) dominates at greater heights. With the current heuristics system, the number of requested zeros only comes into play at step (3), if I remember correctly. Anyway my point is that if you request many more zeros (like A*B zeros) and note the number of zeros returned, then it will give more information for the table, and I don't think it should require much more run time. With this information the per-isolated-zero amortized cpu usage could be calculated.
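For instance (illustrative numbers only): amortized cost ≈ total cpu time / #zeros found, so a run that isolates ~10,000 zeros at a total cpu cost of ~1,000 seconds amortizes to ~0.1 cpu-seconds per isolated zero.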
The second thing I thought I should mention is that I think when you made the table, you were running a version of the code that already has a hard coded multithreading cutoff. This is why it looks like multi-threading isn't doing anything for low heights -- it's because it wasn't being used.
(...)you were running a version of the code that already has a hard coded multithreading cutoff.(...)
Yes, that explains it! I hadn't realised that this restriction was hard coded in the latest version of multieval.c:
if (flint_get_num_threads() > 1 && J > 10000000)
{
    acb_dirichlet_platt_multieval_threaded(
            out, T, A, B, h, J, K, sigma, prec);
}
Running the table again without this constraint yields the results below. At lower heights, multi-threading actually has an adverse effect (more memory needed and slower). The first benefits from threading become visible around 10^13, but come at the cost of considerable memory claims. Since the CPU wall-times are quite low at these heights, the constraint J > 10 million actually does look reasonable (unless a user wants to compute many batches of 1000 zeros at 10^14; then a ~30% speed gain will not be accessible).
I will try to do some profiling on the three stages you mentioned in your post.
#zeros requested = 1000, prec = 256:
threads height (A B K J) #found cpu(s) wall(s) virt virt-peak res res-peak (MB)
1 10^4 A: 8 B: 4096 K: 64 J: 66 1000 84.094 84.096 246.66 672.70 226.56 652.48
64 10^4 A: 8 B: 4096 K: 64 J: 66 1000 89.747 89.731 1655.99 14232.47 1092.89 13189.96
1 10^5 A: 8 B: 4096 K: 61 J: 173 1000 82.209 82.209 280.24 672.70 260.20 652.48
64 10^5 A: 8 B: 4096 K: 61 J: 173 1000 88.192 88.172 1783.99 14232.47 1093.23 13189.96
1 10^6 A: 8 B: 4096 K: 59 J: 477 1000 84.895 84.897 280.24 672.70 260.20 652.48
64 10^6 A: 8 B: 4096 K: 59 J: 477 1000 88.160 88.120 2231.99 14232.47 1095.13 13189.96
1 10^7 A: 8 B: 4096 K: 57 J: 1331 1000 82.813 82.814 280.24 672.70 260.20 652.48
64 10^7 A: 8 B: 4096 K: 57 J: 1331 1000 87.733 87.641 2540.19 14232.47 1022.57 13189.96
1 10^8 A: 8 B: 4096 K: 55 J: 3842 1000 79.647 79.647 280.24 672.70 260.20 652.48
64 10^8 A: 8 B: 4096 K: 55 J: 3842 1000 85.641 85.428 3756.19 14466.67 1026.29 13189.96
1 10^9 A: 8 B: 4096 K: 53 J: 11196 1000 79.361 79.361 267.65 672.70 247.62 652.48
64 10^9 A: 8 B: 4096 K: 53 J: 11196 1000 84.960 84.399 4908.19 15127.65 1032.13 13189.96
1 10^10 A: 8 B: 4096 K: 50 J: 32887 1000 76.459 76.458 261.28 672.70 241.24 652.48
64 10^10 A: 8 B: 4096 K: 50 J: 32887 1000 82.844 81.202 5164.19 15127.65 1038.04 13189.96
1 10^11 A: 8 B: 4096 K: 48 J: 97207 1000 75.185 75.183 261.28 672.70 241.24 652.48
64 10^11 A: 8 B: 4096 K: 48 J: 97207 1000 82.755 78.383 5164.19 15127.65 1046.16 13189.96
1 10^12 A: 8 B: 4096 K: 45 J: 288793 1000 74.382 74.382 261.28 672.70 241.24 652.48
64 10^12 A: 8 B: 4096 K: 45 J: 288793 1000 87.244 75.156 5164.19 15127.65 1056.92 13189.96
1 10^13 A: 8 B: 4096 K: 42 J: 861606 1000 78.452 78.451 249.12 672.70 229.09 652.48
64 10^13 A: 8 B: 4096 K: 42 J: 861606 1000 103.912 70.913 5164.19 15127.65 1061.01 13189.96
1 10^14 A: 8 B: 4096 K: 40 J: 2579682 1000 99.054 99.053 249.12 672.70 229.09 652.48
64 10^14 A: 8 B: 4096 K: 40 J: 2579682 1000 162.040 69.185 5164.19 15127.65 1075.24 13189.96
1 10^15 A: 16 B: 8192 K: 76 J: 8047580 1000 739.403 739.402 1065.14 3044.97 1044.98 3024.94
64 10^15 A: 16 B: 8192 K: 76 J: 8047580 1000 1101.740 596.905 5349.18 64924.68 1331.13 60427.45
1 10^16 A: 16 B: 8192 K: 74 J: 24435419 1000 1073.620 1073.620 1228.68 3044.97 1208.52 3024.94
64 10^16 A: 16 B: 8192 K: 74 J: 24435419 1000 2071.190 597.879 5339.47 64924.68 1342.18 60427.45
1 10^17 A: 16 B: 8192 K: 72 J: 74366208 1000 2132.860 2132.850 1111.97 3046.44 1091.94 3026.09
64 10^17 A: 16 B: 8192 K: 72 J: 74366208 1000 4904.930 628.124 5311.86 64924.68 1332.42 60427.45
Besides requesting at least A*B zeros (say 150000 of them instead of 1000) and showing the number of returned zeros in the table so that amortized cost can be calculated, the median mag_rad or rel_accuracy of the isolated zeros would also be useful. These would be especially helpful in the future if different parameter settings are compared for the same height.
I think it should also be possible to slightly irregularly tile the J loop so that each of n threads does close to J/n iterations and each thread writes only to its own section of the table. This would mean that the per-thread temporary tables would not be required, so multi-threading wouldn't have any memory premium. Less importantly it also occurs to me that the larger j iterations might take longer than the smaller j iterations, so maybe trying to give each thread approximately J/n iterations would not be ideal in terms of balancing the calculations among threads. Another thing is that I didn't try to micro optimize the j loop itself because I profiled mainly at lower heights where that loop is not the limiting factor.
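As a rough sketch of that tiling idea (my own illustration with hypothetical names such as smk_block_args and smk_block_worker, not the actual arb internals or the code from the PR in the edit below): each thread gets one contiguous block of j-values, and the block boundaries would be chosen so that no two blocks map to the same table column m, so each thread writes only to its own slice of the shared table.

#include <pthread.h>
#include "flint/flint.h"

typedef struct
{
    slong jstart;   /* first j handled by this thread (inclusive) */
    slong jstop;    /* last j handled by this thread (inclusive) */
    /* ... pointers to the shared table and the parameters A, B, K, prec ... */
} smk_block_args;

/* Each worker runs its own j block; since the blocks map to disjoint table
   columns, no per-thread temporary tables and no locking are needed. */
static void * smk_block_worker(void * arg_)
{
    smk_block_args * arg = (smk_block_args *) arg_;
    slong j;
    for (j = arg->jstart; j <= arg->jstop; j++)
    {
        /* compute the contribution of term j and add it into this
           thread's own section of the shared table */
    }
    return NULL;
}

/* Partition [1, J] into nthreads nearly equal blocks; in a real version the
   boundaries would additionally be snapped to positions where the table
   index m changes, so the blocks never share a column. */
void run_tiled(slong J, slong nthreads)
{
    pthread_t * threads = flint_malloc(nthreads * sizeof(pthread_t));
    smk_block_args * args = flint_malloc(nthreads * sizeof(smk_block_args));
    slong i;
    for (i = 0; i < nthreads; i++)
    {
        args[i].jstart = 1 + (i * J) / nthreads;
        args[i].jstop = ((i + 1) * J) / nthreads;
        pthread_create(&threads[i], NULL, smk_block_worker, &args[i]);
    }
    for (i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);
    flint_free(threads);
    flint_free(args);
}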
Edit: https://github.com/fredrik-johansson/arb/pull/339 removes the multi-threading memory overhead
#339 removes the multi-threading memory overhead
Wow! That is a step change forward, since a larger number of threads will no longer induce a memory constraint. Very nice :-) I'll test it later today.
Besides requesting at least A*B zeros (say 150000 of them instead of 1000) and showing the number of returned zeros in the table so that amortized cost can be calculated, the median mag_rad or rel_accuracy of the isolated zeros would also be useful.
Below are the first results. Let me know if other data is required. I used arb_rel_accuracy_bits as the criterion to sort the full list returned by acb_dirichlet_platt_local_hardy_z_zeros and then computed the lowest, median and highest relative accuracy of the actual zeros found.
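For reference, here is a minimal sketch of how such statistics can be computed (my own illustrative code, not the script actually used; accuracy_stats is a hypothetical helper taking the vector returned by the zeros function):

#include <stdlib.h>
#include "arb.h"

static int cmp_slong(const void * a, const void * b)
{
    slong x = *(const slong *) a, y = *(const slong *) b;
    return (x > y) - (x < y);
}

/* Report lowest/median/highest relative accuracy of nfound isolated zeros. */
void accuracy_stats(arb_srcptr zeros, slong nfound)
{
    slong i;
    slong * acc = flint_malloc(nfound * sizeof(slong));
    for (i = 0; i < nfound; i++)
        acc[i] = arb_rel_accuracy_bits(zeros + i);
    qsort(acc, nfound, sizeof(slong), cmp_slong);
    flint_printf("lowest %wd  median %wd  highest %wd\n",
                 acc[0], acc[nfound / 2], acc[nfound - 1]);
    flint_free(acc);
}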
150000 zeros requested at prec: 128 using 1 thread(s)
relative acc (bits)
height A B K J #found cpu(s) wall(s) lowest median highest
10^4 8 4096 64 66 1010 54.33 54.345 19 72 93
10^5 8 4096 61 173 1184 55.265 55.266 19 74 94
10^6 8 4096 59 477 1475 57.677 57.678 24 75 94
10^7 8 4096 57 1331 1867 59.884 59.871 27 76 95
10^8 8 4096 55 3842 2158 61.377 61.369 31 76 95
10^9 8 4096 53 11196 2464 62.9 62.895 34 77 94
10^10 8 4096 50 32887 2731 63.718 63.715 37 78 95
10^11 8 4096 48 97207 2963 65.524 65.522 43 79 95
10^12 8 4096 45 288793 3089 66.374 66.373 50 81 95
10^13 8 4096 42 861606 3654 78.093 78.098 47 81 96
10^14 8 4096 40 2579682 3811 101.978 101.982 52 81 95
10^15 16 8192 76 8047580 4670 503.222 503.228 57 79 90
10^16 16 8192 74 24435419 4978 827.078 827.078 58 79 91
10^17 16 8192 72 74366208 4926 1814.42 1814.41 59 81 89
10^18 16 8192 70 226784107 4393 4581.09 4581.15 66 82 87
10^19 16 8192 69 692836557 4322 12665.5 12665.7 69 82 87
150000 zeros requested at prec: 256 using 1 thread(s)
10^4 8 4096 64 66 1707 103.801 103.804 18 168 221
10^5 8 4096 61 173 1998 105.361 105.364 20 169 222
10^6 8 4096 59 477 2566 111.899 111.902 22 170 222
10^7 8 4096 57 1331 3299 120.927 120.929 27 171 223
10^8 8 4096 55 3842 3863 127.581 127.584 32 173 223
10^9 8 4096 53 11196 4511 138.705 138.708 33 172 222
10^10 8 4096 50 32887 5181 140.131 140.134 36 173 223
10^11 8 4096 48 97207 5908 153.058 153.061 39 174 223
10^12 8 4096 45 288793 6447 153.306 153.283 54 177 223
10^13 8 4096 42 861606 7360 177.858 177.849 51 176 223
10^14 8 4096 40 2579682 7914 204.473 204.469 51 164 206
10^15 16 8192 76 8047580 11486 1170.54 1170.56 53 174 218
10^16 16 8192 74 24435419 12417 1652.78 1652.77 56 175 219
10^17 16 8192 72 74366208 13258 2998.9 2998.91 61 177 219
10^18 16 8192 70 226784107 14182 5690.92 5691.11 64 177 220
10^19 16 8192 69 692836557 15034 15764.1 15764.3 68 178 219
150000 zeros requested at prec: 512 using 1 thread(s)
10^4 8 4096 64 66 2263 161.528 161.532 17 279 370
10^5 8 4096 61 173 2644 159.274 159.275 21 281 371
10^6 8 4096 59 477 3384 174.738 174.716 22 278 366
10^7 8 4096 57 1331 4034 187.525 187.516 32 246 322
10^8 8 4096 55 3842 4663 199.424 199.418 31 234 305
10^9 8 4096 53 11196 5216 211.912 211.92 37 223 288
10^10 8 4096 50 32887 5842 223.743 223.748 37 211 272
10^11 8 4096 48 97207 6381 235.218 235.22 40 199 256
10^12 8 4096 45 288793 6929 245.364 245.364 45 187 239
10^13 8 4096 42 861606 7470 259.342 259.341 47 176 223
10^14 8 4096 40 2579682 7914 299.272 299.271 51 164 206
10^15 16 8192 76 8047580 18663 1612.79 1612.78 53 366 474
10^16 16 8192 74 24435419 19966 2302.38 2302.39 61 363 468
10^17 16 8192 72 74366208 21290 4118.32 4118.46 63 353 454
10^18 16 8192 70 226784107 22510 9689.17 9689.25 63 345 443
10^19 16 8192 69 692836557 23784 28028.2 28029.5 66 336 430
One is related to the number of requested zeros. The local zeros algorithm roughly progresses through three stages:
- make the table that requires J iterations
- dft
- turing's method
I haven't profiled the code in detail, but I'm pretty sure that (2) dominates run time at low height and (1) dominates at greater heights. With the current heuristics system, the number of requested zeros only comes into play at step (3) if I remember correctly
Your intuition appears to be strongly reinforced by the data (updated with improved timing).
I time-profiled the stages and only report those detailed processes that have some impact on timing. Requesting 150000 zeros actually makes the isolate_zeros stage dominant from 10^9 to 10^14. The do_convolution process dominates from 10^4 to 10^9 and at 10^15, and from there on the S_table strongly and increasingly dominates (single thread).
The do_convolution step clearly consumes most time within the DFT stage. It seems quite stable though at the larger n and doesn't exceed 8-9 minutes (i.e. not a huge overhead compared to the forever-growing S_table). Behaviour could change at greater heights though, so I am checking 10^20-10^22 as well (now using 64 threads to make the generation of the S_table tractable).
EDIT: added 10^20-10^22 data. Surprisingly, timings for do_convolution are declining and the running time for isolate_zeros even dominates over it at 10^22.
Thanks for generating that profiling data. I was surprised that increasing the number of requested zeros makes the S_table take longer to make.
I've submitted https://github.com/fredrik-johansson/arb/pull/340 which adds multi-threading to the dft. I don't know how it would affect the hardy z zeros function, but I assume it would speed it up at least a little bit in the ranges where the do_convolution step dominates.
I was surprised that increasing the number of requested zeros makes the S_table take longer to make.
This is very strange indeed, and comparing the separate runs in detail I spotted similar runtime differences for e.g. do_convolution. After rerunning a couple of n and comparing 1,000 with 150,000 zeros requested, the timings match up perfectly.
A possible cause for the differences between the two runs is that in parallel to the 150,000 zero run, I launched another 3 threads to produce the relative accuracy data. On a 32 core machine these are not supposed to interfere with each other, however you never know how the virtual threading/swapping is managed exactly.
I'll rerun the time profiling overnight as the only task on the machine and report back. EDIT: yes, this solves the problem. Timing data looks consistent now. Will also rerun the relative accuracy data above in single batches.
in parallel to the 150,000 zero run, I launched another 3 threads to produce the relative accuracy data in parallel. On a 32 core machine, these are not supposed to interfere with each other
Yeah I don't know that much about computer architecture but I think if a lot of memory is being accessed by the threads outside of the per-core memory cache then it can cause a slowdown, even if the threads are not accessing the same data.
In the spirit of squeezing more seconds out of the dominant S_table, I did some time measurements within the _platt_smk routine. The k-loop consumes quite a bit of time, but I don't see any further improvements there.
Maybe there is something to be gained from get_smk_index. This routine only takes B and j (I'll discuss prec later) as inputs and is called for each j. However, its output m remains unchanged for increasingly long stretches as j increases. Here is a short table to illustrate this phenomenon.
# consecutive j with m unchanged
j B=4096 B=8192
10^3 1 1
10^4 15 7
10^5 153 77
10^6 1531 766
10^7 15305 7664
10^8 153271 76636
10^9 1532587 766294
The benefit unfortunately halves when B doubles, but it still seems there is a possible gain to be reaped here. For instance, when j runs from 10^10 to 10^11, get_smk_index only needs to be called 9x10^10/7660000 ≈ 11750 times instead of 9x10^10 times. And this benefit would increase with height.
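A possible shape for that, as a fragment reusing the names Jstart, Jstop, B and prec from _platt_smk (purely illustrative; it assumes m = get_smk_index(...) is non-decreasing in j):

slong j = Jstart;
while (j <= Jstop)
{
    slong m_lo, m_hi, block_stop;

    /* try a block whose length grows with j, guided by the table above;
       the fraction 1/(10*B) is just a conservative illustrative choice */
    block_stop = FLINT_MIN(Jstop, j + FLINT_MAX(1, j / (10 * B)));
    get_smk_index(&m_lo, B, j, prec);
    get_smk_index(&m_hi, B, block_stop, prec);
    if (m_lo != m_hi)
        block_stop = j;   /* near a boundary of m: fall back to a single j */

    for (; j <= block_stop; j++)
    {
        /* ... accumulate the contribution of term j using m_lo ... */
    }
}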
A thought about prec. When I reduce _platt_smk to:
void
_platt_smk(acb_ptr table, const arb_t t0, slong A, slong B,
        slong Jstart, slong Jstop, slong K, slong prec)
{
    slong j, m;
    for (j = Jstart; j <= Jstop; j++)
    {
        get_smk_index(&m, B, j, prec);
    }
}
and run j from 1 .. 10^k (=J) with B=8192, I get (time in secs):
k prec=64 prec=128 prec=256
6 0.58 0.73 1.31
7 5.62 7.28 13.31
8 56.7 72.7 130.9
9 573.1 730.4 1313.9
Since get_smk_index has an internal 'prec augmenter', starting with a lower prec=64 when calling it removes the possible 'overkill' from using the general prec setting (which is at least 256+ at larger heights). Or is the generic prec required to compute logjsqrtpi(x, j, prec) for higher j?
Extrapolating the prec=64 data towards k=11 (n ~ 10^22) by increasing the time by a factor 10 for each order, at that level I get a potential saving of 130,000 secs (prec=256) - 57,000 secs (prec=64) = 73,000 secs.
A test run of acb_dirichlet_platt_local_hardy_z_zeros at n=10^16, J = 10^8, 150,000 zeros, prec=256 yields:
922 secs using get_smk_index(&m, B, j, 64)
968 secs using get_smk_index(&m, B, j, prec)
with exactly the same output.
It should be possible to save the whole 130,000 seconds, but I'll wait for the memory related PR to be merged first so there are no editing conflicts in that part of the code.
Or is the generic prec required
No, a generic prec is used liberally throughout the code because of a combination of laziness, simplicity, and not wanting to prematurely optimize the precs. For optimal performance, specialized precs would be used everywhere. The prec adjustments that have highest priority for me would be 1) add about log2(height) to the working precision of many calculations that involve variables of the same order of magnitude as the height (or preferably avoid these calculations altogether), and 2) reduce the prec of calculations of many error terms so that the error of the error has a similar magnitude to the error itself.
Maybe there is something to be gained from get_smk_index.
In https://github.com/fredrik-johansson/arb/pull/343 these calls are removed.
The k-loop consumes quite a bit of time, but I don't see any further improvements here.
The memory layout of the large table has maximally bad locality properties for building the table and maximally good properties for doing the convolutions. I was thinking that the time spent doing calculations with arb numbers dwarfs the potential savings from memory layout micromanagement but I could be wrong especially when the calculation is a single addmul and the table is large.
Edit: I've updated that PR to speed up the inner k loop by improving locality properties. It might look counterintuitive because it increases memory allocations and arb operation counts. The speedup is less than that obtained by removing get_smk_index from the j loop but I think it's more than just profiling measurement noise.
Great! I will test these updates later as well.
In the meantime I have tested the multi-threaded version that was merged yesterday. It all works very well: it is clearly faster and memory stays flat even with 64 threads. I did observe that when the S_table is created, all 64 threads are 100% occupied, whereas during the DFT stage threads are loaded 20-25% max.
150,000 zeros requested at prec: 256 using 64 threads/32 cores:
relative acc (bits)
n #found cpu/wall(s) cpu/wall(s)(#343) cpu/wall(s)(#344) lowest median highest
10^4 1707 206.106 50.029 205.179 50.17 204.431 49.880 18 168 221
10^5 1998 207.804 53.328 207.607 53.117 207.836 52.695 20 169 222
10^6 2566 216.147 60.307 216.017 60.259 215.637 59.875 22 170 222
10^7 3299 227.215 71.117 225.565 69.343 224.807 69.171 27 171 223
10^8 3863 228.710 78.292 227.257 76.537 226.183 75.418 32 173 223
10^9 4511 236.044 90.195 233.574 87.890 233.553 87.785 33 172 222
10^10 5181 236.351 97.658 236.013 97.460 235.535 97.702 36 173 223
10^11 5908 241.658 108.020 240.396 106.817 240.111 106.696 39 174 223
10^12 6447 242.591 116.951 242.026 115.873 239.780 114.607 54 177 223
10^13 7360 254.234 135.499 250.901 131.814 247.840 129.872 51 176 223
10^14 7914 274.114 161.178 269.177 155.745 260.182 147.612 51 164 206
10^15 11486 1690.26 500.028 1644.49 476.452 1610.49 435.145 53 174 218
10^16 12417 2462.61 365.168 2399.89 361.915 2142.77 357.011 56 175 219
10^17 13258 4468.32 413.738 4224.52 406.065 3426.49 391.563 61 177 219
10^18 14182 10722.5 543.366 10011.9 530.312 7408.85 486.227 64 177 220
10^19 15034 31307.3 877.214 29522.6 850.245 21082.4 718.366 68 178 219
10^20 15985 95085.4 1900.850 87409.8 1784.41 62989.7 1399.54 71 178 218
10^21 16866 285268 4902.740 259813 4494 189136 3401.75 74 178 217
10^22 17804 850591 13805 774321 12627.2 567905 9405.04 76 179 217
10^23 18565 2344050 37574.3 1728240 27775.1 83 179 217
The memory layout of the large table has maximally bad locality properties for building the table and maximally good properties for doing the convolutions.
With incremental powers of 10, we currently see a rapidly increasing time for the S_table (x3 per order of magnitude), a slightly decreasing time for the do_convolution stage (-10-20 secs per order of magnitude) and a slowly increasing time for the isolate_zeros step (+10-20 secs per order of magnitude). The trade-offs indeed seem to be between the first two stages.
Two further thoughts on the S_table:
1) Probably useless for timings, but one could be very puristic and use acb_mul_arb(accum + k, ...) when acb_is_zero(accum + k) and otherwise use acb_addmul_arb(accum + k, ...) (a small sketch follows below this list). Could be that this additional if even slows things down though...
2) Since J can't be lowered (it has to be ~sqrt(n)), is there any way to reduce K at the expense of some other parameters?
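A small sketch of idea 1) (hypothetical names: accum, base and coeff stand for whatever the real accumulator, table entry and real factor are in the k-loop):

/* skip the addition when the accumulator is still zero; whether the extra
   branch pays for itself would have to be measured */
if (acb_is_zero(accum + k))
    acb_mul_arb(accum + k, base + k, coeff, prec);
else
    acb_addmul_arb(accum + k, base + k, coeff, prec);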
Is it possible to rephrase the inner loops in terms of arb_dot/acb_dot? That could give a significant speedup.
Is it possible to rephrase the inner loops in terms of arb_dot/acb_dot?
Yes, I think this could be done by breaking the j loop into blocks of max length b where m is constant within the block, so that each thread stores an extra (K, b) real matrix and an extra (b, 1) complex matrix with the idea of multiplying them together. A function like acb_dot_arb
would be nice.
Edit: implemented in https://github.com/fredrik-johansson/arb/pull/344
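A sketch of how such a block could be accumulated without a dedicated acb_dot_arb, by applying arb_dot to the real and imaginary parts separately (this works because one factor is purely real). This is my own illustration, not the code from PR #344; accumulate_block and its arguments are hypothetical:

#include "arb.h"
#include "acb.h"

void accumulate_block(acb_ptr table_col,   /* column m of the table, K entries   */
                      arb_srcptr coeffs,   /* K x b real coefficients, row-major */
                      acb_srcptr z,        /* b complex factors for this block   */
                      slong K, slong b, slong prec)
{
    slong k;
    arb_ptr zre = _arb_vec_init(b);
    arb_ptr zim = _arb_vec_init(b);
    arb_t re, im;
    arb_init(re);
    arb_init(im);

    for (k = 0; k < b; k++)
    {
        arb_set(zre + k, acb_realref(z + k));
        arb_set(zim + k, acb_imagref(z + k));
    }

    for (k = 0; k < K; k++)
    {
        /* table[k][m] += dot(row k of coeffs, z) */
        arb_dot(re, NULL, 0, coeffs + k * b, 1, zre, 1, b, prec);
        arb_dot(im, NULL, 0, coeffs + k * b, 1, zim, 1, b, prec);
        arb_add(acb_realref(table_col + k), acb_realref(table_col + k), re, prec);
        arb_add(acb_imagref(table_col + k), acb_imagref(table_col + k), im, prec);
    }

    arb_clear(re);
    arb_clear(im);
    _arb_vec_clear(zre, b);
    _arb_vec_clear(zim, b);
}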
whereas during the DFT-stage threads are loaded 20-25% max.
That's pretty bad. And I tried to be so careful that each thread gets exactly the same number of arb calculations in the dft multithreading :(
is there any way to reduce K at the expense of some other parameters?
Yes. The parameter tuning is still extremely rough and the tuning script has not accounted at all for execution speed, but rather only equalizing estimated error terms. In particular the way K is tuned right now, the tuning script tries to find K such that the estimated K truncation error matches the estimated Shannon-Whittaker interpolation error which in turn is determined partly by the A and B parameters that determine the shape of the grid. If the script has to double K to decrease the error to match, it will do it even if it only yields a few or even no extra isolated zeros and doubles the execution time. I would not be surprised if changing the combination of parameter values gives 10x or 100x more isolated zeros per cpu usage in the range of n that we have been checking.
Updated the table here with the results from testing #343. Now also included height n=10^23 and the relative accuracies. Will carry out a similar test for #344 after its merge is complete. EDIT: test of #344 is complete, the benefit grows to >25% with height.
(...) if changing the combination of parameter values gives 10x or 100x more isolated zeros per cpu usage in the range of n that we have been checking.
That would be truly amazing!
With J quite strictly depending on height n and the J-loop code nearing its optimal shape, it looks like parameter K is the main lever to pull to further reduce run time (at greater heights). A quick test (using the #343 code) indicates that halving K would improve the run time of the J-loop by another ~40% (but probably at the cost of increasing the time of the other stages and/or reducing the number of zeros isolated). A few run times (in secs) for _platt_smk only, prec=256, 1 thread:
n secs (K) secs (K/2) parameters
10^14 23 (40) 14 (20) A=8, B=4096, J=2579682, t0=22514484222485
10^16 364 (74) 212 (37) A=16, B=8192, J=24435419, t0=1941393531395154
10^18 3216 (70) 1844 (35) A=16, B=8192, J=226784107, t0=170553583898990072
A quick test (using the #343 code) indicates that halving K would improve run time of the J-loop by another ~40% (but probably at the cost of increasing time of the other stages and/or the number of zeros isolated).
This is interesting because it means that by running the j loop twice each with a different k range the peak memory use of the program could be cut in half at a cost of increasing the cpu usage by 20%. It could be a useful trade-off because for example in Platt's thesis "The value of N was the largest power of 2 that allowed the process to run in < 1 Gbyte of memory as this allowed us to make best use of the multi-core CPUs we had available to us." Well now that I think about it the second range of k might not be as efficient as the first range so maybe it would increase the cpu usage by more than 20%.
It would be great if you could add a zeta zeros program to the examples directory so that others can play with this easily.
This leaves open my suggestion in #320 to let the A=16-heuristics kick in at n=10^16 instead of at n=10^15. Any thoughts on that?
In https://github.com/fredrik-johansson/arb/pull/346 I've made this change. Both the A=8 and A=16 models are now interpolated in regions that overlap between n=1e7 and n=5e22. This has changed which parameter values are used for each height according to the tuning script, but I haven't actually run the local hardy z zeros function for heights that aren't covered by the unit tests. I'm curious how the local hardy z function behaves for each model in this interval and I wonder where is the best place to switch between models, in terms of number of zeros per cpu cost.
It would be great if you could add a zeta zeros program to the examples directory so that others can play with this easily.
That PR also introduces tested wrapper functions which could perhaps be used by an example program in the future.
I'm curious how the local hardy z function behaves for each model in this interval and I wonder where is the best place to switch between models, in terms of number of zeros per cpu cost.
I will run 10^7-10^22 for both models A=8, B=4096 and A=16, B=8192, requesting 40,000 and 150,000 zeros respectively, and using prec=256. Also will include the median relative accuracy of the zeros isolated.
In #346 I've made this change.
A small comment on the pull requests: in the documentation of acb_dirichlet_platt_zeta_zeros it could be helpful to state explicitly that the user has the option of multi-threading through flint_set_num_threads(numthreads);. Maybe this line could also be included in the wrappers (once they land in the example directory).
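For example, a minimal call sequence could look like the sketch below (assuming the signature slong acb_dirichlet_platt_local_hardy_z_zeros(arb_ptr res, const fmpz_t n, slong len, slong prec), where n is the index of the first requested zero):

#include "flint/flint.h"
#include "flint/fmpz.h"
#include "arb.h"
#include "acb_dirichlet.h"

int main(void)
{
    slong len = 1000, prec = 256, found;
    arb_ptr zeros = _arb_vec_init(len);
    fmpz_t n;

    flint_set_num_threads(8);    /* opt in to multi-threading */

    fmpz_init(n);
    fmpz_set_ui(n, 10);
    fmpz_pow_ui(n, n, 15);       /* start at the 10^15-th zero */

    found = acb_dirichlet_platt_local_hardy_z_zeros(zeros, n, len, prec);
    flint_printf("isolated %wd zeros\n", found);

    fmpz_clear(n);
    _arb_vec_clear(zeros, len);
    flint_cleanup();
    return 0;
}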
And I tried to be so careful that each thread gets exactly the same number of arb calculations in the dft multithreading
The DFT multi-threading code must have been done very elegantly, just watch how the threads get intertwined beautifully :-)
A small comment on the pull requests: in the documentation of acb_dirichlet_platt_zeta_zeros it could be helpful to state explicitly that the user has the option of multi-threading through flint_set_num_threads(numthreads);. Maybe this line could also be included in the wrappers (once they land in the example directory).
The PR is now updated to mention multi-threading in the zeta zeros docs.
Here is the data for the A=8 and A=16 models across the range 1e7..1e22. I also included 1e4..1e6.
Initial observations:
- 1e4 yields dramatically fewer zeros.
- There is a jump in timings at 1e16, which seems to be caused by the multi-threading for the S_table kicking in for J > 10 million. Height 1e15 has J=8,047,580, hence is excluded from multi-threading. Maybe relaxing the constraint on J a bit would result in more steadily increasing timings?
- Purely considering CPU-time / #zeros_found, the A=8 model seems to beat A=16 up to 1e20, however when taking the median relative accuracy into account as well, the A=16 model starts to win from 1e18 upwards.
- The run times remain more or less the same when compared to version #344, however for model A=16 at heights >1e19 they become increasingly worse.
Thanks for the data!
1e4 yields dramatically fewer zeros.
Yes, my reasoning is that the multieval strategy is probably not practical at such low n (compared to direct evaluation of zeta) but if it's fast then at least it can be used for unit tests. So if it finds only 1% as many zeros but runs in 2% of the time thereby bringing run time down from a minute to a second then I think it's worth it.
Maybe relaxing the constraint on J a bit would result in more steadily increasing timings?
In the PR I've now completely removed the constraint. The main reason for the limit was to prevent unnecessary memory usage, but that reason no longer applies now that the memory overhead of threads has been reduced.
Purely considering CPU-time / #zeros_found, the A=8 model seems to beat A=16 up to 1e20, however when taking the median relative accuracy into account as well, the A=16 model starts to win from 1e18 upwards.
In the PR I've moved the threshold from 2e15 up to 2e17.
The run times remain more or less the same when compared to version #344, however for model A=16 at heights >1e19 they become increasingly worse.
I've left this alone for now.
In the PR I've now completely removed the constraint. (..) In the PR I've moved the threshold from 2e15 up to 2e17
Great, after these changes it looks like this (prec=256, 64 threads/32 cores):
relative acc (bits)
10^x #found cpu(s) wall(s) lowest median highest
4 24 0.335 0.296 24 58 69
5 1998 203.272 53.112 20 169 222
6 2566 211.309 59.422 22 170 222
7 3331 221.839 69.719 27 171 223
8 3851 222.739 74.851 33 173 223
9 4495 225.625 86.052 32 172 222
10 5153 229.959 95.071 37 173 223
11 5924 236.096 104.810 38 174 223
12 6665 240.604 113.533 43 175 223
13 7360 250.961 122.550 51 175 221
14 7862 277.876 126.116 50 163 204
15 8251 350.273 129.422 56 152 189
16 8494 575.500 137.822 60 141 173
17 8746 1220.52 148.359 63 130 157
18 14182 7601.58 492.587 64 177 220
19 15034 21885.4 733.784 66 177 219
20 15985 65780.2 1446.51 71 178 218
21 16866 195114 3510.46 74 178 217
22 17804 582717 9641.12 76 179 217
The unconstrained multi-threading yields some additional benefits, and the kick-in point of the A=16 model now looks right.
(...) In particular the way K is tuned right now, the tuning script tries to find K such that the estimated K truncation error matches the estimated Shannon-Whittaker interpolation error which in turn is determined partly by the A and B parameters that determine the shape of the grid. If the script has to double K to decrease the error to match, it will do it even if it only yields a few or even no extra isolated zeros and doubles the execution time. (...)
Your conjecture appears to be strongly supported by the data. I ran the latest version (#346) with K=K/2. This doesn't yield any significant benefits for the A=8 model, however when applied to the A=16 model (>1e17) it induces a significant improvement. There is a relatively small drop in the number of zeros isolated as well as in their accuracies, however this is compensated by a step change improvement in run times (up to 40%). Also tried K=K/4, but that clearly is a step too far and would halve the number of zeros isolated.
Not sure how far you'd like to stretch the parameter tuning, but this one feels like 'low hanging fruit' for the A=16 model.
*) I took the top line of the 1e23 data from here (PR #344). Since there is a slight increase in timings at these heights after #346 and I didn't rerun 1e23 with the #346 code, I augmented these particular run times by 3% (in the same ballpark as the increase observed at 1e22).
At these large heights where A=16 is better than A=8 and the J loop cpu usage dominates the convolution cpu usage, I think that further increasing (doubling) B would probably also greatly increase the number of zeros per cpu cost but it would probably require re-tuning the other parameters and would increase the peak memory usage. I haven't checked this though.
I'm closing this because the issue in the title has been solved.
See https://github.com/fredrik-johansson/arb/issues/320#issuecomment-696254301.