flintlib / arb

Arb has been merged into FLINT -- use https://github.com/flintlib/flint/ instead
http://arblib.org/
GNU Lesser General Public License v2.1

Experiment with function acb_dirichlet_platt_local_hardy_z_zeros #320

Closed: rudolph-git-acc closed this issue 2 years ago

rudolph-git-acc commented 4 years ago

For a project, I would like to compute the first 10,000 zeros of Z(t) starting at the n-th zero, with n=10^k for k=5..20.

The function acb_dirichlet_hardy_z_zeros is feasible to use up to 10^15; for higher k, Platt's method is required, and this has been accommodated in acb_dirichlet_platt_local_hardy_z_zeros. This function works impressively fast and I used it to successfully confirm the results up to 10^15, and I also managed to compute the required 10K zeros at n=10^16, 10^17 and 10^19.

However, for 10^18 and 10^20, the function frequently fails to produce the zeros or aborts with the error message: "expected the first node in the list, Abort trap: 6". Example code that reproduces the error is included below (n=10^18 + 200, with at most 100 zeros computed).

I have tried varying the count, the prec, and different starting heights around 10^18, however nothing seems to fix the issue. I do realise the code was developed by 'the mysterious D.H.J. Polymath' and also that the many parameters required for Platt's method are quite 'delicate' to establish properly in an automated way, however I'd still be grateful for any thoughts on this.

#include "acb_dirichlet.h"
#include "flint/profiler.h"

int main()
{
    slong i, count, prec;

    count = 100;
    prec = 512;

    arb_ptr pa;
    pa = _arb_vec_init(count);

    fmpz_t n;
    fmpz_init(n);
    fmpz_set_str(n, "1000000000000000200", 10);

    printf("Platt_local_hardy_z_zeros....\n");

    acb_dirichlet_platt_local_hardy_z_zeros(pa, n, count, prec);

    for (i = 0; i < count; i++)
    {
        arb_printd(pa+i, 30); printf("\n");
    }

    fmpz_clear(n);

    _arb_vec_clear(pa, count);

    flint_cleanup();

    return 0;
}
p15-git-acc commented 4 years ago

Hi Rudolph! I'm sorry that the function isn't working in that range. I haven't looked at that code for a long time, but maybe I'll revisit it. I admit I'm a little demotivated to work on it further now that https://arxiv.org/pdf/1904.12438.pdf and especially https://arxiv.org/pdf/2004.09765.pdf have already been published. Also I notice now that Platt has uploaded his relevant arb code to his github at https://github.com/djplatt/code . He is the actual expert in this field so I feel kind of silly trying to work out what he is doing in his papers and putting it into arb, when he himself has already done all of this. Maybe he would be willing to submit a few of his functions of general interest upstream into arb?

rudolph-git-acc commented 4 years ago

@p15-git-acc Hi p15! No worries, you've coded some impressive additional functions for ARB.

The apparent parameter tuning issue in acb_dirichlet_platt_local_hardy_z_zeros seems to be a delicate one. The parameters at height 10^18+200 are:

[image: parameter values at height 10^18+200]

and following a trial and error process, I have slightly changed each of them in isolation. Unfortunately I did not get the desired improvements. The strange thing is that the issues seem to start mid 10^17 and disappear in the 10^19 domain. But then reappear in the 10^20 range. Hope the issue is not caused by the code stumbling on some zeros lying off the critical line ;-)

I also had a quick look at Platt's code and got the impression (not sure) that his code has been optimised up to height 10^13 (maybe for the recent record verification of the RH?). Agree it would be great if some of his code were incorporated into ARB.

p15-git-acc commented 4 years ago

Before getting back into the parameter tuning, I remember wanting a better upper incomplete gamma implementation.

That function was a speed bottleneck in the tuning search and I had to wrap it in a loop to get even one bit of accuracy. It was for Gamma(s, z) where s and z were both somewhat large (hundreds), real-valued, and with s near z (absolute difference less than about 5.0). I guess this is called the 'transition region' and is known to be more difficult than the tail regions. Should direct numerical integration be considered as an implementation option for the upper incomplete gamma in arb, alongside the several hypergeometric implementation options? I don't know if it would be faster or more accurate, or if it would even be a good idea to use integration internally like that.
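
For concreteness, here is a minimal sketch (not the actual tuning code) of the kind of precision-doubling wrapper described above; gamma_upper_retry is a hypothetical helper name and the precision cap is arbitrary:

#include "acb_hypgeom.h"

/* Retry Gamma(s, z) at increasing working precision until the result has at
   least one bit of relative accuracy, or a generous precision cap is hit. */
void gamma_upper_retry(acb_t res, const acb_t s, const acb_t z, slong prec)
{
    slong wp;

    for (wp = prec; ; wp *= 2)
    {
        acb_hypgeom_gamma_upper(res, s, z, 0, wp);  /* 0 = non-regularized */

        if (acb_rel_accuracy_bits(res) >= 1 || wp > 64 * prec)
            break;
    }
}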

Also the incomplete gamma functions (along with regularization modes 0, 1 and 0, 1, 2 for upper and lower respectively) aren't on fungrim yet.

fredrik-johansson commented 4 years ago

Yes, it should be considered. Numerical integration is the easiest solution. For efficiency it would probably be best to use the uniform asymptotic expansions in https://doi.org/10.1090/mcom/3391 but AFAIK they don't come with error bounds.

There is a large-parameter expansion with error bounds in https://doi.org/10.1137/0510071 but I think it's not applicable in the transition region.
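
For illustration, here is a rough sketch of what direct integration could look like with acb_calc_integrate. It is not a rigorous implementation: the integral is truncated at an arbitrary cutoff and the discarded tail is not bounded, and gamma_upper_by_integration is just a placeholder name.

#include "acb_calc.h"

/* integrand t^(s-1) * exp(-t), holomorphic for Re(t) > 0 */
static int
integrand(acb_ptr out, const acb_t t, void * param, slong order, slong prec)
{
    acb_srcptr s = (acb_srcptr) param;
    acb_t e;

    if (order > 1)
        flint_abort();  /* the integrator only requests order 0 or 1 */

    if (!arb_is_positive(acb_realref(t)))
    {
        /* stay away from the branch cut of t^(s-1) on (-inf, 0] */
        acb_indeterminate(out);
        return 0;
    }

    acb_init(e);
    acb_sub_ui(out, s, 1, prec);
    acb_pow(out, t, out, prec);      /* t^(s-1) */
    acb_neg(e, t);
    acb_exp(e, e, prec);             /* exp(-t) */
    acb_mul(out, out, e, prec);
    acb_clear(e);
    return 0;
}

/* Gamma(s, z) ~ integral from z to U of t^(s-1) exp(-t) dt, with the tail
   beyond the ad hoc cutoff U simply ignored (illustration only). */
void gamma_upper_by_integration(acb_t res, const acb_t s, const acb_t z, slong prec)
{
    acb_t U;
    mag_t tol;

    acb_init(U);
    mag_init(tol);

    acb_add_ui(U, z, 60 * prec, prec);       /* ad hoc truncation point */
    mag_set_ui_2exp_si(tol, 1, -prec);       /* absolute tolerance 2^-prec */

    acb_calc_integrate(res, integrand, (void *) s, z, U, prec, tol, NULL, prec);

    acb_clear(U);
    mag_clear(tol);
}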

p15-git-acc commented 4 years ago

Hi Rudolph, thanks again for reporting your issues. I ran your minimal example code a few times (it takes an hour on my computer!) and I see that you can work around the problem by subtracting B/4 from T. This change is a tradeoff between the probability that the search will succeed vs. the number of zeros the search will find if it does succeed. Of course a better solution would be to improve the tuning parameter heuristics...

In https://github.com/fredrik-johansson/arb/pull/324 I've also fixed an issue that was causing the code to abort instead of returning the number of zeros found (0) when the search failed in the specific way that you encountered. That issue was also causing the code to check an incorrect condition as part of Turing's method.

rudolph-git-acc commented 4 years ago

Success! Subtracting B/4 from T induces a step-change improvement. Many thanks for working on this.

Already managed to generate the first 10K zeros at n=10^16-10^19 and will run 10^20 overnight (EDIT: successfully completed). The results at 10^16 are now also much more accurate (more digits computed) compared to the earlier runs. Being the lucky owner of an AMD Threadripper CPU with 32 cores, a multi-threaded program that splits the 10K into smaller chunks does speed up computations considerably. Will also attempt the 10^21 domain next week to verify the results against Odlyzko's zeros.

Regarding the parameters, I noticed that variable J rapidly increases with the height chosen. E.g. at 10^20 it reaches a value just below 2x10^9. It appears that J is used in a loop from 1...J in the platt_smk(..)-function that is part of the platt_multieval.c-code. Since most CPUs have multiple cores these days, I wondered whether a multi-threaded approach to break down this loop into smaller pieces could help accelerate things. Not sure though whether this loop really is a 'critical path' process.

Out of curiosity, where does the upper limit of 3×10^22 for the validity of acb_dirichlet_platt_local_hardy_z_zeros originate from? Does Platt's method break down after this point or is it just that the parameters haven't been tuned yet beyond this height?

p15-git-acc commented 4 years ago

I'm glad that change works! I don't know what's going on with J; it seems ridiculous to repeat a loop that many times. Honestly I don't remember exactly why I stopped tuning the parameters at 3x10^22, but maybe beyond that point it became impractical for me to run the function even once in one sitting. It's also possible that A=8 stopped working at that point? I liked having A and B as powers of 2 because it made parts of the code simpler, and moving to A=16 would mean re-tuning all of the other parameters from scratch rather than just perturbatively.

Thanks for pointing out that the Odlyzko zeros might be barely in reach of the arb zeta zeros implementation. That looks like a good test, or a motivation to extend the parameter tuning a little bit.

Edit: According to my notes Platt said J should be like the square root of the height, so if the height is like 10^20 then it makes sense that J would be like 10^10. My tuning procedure set J and K so that their associated truncation errors were comparable to the other errors in the calculation. Neither of those parameters were explicitly penalized for their size. I'm not sure how multithreading works within arb, but if it's available then it could be worth making a multithreaded variant of the zeta zeros function.

rudolph-git-acc commented 4 years ago

About the runs: Finished the run at 10^21. The good news is that the zeros found do match the Odlyzko zeros! The bad news is that the same issue as before has reappeared and not all zeros are being computed (output: zeros10e21.txt). The 10^19 and 10^20 domains still ran well, although the 10^20 run showed some loss of digits. I ran the latter twice, with prec=512 and with prec=800, however that doesn't make any difference. Is there a similar trick possible by adjusting T by B a bit further? My gut feeling is that the issue is somehow related to the increasing density of the zeros at larger heights. I'd be delighted to carry out any further test runs based on your suggestions.

About the 1...J loop: During the Polymath 15 project, I have actually used some multi-threading logic in ARB. Here is an example program that demonstrates how a large loop could be broken up into smaller pieces: examplecodemulithreading.txt. The code computes the sum of the ordinates of N zeros of the Zeta-function starting at a given height (useless of course, just needed a 'meaty' task). So, ./examplecodemultithreading 100 100000 6 computes the sum of the ordinates for the first 100 zeros starting at the 100001-th zero and uses 6 threads. Run times improve almost linearly with the number of physical threads a CPU supports.

p15-git-acc commented 4 years ago

When you say 'the same issue as before has reappeared' what exactly do you mean? Did the same abort message reappear?

I guess you knew this already, but the zeros function returns the number of zeros it found, and only this many entries in the output array are filled by the function call. The rest of the entries are untouched. If they are initially zero (as appears to be the case in your example) then they stay as zero. If they have garbage or reused numbers in them then these will also appear in the output if the function does not find enough zeros to overwrite that part of the array.
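
In other words, something like the following usage pattern (a sketch based on your example program above, replacing its print loop; 'found' is just a new local variable) only prints the zeros that were actually isolated:

slong found;

found = acb_dirichlet_platt_local_hardy_z_zeros(pa, n, count, prec);
flint_printf("isolated %wd of the %wd requested zeros\n", found, count);

for (i = 0; i < found; i++)   /* entries pa[found..count-1] are untouched */
{
    arb_printd(pa + i, 30);
    flint_printf("\n");
}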

No real improvement should be possible by further adjusting T. Instead I think A and B need to be adjusted (in addition to all other parameters to compensate). The heuristic parameter selection currently has A hardcoded to 8. At large heights this is not enough. Testing and tuning larger parameter values (for example doubling A and B) is annoying because it would take the large part of a day to run even a single function call with one choice of parameter values.

Thanks for your multithreading example. I knew about multithreading in general, but I was wondering more about how it would work internal to arb, as part of the arb library itself, not how to use multithreading in a program that also uses arb. But I see there are some arb functions with a _threaded suffix that could be taken as examples. I also see that arb uses a flint_get_num_threads function internally. The J loop in particular is not completely straightforward to parallelize, because it writes to a giant array in a somewhat unpredictable fashion. If the computer has enough memory to allow each thread to get its own giant working array and then combine them at the end then it's ok.
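
To make that "one table per thread, merge at the end" idea concrete, here is a minimal sketch of the pattern. The per-j work below is just a placeholder (platt_smk does something more involved), and the chunking and the names worker_arg_t, worker and threaded_loop are only for illustration.

#include <pthread.h>
#include "arb.h"

typedef struct
{
    arb_ptr table;     /* this thread's private copy of the table */
    slong table_len;
    slong j_start;     /* this thread's chunk of the 1..J loop */
    slong j_stop;      /* exclusive */
    slong prec;
} worker_arg_t;

static void *
worker(void * arg_)
{
    worker_arg_t * arg = (worker_arg_t *) arg_;
    slong j;
    arb_t term;

    arb_init(term);

    for (j = arg->j_start; j < arg->j_stop; j++)
    {
        /* placeholder for the real per-j work: compute some term and
           accumulate it into some position of the private table */
        arb_set_si(term, j);
        arb_inv(term, term, arg->prec);
        arb_add(arg->table + (j % arg->table_len),
                arg->table + (j % arg->table_len), term, arg->prec);
    }

    arb_clear(term);
    flint_cleanup();   /* free this thread's flint/arb caches */
    return NULL;
}

void
threaded_loop(arb_ptr table, slong table_len, slong J, slong prec)
{
    slong i, nthreads = flint_get_num_threads();
    pthread_t * threads = flint_malloc(nthreads * sizeof(pthread_t));
    worker_arg_t * args = flint_malloc(nthreads * sizeof(worker_arg_t));

    for (i = 0; i < nthreads; i++)
    {
        /* each thread gets its own working table */
        args[i].table = _arb_vec_init(table_len);
        args[i].table_len = table_len;
        args[i].j_start = 1 + i * (J / nthreads);
        args[i].j_stop = (i == nthreads - 1) ? J + 1 : 1 + (i + 1) * (J / nthreads);
        args[i].prec = prec;
        pthread_create(&threads[i], NULL, worker, &args[i]);
    }

    for (i = 0; i < nthreads; i++)
    {
        /* wait for each worker and fold its table into the shared one */
        pthread_join(threads[i], NULL);
        _arb_vec_add(table, table, args[i].table, table_len, prec);
        _arb_vec_clear(args[i].table, table_len);
    }

    flint_free(threads);
    flint_free(args);
}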

rudolph-git-acc commented 4 years ago

To be precise: the abort message did not reappear. The only issues I encountered were that not all zeros are computed and the number of computed digits for the zeros that are found is reduced.

The testing/tuning at these heights is indeed going to take quite a bit of time, but I'd be happy to spend it. My dream goal would be to fully replicate Odlyzko's 10^22 run on a home computer (probably requiring many days running time) :-)

Thanks for the steer on the ARB threading. I had actually overlooked the threaded versions of ARB functions like e.g. arb_mat_mul_threaded. Browsing through its code, it looks quite straightforward. However, as you already mentioned, the output of the J-loop is the large S_table, which has size N (=AxB) x K and is filled in a fairly unpredictable order.

A test run of the J-loop-only script (with the current A and B) at 10^18 takes 21 minutes to complete on my desktop (J=208339031). This accounts for ~70% of the overall time required to produce 1000 zeros. The length of the J-loop increases by a factor of ~3.1 with each power of 10, so I estimate that ~32-36 hours (21 minutes x 3.1^4) are required to complete the J-loop at 10^22 (J=18719904330). So, if my calculations are correct, it does seem quite worthwhile to 'multi-thread' the J-loop for these heights (and also for testing/tuning purposes).

After writing the above earlier today, I got curious and decided to take a stab at the code, in line with the _threaded ARB functions, to test what would happen. Here is the code used: jloop.txt. Note: platt_smk (containing the J-loop) and a few associated functions have been 'carved out' of the platt_multieval.c code to make the tests a bit simpler.

First insights: the benefits from multi-threading at heights < 10^12 are very limited, but they quickly improve with increasing height. For instance, completing the J-loop at height 10^16 takes 270.6 seconds with 1 thread, 20.9 secs with 16 threads, but still 16.3 secs with 32 threads (using the current A and B values). There clearly is a trade-off between the number of threads and the reconciliation work required to reconstruct the large S_table at the end. So, there appear to be diminishing returns from adding incremental threads, and there could even be a break-even point after which more threads work adversely.

On memory consumption: at 10^22 with A = 16, B = 8192, launching 32 threads claims ~7 GB of memory, so this doesn't look like a bottleneck.

I did encounter one persistent issue: the line num_threads = flint_get_num_threads(); should give the number of threads available on a PC (I assume), however I can't make it work on any of my machines (it is always 1). Grateful for any thoughts on this. EDIT: solved. I had somehow erroneously expected that the number of available threads was established automatically for a CPU during the Flint installation, however it simply requires putting flint_set_num_threads(n); at the beginning of the code and you're done. A threaded function then picks it up with num_threads = flint_get_num_threads();.

p15-git-acc commented 4 years ago

Here's the flint/arb information I could find on multithreading usage and conventions:

http://flintlib.org/doc/threading.html
https://github.com/wbhart/flint2/blob/trunk/code_conventions.txt#L133
http://arblib.org/issues.html#thread-safety-and-caches

That's good news that the table data doesn't require excessive memory even when each thread has its own table. The multithreading is straightforward in that case.

rudolph-git-acc commented 4 years ago

I have written a multi-threaded version of platt_multieval.c and made various runs with it. The improved timings have made exploring the larger heights much more tractable now. I have used 64 threads (of which 32 physical), kept A=8 and B=4096, and experimented with prec, the number of zeros requested, and subtracting B/4 from T or not. Results are in the table below:

[image: table of results for runs at various heights]

Observations: to obtain sufficient digits for the zeros found, prec = 256 looks OK for heights from 10^16 up to 10^22. The number of zeros requested does influence the outcome: running e.g. 10^18 in 10 batches of 1000 zeros works, whereas requesting 10000 zeros in one run fails. The adjustment T=T-B/4 is already quite beneficial at 10^16, however its effect fades away at greater heights, and a full re-tuning of the parameters indeed seems unavoidable from 10^20 onwards.

As input for the re-tuning, I made this summary of the parameters at the various heights (A=8, B=4096, Ns_max=200):

[image: summary of the parameters at the various heights]

As you suggested before, we should probably start by putting A=2x8, B=2x4096(?), to retain the coding advantages of powers of 2. The question now is in which direction the other parameters should move to compensate for that. Happy to do further 'tuning' runs based on your advice.

P.S.: 1) Parallelising the J-loop to produce the S_table has had a significant impact on the run times. I wonder if the follow-up processes, like the actual DFT steps, could be multi-threaded as well? On this page it is mentioned that FLINT does use multi-threading for its FFT routines.

2) Since the number of zeros requested does influence how many zeros are actually generated, I probably need to rerun the scripts a couple of times to reach the desired 10000 zeros in batches of say 1000. I therefore wondered whether the S_table could be generated only once and then be re-used for the next batches of 1000. The A, B, J, K parameters don't seem to change at all and the parameter t0 (=T) only changes a little bit (this could be a problem though, since T represents the integer at the center of the evaluation grid...).

p15-git-acc commented 4 years ago

Here's a summary of my parameter tuning notes for the local zeros algorithm that I dug up from last year.

The general tuning strategy was very ad hoc. Some parameters were fixed, some subsets of parameters were jointly optimized to minimize error, and some parameters were optimized so that their induced errors were relatively small compared to those of other parameters. The parameters were tuned for the vicinity of the nth zero where n was a power of ~sqrt(10) and the tuned values for smaller n were used as initial guesses for larger n. Notice that this strategy does not involve actually computing zeros, although this was done as a final post-tuning check (I see that this check was done only up to 10^18).

Here's how that worked for specific parameters. A, B, prec, and Ns_max were fixed (not automatically tuned). The Whittaker-Shannon interpolation parameters H and sigma_interp were then jointly optimized to minimize estimated interpolation error. The h, sigma_grid, J, and K parameters were then jointly optimized to minimize estimated grid error, subject to the constraints that the J and K truncation errors were comparable in size to the interpolation error.

If I remember correctly the optimization wasn't really well behaved. There may have been some misleading local optima so that parameter value x was better than x+dx or x-dx but worse than say x+100dx. Some of the parameters may have been correlated so that changing only one at a time gives only negligible improvement but changing both at the same time could give large improvement. Some of the parameter values affect the evaluation time. There were problems with incomplete gamma evaluation as part of the error calculation, but I hope this will be fixed with https://github.com/fredrik-johansson/arb/pull/326 and https://github.com/fredrik-johansson/arb/pull/291.

I therefore wondered whether the S_table could be generated only once and then be re-used for the next batches

I don't think this is possible for that table in particular, but the general idea of reusing calculations across evaluation grids is definitely possible. That was planned to be implemented as acb_dirichlet_platt_hardy_z_zeros.

rudolph-git-acc commented 4 years ago

Many thanks for sharing the tuning strategy. Optimising so many (mutually dependent) parameters sounds like a very delicate process.

In the meantime I have experimented further with the parameters at height 10^21 and found a much faster way to carry out subsequent runs at this height. Firstly, fix the height, prec, Ns_max and the parameters A, B, J, K, T (I picked K=26 and kept the others, including A & B, unchanged) and produce and store the S_table once. With the pre-stored S_table, each experimental run now only takes 35 secs to complete for 100 requested zeros. The next step is to 'play' with the other parameters H, sigma_interp, h, sigma_grid and check their impact on the accuracy of the zeros produced. Following this process I managed to boost the accuracy of the zeros found at 10^21 to 11-14 digits using the parameters H=2.68, sigma_interp=23, h=166, sigma_grid=1899. Odlyzko zero (~6 digits): ...538.4980696, ARB zero (~13 digits): ...538.4980696317331.

I immediately acknowledge this is just the result of a trial & error approach and it could just be a local optimum. However, I do wonder whether the optimisation process for these 4 parameters could somehow be automated this way. With the multithreaded approach, producing S_tables for a range of integer values of K (or J) is easy to do and the feedback loop after individual parameter changes is only 35 secs. The first target should be to generate some zeros at all, and when this has been accomplished, their accuracy could then be improved iteratively.

p15-git-acc commented 3 years ago

Here are some parameters if you want to try n near 1e22. These were found by searching for small error terms without actually computing the zeros or even making that table. I don't know if it would actually work.

n = 10^22
A = 16
B = 8192
J = 19957244503
K = 63
sigma_grid = 3901
sigma_interp = 19
prec = 400
h = 175.5
H = 0.8125
Ns_max = 300
t0 = 1370919909931995308226

rudolph-git-acc commented 3 years ago

Huge success! These parameters yield all the desired 10,000 zeros at an amazing 74-87 (!) digits of accuracy. Checked a couple of them against the Odlyzko zeros and they all match. Actually gave me goosebumps when I saw these results :-)

EDIT: output 10^22: zeros10e22.txt. I also managed to generate 25,000 zeros in a single 'batch', where the accuracy starts at ~87 digits (first zero) and drops to ~7 (last zero).

For the run, I could only use 32 threads with these parameters since memory consumption went up to 25.3 GB (32 GB available). Running time was ~7.7 hours. Will now also try 3x10^22 to check if this set still works at the currently specified upper limit of acb_dirichlet_platt_local_hardy_z_zeros (EDIT: that didn't work, but it was a wild shot anyway. This clearly would require further tuning).

rudolph-git-acc commented 3 years ago

Quick recap of this thread and possible next steps to close the issue:

Upper incomplete gamma function: Should now be fixed with #326 and #291. No work required in platt_local_hardy_z_zeros.c or platt_multieval.c.

Multi-threading: Multi-threading the J-loop in platt_smk(..) in platt_multieval.c delivered significant speed improvements at heights > 10^15. I believe it would be beneficial to incorporate this in ARB. This should be pretty straightforward to code and here is a draft version of it. Should a dedicated _threaded version be created or not? The Flint guidelines appear to discourage such an extension:

The user should not have to run specialised versions of functions to get threading. This means that user facing functions should generally not have _threaded appended to their name. Either there is a single function that does the job, and it happens to be threaded, or there is a best-of-breed function that calls the appropriate threaded function when this is the best strategy.

There are some instances where it may be desirable (e.g. for testing purposes, or because naming proves difficult) where one wants a _threaded in the name. But these cases should be rare.

A _threaded version would also imply having to maintain two versions of basically the same code. Also, other code that calls the threaded function, like platt_local_hardy_z_zeros.c, would need to be changed. I am therefore leaning towards a single version, but I prefer to leave this decision and its final coding in the hands of the true ARB experts.

Further fine-tuning: I am already very satisfied with all the 10K datasets generated up to 10^22. However, I realise the current code hasn't been fully optimised yet. It works really well up to n=10^18, and this could now simply be extended to n=10^20 by subtracting B/4 from T. For n=10^22, a dedicated set of parameters has been established and produced unequalled results.

As a next step, I could imagine repeating the n=10^22 parameter tuning for a few more n as powers of ~sqrt(10) between say n=10^19 - n=10^22. And then for all relevant parameters to extend the current second and third degree polynomials to interpolate the intervals between these n. Is this a viable path forward to close the issue at hand? I'd be happy to do any further test runs at these heights!

p15-git-acc commented 3 years ago

I just want to clarify that when you referred to 'digits of accuracy' a couple posts above, you meant digits of accuracy after the decimal place, right? As in, if you saw a number like 1370919909931995308226.627511878006490223 +/- 2.2058e-2 then you would say it has 2 digits of accuracy?

rudolph-git-acc commented 3 years ago

Correct!
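
For reference, a rough way to count digits after the decimal point from an arb ball is to take -log10 of the error radius; the helper below (digits_after_point) is just an illustration of that convention, not the script actually used for the tables.

#include <math.h>
#include "arb.h"

/* rough count of correct digits after the decimal point: -log10(radius) */
double digits_after_point(const arb_t x)
{
    double d;
    mag_t r;

    if (mag_is_zero(arb_radref(x)))
        return INFINITY;   /* exact enclosure */

    mag_init(r);
    mag_set(r, arb_radref(x));
    d = -mag_get_d_log2_approx(r) * 0.3010299956639812;  /* log10(2) */
    mag_clear(r);

    return d;
}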

p15-git-acc commented 3 years ago

As a next step, I could imagine repeating the n=10^22 parameter tuning for a few more n as powers of ~sqrt(10) between say n=10^19 - n=10^22. And then for all relevant parameters to extend the current second and third degree polynomials to interpolate the intervals between these n. Is this a viable path forward to close the issue at hand? I'd be happy to do any further test runs at these heights!

If you'd like to help work on this, I've uploaded a tuning script https://github.com/p15-git-acc/tune-params. It has a lot of room for improvement. The reason for some of its roughness is that it's left over from when the incomplete gamma function was less reliable and I was trying to track down the problem with the resulting bad error bounds.

rudolph-git-acc commented 3 years ago

Thanks for this very powerful tuning tool! I have made many runs with it today and can confirm that with the fixed parameters A=16, B=8192, Ns_max=300, prec=400, one can successfully generate 10,000 zeros at heights n=1x10^17-1x10^21, with a stunning ~80-90 digits of accuracy. Here are the recommended parameters used and the run times achieved (32 threads): tuning parameters.txt. Note that sigma_interp remains flat at 19 while the other parameters increase/decrease quite steadily.

For n=1x10^18, I have also tried various runs with A=8, B=4096 and/or prec=256, however the best results achieved with these parameters gave a 'meagre' 7-15 digits of accuracy (although I could now produce the 10,000 zeros in a single batch run, which didn't work before).

Given the superior accuracy that could be achieved, the question is at which height should the doubled A,B 'kick in'? Is this overkill for heights < 10^19? It obviously comes at the cost of longer running times, which cannot be fully mitigated by increasing the number of threads (threaded S_tables of size Kx(AxB) now take more memory, and resource limits will be hit sooner). Also, not all users of acb_dirichlet_platt_local_hardy_z_zeros will have a large number of threads available.

On the other hand, the user would get access to a fully tuned function that generates the required zeros at unequalled accuracy. The user could still lower prec; e.g. at n=1x10^18, lowering to prec=256 reduces accuracy to 40-48 digits and, using 32 threads, the run time for 10,000 zeros improves from 1216 to 990 secs. The multieval function also remains available to the user, and all individual parameters could still be selected/tuned as desired.

Seems there is a trade-off to be made between speed and quality... Do you have any thoughts/preferences on this?

You did mention there is still room for improvement in the tuning tool, so would it be helpful if I already start making more detailed parameter-estimation runs, for instance at (1...9)x10^20, or would you prefer to do further optimisations on the tool first?

p15-git-acc commented 3 years ago

Given the superior accuracy that could be achieved, the question is at which height should the doubled A,B 'kick in'? Is this overkill for heights < 10^19?

Seems there is a trade-off to be made between speed and quality... Do you have any thoughts/preferences on this?

There are complicated trade-offs among speed, memory usage, zeros accuracy, the number of zeros that are likely to be isolated in a single batch, and the probability of failing to isolate the first requested zero thereby causing the batch to fail to report any zeros. I think that where possible the interface should follow the arb design of giving increasingly accurate results if the user keeps increasing prec, implying that if everything else is equal then the doubled A and B should kick in when the user increases prec enough. Currently the prec passed to acb_dirichlet_platt_local_hardy_z_zeros is interpreted as a working precision, but I think it might be better to take it more seriously as a target accuracy (in the sense of arb_rel_accuracy_bits) for at least some of the zeros because this would help guide trade-off decisions.

By the way, I updated some stuff in that tuning repo. In particular it now tries to separate the precision that it uses to compute error bounds from the prec that it prescribes for the zeros calculation. In an example I tried, this sped it up from like 40 minutes to 1 minute. I don't know if the formula connecting the two precs is good. It's more of a placeholder.

EDIT: A step towards improving the policy might be to always subtract the B/4 from T (this means that half of the zeros that are possible to find inside the multieval grid are discarded, but it has other benefits as you have seen) and then use the working precision, A, B, and other parameters that are predicted to yield the cheapest high-precision isolated zeros. By high-precision I mean having arb_rel_accuracy_bits at least as great as the prec requested by the user and by cheapest I mean the least amount of cpu usage per high-precision zero. Making these predictions is outside the scope of the parameter tuning script. In particular the script doesn't directly model the number of isolated zeros, the accuracy distribution of the isolated zeros, or the cpu usage.
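
In that reading of prec, a caller (or the function itself) could check each returned zero against the target along these lines; count_zeros_meeting_target is only a hypothetical helper, with 'zeros', 'found' and 'prec' standing for the output vector, the number of zeros returned and the requested precision:

#include "arb.h"

/* how many of the returned zeros meet the target accuracy 'prec'
   in the sense of arb_rel_accuracy_bits? */
slong count_zeros_meeting_target(arb_srcptr zeros, slong found, slong prec)
{
    slong i, ok = 0;

    for (i = 0; i < found; i++)
        if (arb_rel_accuracy_bits(zeros + i) >= prec)
            ok++;

    return ok;
}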

p15-git-acc commented 3 years ago

Is this a viable path forward to close the issue at hand? I'd be happy to do any further test runs at these heights!

I've submitted pull requests for multi-threading (https://github.com/fredrik-johansson/arb/pull/327) and further fine-tuning (https://github.com/fredrik-johansson/arb/pull/328). I suppose it would be helpful if you'd like to test out the tuning one, because I only checked it at a few heights up to at most n=1e18. It's artificially limited to 1e22 although I'd guess it could go somewhat higher without further tuning if you feel ambitious. The tuning repo is also updated.

rudolph-git-acc commented 3 years ago

With the parameters from yesterday's version of the tuning app, I generated zeros10e23.txt in only 18.25 hours :-)

The run times for the tuning tool have now been dramatically improved. Is it actually your intent to integrate this tool into acb_dirichlet_platt_local_hardy_z_zeros to automatically establish the parameters at any height? The extra overhead of running the tool as the first step is relatively small (and improves with height), and you could get rid of all the interpolation polynomials. Also, this way the spec of the function could be extended to 1x10^25 (my new ambition). Maybe a mixed solution is also possible: the current interpolation up to 10^16 and an incorporated tuning tool for all values above that?

Just tried the latest version of the tuning tool and it works great! It ran well up to 10^16. Above that I got the message "1*10^16 interpolation error has only -13 relative accuracy bits", and as expected, this was solved after increasing the default prec. I observed that from 10^17-10^21 the doubling of A & B never 'kicked in'. So, just to be sure: is the idea to call the function _get_params_A16_n1e17 for these heights and recompile?

I'll do more test runs (after removing the line fmpz_add_ui(T, T, B/4) from acb_dirichlet_platt_local_hardy_z_zeros).

p15-git-acc commented 3 years ago

Yeah, sorry, when I wrote "I've submitted pull requests for multi-threading (#327) and further fine-tuning (#328). I suppose it would be helpful if you'd like to test out the tuning one" I meant testing the second of those two pull requests, not the tuning script itself. Thanks for testing the tuning script though. I can't reproduce the low accuracy interpolation error you found, and I don't understand this because the code should be deterministic. This is possibly a bug, and maybe an important one.

I observed that from 10^17-10^21, the doubling of A & B never 'kicked in'. So, just to be sure: is the idea to 'call' function _get_params_A16_n1e17 for these heights and recompile?

Just so I'm clear what happened, you modified the code and recompiled to extend the iteration range to include 10^17-10^21 but you didn't modify the code to change which _get_params function was called? In that case it did what I would have expected. The script only optimizes a subset of parameters conditional on fixed A, B, Ns_max, so it never changes those fixed parameters during the run.

rudolph-git-acc commented 3 years ago

To be precise: there are 3 ranges in the tuning code: _get_params_A8_n1e4, _get_params_A16_n1e17, _get_params_A16_n1e22. In the int main() iteration section I selected one of those for the height(s) I was planning to use and then recompiled. The only parameter I changed in the range section is prec (to avoid the error message).

I was actually just starting to test 10^17-10^19 and observed that prec already has to be set to 512 to avoid the error message "1*10^17 interpolation error has only -5 relative accuracy bits" (when prec=256). This is strange, since I have done successful runs with just prec=256 at these heights.

p15-git-acc commented 3 years ago

I opened issue https://github.com/p15-git-acc/tune-params/issues/2 on the parameter tuning script repo regarding the low relative accuracy interpolation error that you are finding.

rudolph-git-acc commented 3 years ago

I think there is a possible easy explanation for why it doesn't work at my end. The interpolation error section calls:

acb_dirichlet_platt_bound_C3(

and the latter uses the updated incomplete gamma function, which hasn't been installed yet on my machine. I will check!

EDIT: I have re-installed ARB from the GitHub checkout and that fixes the issue. Apologies for the confusion caused.

p15-git-acc commented 3 years ago

Is it actually your intent to integrate this tool into acb_dirichlet_platt_local_hardy_z_zeros to automatically establish the parameters at any height?

I don't think this should be necessary. I've updated https://github.com/fredrik-johansson/arb/pull/328 with heuristics up to n=1e37 which should be enough for anybody, as they say. The A=8 heuristics started failing hard near 1e22 where perhaps not coincidentally the local density of zeta zeros begins to exceed 8 (A is a sampling rate and the zeta zeros density might be analogous to twice a function 'frequency' and Nyquist theorems start applying). If that was the main problem then A=16 should work beyond n=1e37. And if finding zeros near n=1e23 takes a day then n=1e37 should take thousands of years, if the iteration count J is like the square root of the height.
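
As a rough sanity check of that analogy: the average density of zeros at height t is about log(t/(2*pi)) / (2*pi) per unit length. At the height t0 ~= 1.37*10^21 used for n=10^22 above, this gives about 46.8/(2*pi) ~= 7.5 zeros per unit length, so a sampling rate of A=8 is indeed right at its limit there.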

p15-git-acc commented 3 years ago

With the parameters from yesterday's version of the tuning app, I generated zeros10e23.txt in only 18.25 hours :-)

If you are looking to make more comparisons, here are some more tables of zeros I found on the internet https://people.math.osu.edu/hiary.1/amortized.html

rudolph-git-acc commented 3 years ago

Many thanks for the link to the zero tables. I had not seen these results before, and they go up pretty high! The zeros are listed by T and not by n, so using the heuristic that the n-th zero resides between Gram points n-2 and n-1 will probably do the conversion.
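
If needed, the Gram points themselves can be enclosed directly in arb; here is a small sketch for the zeta case (where the group and character arguments are NULL), assuming acb_dirichlet_gram_point behaves as documented, with print_gram_point just an ad hoc helper:

#include "acb_dirichlet.h"

/* print an enclosure of the n-th Gram point for zeta (G = chi = NULL) */
void print_gram_point(const fmpz_t n, slong prec)
{
    arb_t g;

    arb_init(g);
    acb_dirichlet_gram_point(g, n, NULL, NULL, prec);
    arb_printd(g, 20);
    flint_printf("\n");
    arb_clear(g);
}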

I have included your latest tuning polynomials in acb_dirichlet_platt_local_hardy_z_zeros (it cleans the code up nicely) and made various runs with it from 10^15 - 10^21 (10^22 runs overnight). Got good outcomes for 10^18-10^21. With 'good' I mean: 10,000 zeros generated in a single batch at an accuracy of 25-50 digits (prec=256) or 70-90 digits (prec=400). As before 10^15 - 10^17 did not deliver the 10,000 zeros in one go, but these heights could be done in batches of 1,000 (increasing prec doesn't help here). When I let the A=16, B=8192 heuristics kick in at these heights, I actually do get batches of 10,000 zeros in one go and superior accuracy (which is also sensitive to prec). Or is the increased Ns_max playing a role here?

As you said in an earlier post, there is a trade-off to be made between many user requirements, and maybe this is just the best optimum. However, it does feel a bit counterintuitive that the higher one goes, the more zeros a run yields and the more their accuracy improves. To me it would feel a bit more natural if the A=16, B=8192 heuristics were already used at 10^15 (also since changing prec currently doesn't help at these heights). But hey, in the end it is a compromise between the many user interests.

p15-git-acc commented 3 years ago

As before 10^15 - 10^17 did not deliver the 10,000 zeros in one go, but these heights could be done in batches of 1,000 (increasing prec doesn't help here). When I let the A=16, B=8192 heuristics kick in at these heights, I actually do get batches of 10,000 zeros in one go

How are cpu and memory usage affected, especially cpu time per isolated zero with and without letting the A=16 heuristics kick in?

p15-git-acc commented 3 years ago

To me it would feel a bit more natural if the A=16, B=8192 heuristics were already used at 10^15

OK, I've changed this in the PR. Could you run the code in the pull request at a few different heights to make sure it behaves as expected?

fredrik-johansson commented 3 years ago

Sounds like great progress :-)

I merged the PR as it was. Hope it's not a hassle to open a new PR if you're going to make further changes.

fredrik-johansson commented 3 years ago

By the way, I think these are the current record height computations: https://people.maths.bris.ac.uk/~jb12407/data/zeta/

rudolph-git-acc commented 3 years ago

Tested the (now committed) version. It runs like clockwork and behaves as expected!

Below are a few insights from the tests and the potential need for a final tweak.

1) Increasing prec has a positive impact on both the number and the accuracy of the zeros found. Accuracy does decrease with height for the same A,B heuristics (data from a run on an iMac). At lower heights, generating batches of 10,000 zeros in one go looks impractical; however, running multiple smaller batches is not an issue at these heights.

height  prec  #req    #found  accuracy first zero    run time (s)
10^4    128   10000   1010    1.0711e-24 +/- 0        42.321
10^4    256   10000   1707    3.09412e-63 +/- 0       66.618
10^4    512   10000   2263    5.56475e-108 +/- 0     104.484
10^5    128   10000   1184    3.20818e-24 +/- 0       42.125
10^5    256   10000   1998    9.23229e-63 +/- 0       68.570
10^5    512   10000   2644    1.5495e-107 +/- 0      112.124
10^6    128   10000   1475    4.30948e-23 +/- 0       43.853
10^6    256   10000   2566    1.23939e-61 +/- 0       73.703
10^6    512   10000   3384    7.68958e-105 +/- 0     127.385
10^7    128   10000   1867    1.34837e-21 +/- 0       47.797
10^7    256   10000   3299    3.83394e-60 +/- 0       83.551
10^7    512   10000   4034    5.89843e-90 +/- 0      134.506
10^8    128   10000   2158    2.27152e-21 +/- 0       48.064
10^8    256   10000   3863    6.41654e-60 +/- 0       86.643
10^8    512   10000   4663    1.523e-84 +/- 0        141.175
10^9    128   10000   2464    6.6401e-19 +/- 0        50.144
10^9    256   10000   4511    3.80763e-57 +/- 0       96.452
10^9    512   10000   5216    2.826e-77 +/- 0        152.535
10^10   128   10000   2731    5.88069e-19 +/- 0       51.205
10^10   256   10000   5181    1.66538e-57 +/- 0      101.016
10^10   512   10000   5842    2.35676e-72 +/- 0      155.448

2) Multithreading benefits only kick in at 10^16 and then quickly increase (data from a run on the Threadripper with 64 threads):

height  prec    #req   #found   accuracy first zero     accuracy last zero      run time
10^4    256 1000    1000    3.09412e-63 +/- 0   2.87828e-41 +/- 0   cpu/wall(s): 84.883 84.883
10^5    256 1000    1000    9.23229e-63 +/- 0   9.9721e-47 +/- 0    cpu/wall(s): 85.575 85.577
10^6    256 1000    1000    1.23939e-61 +/- 0   1.41788e-52 +/- 0   cpu/wall(s): 84.625 84.626
10^7    256 1000    1000    3.83394e-60 +/- 0   3.54926e-54 +/- 0   cpu/wall(s): 84.824 84.824
10^8    256 1000    1000    6.41654e-60 +/- 0   4.81934e-56 +/- 0   cpu/wall(s): 82.162 82.163
10^9    256 1000    1000    3.80763e-57 +/- 0   8.84425e-56 +/- 0   cpu/wall(s): 82.214 82.214
10^10   256 1000    1000    1.66538e-57 +/- 0   2.18708e-55 +/- 0   cpu/wall(s): 78.252 78.253
10^11   256 1000    1000    7.71212e-57 +/- 0   6.30674e-55 +/- 0   cpu/wall(s): 76.565 76.565
10^12   256 1000    1000    1.02337e-56 +/- 0   1.12065e-53 +/- 0   cpu/wall(s): 75.465 75.466
10^13   256 1000    1000    1.29244e-55 +/- 0   1.01973e-53 +/- 0   cpu/wall(s): 80.97 80.97
10^14   256 1000    1000    5.91524e-48 +/- 0   2.64135e-47 +/- 0   cpu/wall(s): 100.998 100.999
10^15   256 10000   10000   5.61808e-51 +/- 0   3.43314e-15 +/- 0   cpu/wall(s): 902.609 902.618 (!)
10^16   256 10000   10000   3.05387e-50 +/- 0   2.43336e-19 +/- 0   cpu/wall(s): 2237.47 779.65 <-
10^17   256 10000   10000   7.57681e-50 +/- 0   3.36667e-23 +/- 0   cpu/wall(s): 4879.55 804.00 <-
10^18   256 10000   10000   7.51644e-49 +/- 0   2.66984e-26 +/- 0   cpu/wall(s): 13769.3 921.303 <-
10^19   256 10000   10000   1.25878e-47 +/- 0   1.37701e-28 +/- 0   cpu/wall(s): 41769.4 1357.76 <-
10^20   256 10000   10000   2.21536e-46 +/- 0   1.32904e-29 +/- 0   cpu/wall(s): 124883 2710.03 <-

3) The data tables above and below indicate that the A=16 heuristics appear to kick in just one order of magnitude too soon (my bad). The run times now make the expected jump at 10^15, but the user can't yet get any multi-threading benefits to compensate for this. Therefore, as a final tweak, it looks more 'fair' to put the starting point of the A=16 heuristics at 10^16 instead (at the small cost of having to run 10,000 zeros at 10^15 in multiple, probably 4, batches). The data table below illustrates how high the user can get with the 'cheapest' prec=128 and 1,000 zeros requested (data from a run on the Threadripper with 64 threads):

height  prec    #req   #found   accuracy first zero     accuracy last zero      run time
10^4    128 1000    1000    1.0711e-24 +/- 0    0.00994576 +/- 0    cpu/wall(s): 52.679 52.68
10^5    128 1000    1000    3.20818e-24 +/- 0   1.65243e-08 +/- 0   cpu/wall(s): 52.807 52.807
10^6    128 1000    1000    4.30948e-23 +/- 0   4.93012e-14 +/- 0   cpu/wall(s): 52.444 52.445
10^7    128 1000    1000    1.34837e-21 +/- 0   1.24825e-15 +/- 0   cpu/wall(s): 52.388 52.389
10^8    128 1000    1000    2.27152e-21 +/- 0   1.7061e-17 +/- 0    cpu/wall(s): 51.666 51.667
10^9    128 1000    1000    6.6401e-19 +/- 0    2.36305e-17 +/- 0   cpu/wall(s): 50.74 50.74
10^10   128 1000    1000    5.88069e-19 +/- 0   7.72287e-17 +/- 0   cpu/wall(s): 48.933 48.935
10^11   128 1000    1000    2.72337e-18 +/- 0   2.22708e-16 +/- 0   cpu/wall(s): 47.938 47.938
10^12   128 1000    1000    3.61083e-18 +/- 0   3.95404e-15 +/- 0   cpu/wall(s): 47.768 47.769
10^13   128 1000    1000    1.51708e-17 +/- 0   1.19724e-15 +/- 0   cpu/wall(s): 50.961 50.962
10^14   128 1000    1000    1.32599e-14 +/- 0   5.92302e-14 +/- 0   cpu/wall(s): 65.655 65.656
10^15   128 1000    1000    1.9963e-12 +/- 0    7.29072e-12 +/- 0   cpu/wall(s): 392.446 392.453 (!)
10^16   128 1000    1000    1.08385e-11 +/- 0   2.62213e-10 +/- 0   cpu/wall(s): 1554.57 324.852
10^17   128 1000    1000    2.73024e-11 +/- 0   2.19685e-10 +/- 0   cpu/wall(s): 3955.49 373.303
10^18   128 1000    1000    2.68387e-10 +/- 0   2.33356e-09 +/- 0   cpu/wall(s): 11046.6 480.394
10^19   128 1000    1000    4.40412e-09 +/- 0   4.82805e-08 +/- 0   cpu/wall(s): 33754.5 842.208
10^20   128 1000    1000    7.78998e-08 +/- 0   7.32881e-08 +/- 0   cpu/wall(s): 98039.9 1900.06
p15-git-acc commented 3 years ago

Thanks for checking the merged code and for compiling those tables! I've opened issue https://github.com/fredrik-johansson/arb/issues/335 regarding the multi-threading threshold.