PrincetonUniversity / SPEC

The Stepped-Pressure Equilibrium Code, an advanced MRxMHD equilibrium solver.
https://princetonuniversity.github.io/SPEC/
GNU General Public License v3.0

Speedup of ma00aa #14

Closed · lazersos closed this issue 7 years ago

lazersos commented 7 years ago

@SRHudson has made significant improvements to the speed of ma00aa. Please pull and test to see whether it works for your setup. My initial test showed that the code only takes ~15 s in ma00aa for the l2stell case.

jloizu commented 7 years ago

That sounds very promising and exciting! I will perform thorough tests today, including profiling. I want to measure the speedup (compared to before) and the proportion of CPU time consumed by each subroutine (compared to before), and I also want to compare the results of running the l2stell test case with the "slow" and "fast" versions, in order to see the differences.

jloizu commented 7 years ago

My first test on the IPP-cluster (nproc=2) for the l2stell case took:

- 18 s with the "fast" version (ma00aa branch)
- 29 s with the "slow" version (master branch)

which is quite nice! Hopefully this speedup holds up with increasing Fourier resolution (I will test this).

Unfortunately the ma00aa branch does not have the profiling options. I would suggest that we first merge the Makefile of master into the Makefile of the branch; or I can commit a change to the Makefile of the ma00aa branch so that it is a replica of the one in master. Then I can run profiling on the ma00aa branch.

ONLY after these tests should we finally merge this branch into master.

Does that sound good?

jloizu commented 7 years ago

Interesting results from profiling on the l2stell test case: the percentage of time taken by the routine ma00aa.h has been reduced by a factor of 2, thanks to the modifications made by Stuart (and Sam)!

Below is the detailed list of CPU-time percentages for each routine, for the "slow" (master) and "fast" (ma00aa) versions. It now seems that further optimization of ma00aa.h and optimization of matrix.h would have comparable speedup effects. Before we merge, let me do some more tests.

"SLOW" branch % time 74.18 ma00aa 10.42 matrix 3.12 rzaxis 2.85 mp00ac 2.05 coords 1.96 dforce 1.69 tr00ab 1.25 volume 1.07 tfft 0.62 metrix 0.36 invfft 0.18 preset 0.09 packab 0.09 lforce 0.09 brcast

"FAST" branch % time 35.98 ma00aa 29.78 matrix 7.20 rzaxis 5.71 mp00ac 5.46 dforce 3.97 coords 3.47 volume 2.48 tfft 2.48 metrix 1.74 tr00ab 1.24 invfft 0.50 bfield

jloizu commented 7 years ago

I have performed a comparison between the outputs of these two runs (corresponding to the master and ma00aa branches). The comparison simply calculates

D = max |Rmn1 - Rmn2| over all (m,n) and over all interfaces

namely the maximum difference in the interface geometry, as stored in the HDF5 output file.
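For concreteness, here is a minimal Python sketch of such a comparison. The dataset name (`output/Rbc`), the array layout, and the file names are my assumptions for illustration, not necessarily the actual SPEC output structure:

```python
import h5py
import numpy as np

def max_geometry_difference(file1, file2, dataset="output/Rbc"):
    """Return D = max |Rmn1 - Rmn2| over all (m, n) and all interfaces."""
    with h5py.File(file1, "r") as f1, h5py.File(file2, "r") as f2:
        rmn1 = f1[dataset][...]  # assumed shape: (interfaces, modes)
        rmn2 = f2[dataset][...]
    return np.max(np.abs(rmn1 - rmn2))

# hypothetical output files from the two runs
D = max_geometry_difference("l2stell_master.h5", "l2stell_ma00aa.h5")
print(f"D = {D:.2e}")
```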

I got D = 5.9e-14, which is very good, but not machine precision. Since the two sources are presumably doing the same calculations (in the sense that, if we were to write down the algorithm of the code, it would be the same), I was expecting D < 1e-15.

Sam argues that this may be due to compiler optimization, which can reorder the floating-point operations and memory accesses in different ways depending on how the algorithm is coded. He also notes that the presence of square roots may increase the rounding error.

If that is the case, how do we define the value of D at which we are satisfied (confident that the results are "the same")? For this type of modification, which leaves the algorithm unchanged, I would have expected to set D = 1e-15 as a confidence threshold.

Please let me know your thoughts on this.

jloizu commented 7 years ago

I just learned that even the simple task of adding a sequence of finite-precision floating-point numbers can generate a numerical (relative) error that grows like n, like sqrt(n), or not at all, depending on the summation algorithm used (for naive recursive summation the worst-case error grows like n and the expected error like sqrt(n); compensated summation makes it essentially independent of n).

An interesting algorithm that avoids this error amplification is the so-called Kahan (compensated) summation algorithm; see https://en.wikipedia.org/wiki/Kahan_summation_algorithm
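As a quick generic illustration (plain Python, not SPEC code), compensated summation keeps the error at the level of a single rounding, regardless of the number of terms:

```python
import math

def kahan_sum(values):
    """Compensated (Kahan) summation of a sequence of floats."""
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in values:
        y = x - c            # apply the previous correction
        t = total + y        # low-order bits of y may be lost here...
        c = (t - total) - y  # ...so recover them into c
        total = t
    return total

values = [0.1] * 1_000_000
naive = sum(values)        # error grows with the number of terms
kahan = kahan_sum(values)  # error stays at one-rounding level
exact = math.fsum(values)  # correctly rounded reference sum
print(naive - exact, kahan - exact)
```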

However, even when applying such a compensated summation algorithm, some compilation strategies (e.g., aggressive optimizations that reassociate floating-point operations) can algebraically cancel the compensation term and suppress this advantageous error scaling.

Therefore, I think I understand why two differently coded (and differently compiled) implementations of the same algorithm can produce slightly different results.

jloizu commented 7 years ago

I have an idea for how to decide whether two results are "the same" or not.

Observation 1: the two branches I compared gave an absolute maximum difference in geometry Dabs = 1e-14 and a relative maximum difference in geometry Drel = 1e-9; therefore the two outputs are not "exactly" the same.

Observation 2: if, however, I take the output geometry of one run and use it as the initial guess for the other run, the code considers the system to be in equilibrium, with a force imbalance still f ~ 1e-15.

Possible strategy: consider that two outputs are "the same" if force-balance is preserved.

Possible way of quantifying this without additional runs: the force is essentially proportional to [[B^2]], and B^2 is roughly proportional to the metric g, which is roughly proportional to R^2. Thus

df = (df/dR) dR ~ R dR

which means that if we find that R*dR < 1e-16, then the difference in geometry, dR, is irrelevant since it preserves force-balance. For the example I examined, that gives

R dR = 1e-5 x 1e-14 = 1e-19 < 1e-16

(the estimate R ~ 1e-5 follows from Dabs/Drel = 1e-14/1e-9 at the location of the maximum difference, with dR = Dabs = 1e-14)

which suggests that both equilibria are "the same equilibrium".
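A minimal sketch of this acceptance test in Python (the function name, array layout, and tolerance handling are illustrative assumptions, not existing SPEC tooling):

```python
import numpy as np

def same_equilibrium(rmn1, rmn2, tol=1.0e-16):
    """Accept two interface geometries as the same equilibrium if the
    estimated force perturbation df ~ R * dR stays below the tolerance."""
    dr = np.abs(rmn1 - rmn2)                # absolute geometry difference dR
    return np.max(np.abs(rmn1) * dr) < tol  # max |R * dR| < tol ?

# For the numbers quoted above: R ~ 1e-5 where the maximum difference
# occurs and dR ~ 1e-14, so max(R * dR) ~ 1e-19 < 1e-16 -> accepted.
```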

Let me know if that makes sense to you.