PrincetonUniversity / SPEC

The Stepped-Pressure Equilibrium Code, an advanced MRxMHD equilibrium solver.
https://princetonuniversity.github.io/SPEC/
GNU General Public License v3.0

Profile SPEC performance #55

Open zhisong opened 5 years ago

zhisong commented 5 years ago

SPEC profiling

I ran Arm Forge, a tracing/profiling tool for parallel computing, on both xspec and dspec. The test case is an LHD high-beta case with 5 volumes and resolutions M=10, N=8, Lrad=6. The test is based on master branch commit 007be72, the newest as of 6/12/2018. The input and output files and the overall results are attached to this thread as .txt and .pdf files.

A breakdown of the time spent in each subroutine. Green is computation, blue is MPI (mostly waiting for other CPUs to finish). The most time-consuming subroutine is ma00aa (as expected), contributing 77%. (screenshot attached)

Going into ma00aa, the most time-consuming parts are the radial integrals. (screenshot attached)

And within each clause, this is the breakdown. (screenshot attached)

It seems a lot of time is spent accessing memory.

xspec - Performance Report.pdf
dspec - Performance Report.pdf
finite_beta_V5_vmec1.sp.txt
ddt_output.txt

jloizu commented 5 years ago

That's really great! That sets the first milestone towards SPEC speed optimization. Looks like what Stuart suggested, namely optimizing the volume integrals by e.g. exploiting symmetries, is the way to go.

lazersos commented 5 years ago

@zhisong This is great work, and it confirms my suspicion about the code from day one, specifically that the rank-4 arrays are inhibiting vectorization. @jloizu, while reducing the total number of floating-point operations will help, please note that in this analysis the issue is memory access. I think one key to speeding up the code is to collapse the rank-4 arrays down to rank-2 arrays so the inner loops can vectorize.
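For illustration, here is a minimal, hypothetical sketch (not SPEC's actual arrays or loop structure) of the point: contracting over the last index of a rank-4 array forces a large stride in the hot loop, while repacking the same data into a rank-2 array with the contracted index first gives unit-stride access that the compiler can vectorize.

```fortran
! Minimal sketch (not SPEC's actual data structures): a quadrature-like
! contraction over the LAST index of a rank-4 array versus the same data
! stored rank-2 with the contracted index FIRST.
program rank4_vs_rank2
  implicit none
  integer, parameter :: ni = 8, nj = 8, nk = 32, nq = 16
  real(8) :: a4(ni, nj, nk, nq)     ! contracted index q is the last dimension
  real(8) :: a2(nq, ni*nj*nk)       ! same data with q as the first dimension
  real(8) :: w(nq), b(ni*nj*nk), s
  integer :: i, j, k, q, m

  call random_number(a4)
  call random_number(w)
  a2 = transpose(reshape(a4, [ni*nj*nk, nq]))   ! repack so q is contiguous

  ! Rank-4 layout: the inner loop over q jumps by ni*nj*nk elements in memory.
  do k = 1, nk
    do j = 1, nj
      do i = 1, ni
        s = 0.0d0
        do q = 1, nq
          s = s + a4(i, j, k, q) * w(q)
        end do
        b(i + ni*(j-1) + ni*nj*(k-1)) = s
      end do
    end do
  end do
  print *, 'rank-4 checksum', sum(b)

  ! Rank-2 layout: the inner loop over q is stride-1 and vectorizes cleanly.
  do m = 1, ni*nj*nk
    s = 0.0d0
    do q = 1, nq
      s = s + a2(q, m) * w(q)
    end do
    b(m) = s
  end do
  print *, 'rank-2 checksum', sum(b)
end program rank4_vs_rank2
```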

abaillod commented 5 years ago

Hello there,

I have been looking into SPEC performance as well, in the framework of a parallelization course at EPFL. I used Intel's profiling tool (VTune Amplifier) to check which routines were the most time consuming and I got the same results as you: about 50% of the time is spent in ma00aa.h. (The test case is an axisymmetric torus with M=4, Lrad=16, 32 volumes, and Lfindzero=1.)

I tried to parallelize the volume integral loops with OpenMP. I was expecting substantial speedup since the loops are independent of each other... However, I had to put !$OMP CRITICAL around each call to the FFTW library to ensure each call was executed by one thread at a time, since apparently the FFTW routines are not thread safe. I think it is possible to make them thread safe (see the FFTW documentation, http://www.fftw.org/fftw3_doc/Thread-safety.html), but I couldn't make it work so far. Any help on this would be appreciated!
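For reference, here is a schematic of the pattern I ended up with (hypothetical names, not the actual ma00aa.h source): the quadrature loop is threaded, but because the single FFTW plan and its arrays are shared, the whole transform step sits inside a CRITICAL section and the threads queue there.

```fortran
! Schematic only: names and loop structure are illustrative, not the real code.
! The single FFTW plan and its in/out arrays are shared, so the whole
! fill-transform-copy step has to sit inside a CRITICAL section.
subroutine radial_integrals_omp(nquad, n, plan, fin, fout)
  use, intrinsic :: iso_c_binding
  implicit none
  include 'fftw3.f03'
  integer, intent(in) :: nquad, n
  type(C_PTR), intent(in) :: plan                              ! plan created once, outside
  complex(C_DOUBLE_COMPLEX), intent(inout) :: fin(n), fout(n)  ! the plan's own arrays
  complex(C_DOUBLE_COMPLEX) :: local(n)
  integer :: jquad

  !$omp parallel do private(jquad, local)
  do jquad = 1, nquad                                  ! independent quadrature points
     !$omp critical
     fin   = cmplx(jquad, 0, kind=C_DOUBLE_COMPLEX)    ! stand-in for building the integrand
     call fftw_execute_dft(plan, fin, fout)            ! only one thread at a time
     local = fout                                      ! copy result out of the shared buffer
     !$omp end critical
     ! ... accumulate "local" into the volume-integral arrays ...
  end do
  !$omp end parallel do
end subroutine radial_integrals_omp
```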

When I run VTune Amplifier on the OpenMP version of ma00aa.h (with 4 threads and 1 MPI task), I get the result shown in the attached screenshot; apparently a lot of time is lost there.

At the end of this post I attach a plot of the strong and weak scaling I did with the OpenMP version of ma00aa.h. You can see that there is almost no speedup; my guess is that the threads are queuing to use the FFTW library.

StrWeakScaling_Hybrid (strong/weak scaling plot attached)

Please tell me what you think about this and whether you think it is worth pursuing!

zhisong commented 5 years ago

@abaillod Thank you for your impressive progress.

After some searching, it seems to me that !$OMP CRITICAL is what makes the threads queue to use the FFTW library. The website you quoted does provide a solution for multi-threaded programming with FFTW. Since the FFTW plan is the same for all threads, you may need to follow the suggestion in the link about the new-array execute functions.
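For what it's worth, here is a minimal, self-contained sketch of that pattern (assumed sizes and names, not SPEC code): create the plan once, serially, then let each thread call the new-array execute function on its own private arrays, with no CRITICAL section.

```fortran
! Minimal sketch of FFTW's new-array execute pattern with OpenMP
! (assumed sizes/names; not the actual SPEC code).
program fftw_newarray_omp
  use, intrinsic :: iso_c_binding
  implicit none
  include 'fftw3.f03'
  integer, parameter :: n = 256, nloop = 64
  type(C_PTR) :: plan
  complex(C_DOUBLE_COMPLEX) :: pin(n), pout(n)              ! arrays used only for planning
  complex(C_DOUBLE_COMPLEX), allocatable :: tin(:), tout(:) ! per-thread work arrays
  integer :: j

  ! Plan creation is NOT thread safe: do it once, serially.
  ! FFTW_UNALIGNED avoids alignment restrictions when executing on new arrays.
  plan = fftw_plan_dft_1d(n, pin, pout, FFTW_FORWARD, ior(FFTW_ESTIMATE, FFTW_UNALIGNED))

  !$omp parallel private(j, tin, tout)
  allocate(tin(n), tout(n))
  !$omp do
  do j = 1, nloop
     tin = cmplx(j, 0, kind=C_DOUBLE_COMPLEX)        ! stand-in for the integrand
     call fftw_execute_dft(plan, tin, tout)          ! new-array execute: no CRITICAL needed
  end do
  !$omp end do
  deallocate(tin, tout)
  !$omp end parallel

  call fftw_destroy_plan(plan)
end program fftw_newarray_omp
```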

Other than OpenMP, I have another suggestion to speed up the computation by a factor of Nquad (the number of Gauss quadrature points), but we may need to completely rewrite the structure of metrix, coords, ma00aa and so on. We can discuss this "big" project when Stuart visits ANU.

zhisong commented 5 years ago

@abaillod I found another possible source of slowdown: the LAPACK subroutines that invert the matrices. If you are using the Intel MKL libraries, you should be able to use the OpenMP version of LAPACK by setting MKL_NUM_THREADS to a value greater than 1 to speed things up.
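A hedged sketch of the in-code alternative (assuming SPEC is linked against MKL; equivalently, one can just export MKL_NUM_THREADS in the job script before launching xspec):

```fortran
! Assumed usage sketch: requires linking against Intel MKL and its Fortran
! service module; equivalently, export MKL_NUM_THREADS before launching xspec.
program mkl_threads_demo
  use mkl_service                       ! MKL service routines (mkl_set_num_threads, ...)
  implicit none
  call mkl_set_num_threads(4)           ! let MKL's LAPACK/BLAS use up to 4 threads
  print *, 'MKL max threads:', mkl_get_max_threads()
end program mkl_threads_demo
```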

For your problem concerning FFTW, you can put the !$OMP PARALLEL directive after WCALL( ma00aa, metrix, ( lvol, lss ) ), so that you are effectively parallelizing with respect to mn instead of the radial coordinate.
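To make that concrete, here is a hypothetical sketch (illustrative names and a placeholder metric routine, not the actual ma00aa.h loops): the FFT-based metric evaluation stays serial for each radial quadrature point, and only the independent loop over Fourier harmonics (mn) is threaded, so no FFTW call appears inside the threaded region.

```fortran
! Hypothetical sketch: the metric/FFT step stays serial for each radial
! quadrature point, and the independent loop over Fourier harmonics (mn)
! is threaded instead. Names and the placeholder routine are illustrative.
program parallel_over_mn
  implicit none
  integer, parameter :: nquad = 16, mn = 128
  real(8) :: metric(mn), integral(mn)
  integer :: jquad, ii

  integral = 0.0d0
  do jquad = 1, nquad                        ! radial quadrature points: serial
     call compute_metric(jquad, metric)      ! stands in for WCALL( ma00aa, metrix, ... )
     !$omp parallel do private(ii)
     do ii = 1, mn                           ! harmonics are independent: safe to thread
        integral(ii) = integral(ii) + metric(ii)**2   ! placeholder accumulation
     end do
     !$omp end parallel do
  end do
  print *, 'checksum', sum(integral)

contains

  subroutine compute_metric(jquad, metric)   ! placeholder for the FFT-based metric
    integer, intent(in)  :: jquad
    real(8), intent(out) :: metric(:)
    integer :: ii
    do ii = 1, size(metric)
       metric(ii) = sin(dble(ii * jquad))
    end do
  end subroutine compute_metric

end program parallel_over_mn
```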