etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

threading #83

Closed — urbach closed this issue 11 years ago

urbach commented 12 years ago

we need to develop a threaded version of the code, in particular for the BG/Q

kostrzewa commented 12 years ago

We should have a longish Skype conference / f2f meeting to plan this, because we will need to change a lot of stuff and we will have to make some decisions regarding the placement of parallel sections etc.

urbach commented 12 years ago

yes, I agree, this topic needs discussion. But we have to push it a bit...

kostrzewa commented 12 years ago

Hmm, at least without half spinors or gauge copies it took me all of 5 minutes to write a working threaded version of Hopping_Matrix (3 lines of code...) And without SSE of course...
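For illustration, a minimal sketch of the kind of three-line change described here, assuming a plain site loop without halfspinor fields, gauge copies or SSE; the type, the VOLUME macro and the loop body are placeholders, not the actual tmLQCD kernel:

```c
#include <omp.h>

#define VOLUME (8 * 8 * 8 * 16)   /* hypothetical local lattice volume */

/* stand-in for one lattice site of a spinor field */
typedef struct { double re, im; } site;

/* The only change to a naive Hopping_Matrix-style routine is the pragma in
 * front of the site loop: every thread writes to a disjoint range of the
 * output field, so no locking is needed inside the loop body. */
void hopping_matrix_sketch(site *out, const site *in) {
  int ix;
#pragma omp parallel for private(ix)
  for (ix = 0; ix < VOLUME; ix++) {
    /* placeholder body: the real kernel gathers the eight neighbours of ix,
     * multiplies by the gauge links and applies the (1 +- gamma_mu) projectors */
    out[ix].re = in[ix].re;
    out[ix].im = in[ix].im;
  }
}
```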

kostrzewa commented 12 years ago

> Hmm, at least without half spinors or gauge copies it took me all of 5 minutes to write a working threaded version of Hopping_Matrix (3 lines of code...) And without SSE of course...

Although it doesn't work in a mixed mpi/openmp setup... I wonder why though!

kostrzewa commented 12 years ago

> Although it doesn't work in a mixed mpi/openmp setup... I wonder why though!

Actually, it was just very slow because of the massive overhead (2 processes, 4 threads/process)

kostrzewa commented 12 years ago

c2a8345ca875f17483c7edf6381c7745888f798b is the test implementation; --disable-halfspinor, --disable-sse* and compiling with gcc are necessary (or change -fopenmp to -openmp in configure.in to compile with icc).

kostrzewa commented 12 years ago

I must be doing something wrong because the performance is truly abysmal. Or maybe the overhead of creating and destroying threads this often is simply that large. Alternatively, I could be causing a lot of cache misses because this implementation happily writes to the same output spinor, albeit at different locations in memory. I don't know how OpenMP handles this internally and what sort of impact on performance it would have. It might make sense to implement a reduction and work on local data only, except for the boundaries.

kostrzewa commented 12 years ago

Interestingly this seems to be fairly machine-dependent. Testing 10 iterations of sample-hmc0, 8^3 x 16:

I see why it appeared to be so slow this afternoon. When run in hybrid mode (2 MPI processes with 2 threads each), it's about a factor of 10 slower. (Maybe this is only true when the N MPI processes share the same memory controller; it could differ when the MPI processes are on physically different machines.)

Out of interest, to get some sort of preliminary picture, I will check these numbers on the Zeuthen workgroup server, which is a machine with 8 cores and 16 SMT threads.

kostrzewa commented 12 years ago

I would cautiously say that I have good news to report regarding a future OpenMP integration. My test executable built from the current omp branch https://github.com/kostrzewa/tmLQCD/compare/omp, with parallelisation added for a number of functions, reaches performance parity with the MPI build on my machine. I wasn't able to test larger parallelisations because the workgroup servers are under heavy load. My guess would be, though, that OpenMP should outperform MPI at some point.

Of course, my approach was the most naive possible, and adding OpenMP to the SSE / BG code will be more difficult, mainly for indexing reasons, but I think we have a good skeleton to build on. I still need to answer the question regarding hybrid performance... In addition, I have not yet made this optional.

Further, one could imagine doing something with update_gauge and update_momenta too, although the "max" reduction will be problematic because OpenMP does not support this natively.

kostrzewa commented 12 years ago

More good news! On my laptop, which is a dual-core machine with 4 hardware SMT threads, I roughly get performance parity with hybrid MPI / OpenMP. The trick is not to exceed the number of hardware SMT threads, because otherwise the context switching slows things down by a factor of 10 or so.

It remains to be seen how this scales on, say, a 12 core machine capable of running 24 threads. Or, for that matter, what sort of impact it will have on the BG/Q.

kostrzewa commented 12 years ago

And another, final update for today:

As the WGS, an 8-core, 16-SMT machine, has freed up, I was able to run some tests there. Pure MPI with 8 processes leads the pack, but 4 MPI x 2 OpenMP is only 2-3% slower (4 x 4 is 5-7% slower). As I chose too small a lattice to begin with, I was not able to test 16 MPI processes. Of course, these tests completely disregard true communication overheads, as they are run on one machine. Another caveat is that so far I've only tested hmc0.

kostrzewa commented 12 years ago

> Further, one could imagine doing something with update_gauge and update_momenta too, although the "max" reduction will be problematic because OpenMP does not support this natively

Indeed, this would have to be done manually, keeping track of NUM_THREADS individual maxima and determining the maximum between them after the parallel operation.
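A sketch of this manual reduction, assuming per-thread maxima stored in a small heap array; the function name and the field being scanned are placeholders:

```c
#include <omp.h>
#include <stdlib.h>

/* Manual "max" reduction: OMP_NUM_THREADS individual maxima, combined serially
 * after the parallel region (no native max reduction is assumed here). */
double max_abs_sketch(const double *field, int n) {
  int nthreads = omp_get_max_threads();
  double *thread_max = calloc(nthreads, sizeof(double));
  double global_max = 0.0;
  int i, t;

#pragma omp parallel private(i)
  {
    int tid = omp_get_thread_num();
#pragma omp for
    for (i = 0; i < n; i++) {
      double a = field[i] > 0.0 ? field[i] : -field[i];
      if (a > thread_max[tid]) thread_max[tid] = a;   /* per-thread maximum */
    }
  }

  /* determine the maximum between the per-thread results */
  for (t = 0; t < nthreads; t++)
    if (thread_max[t] > global_max) global_max = thread_max[t];

  free(thread_max);
  return global_max;
}
```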

urbach commented 12 years ago

sounds all very good...

the "max" operation is needed only for output, its not vital and could be left out

kostrzewa commented 12 years ago

Yes, I was surprised at how simple it was (first time I've used OpenMP; previously I only used glib/pthreads, in applications that were much less linear than tmLQCD).

But there's a bit of subtlety when it comes to making this work for all versions of the hopping matrix because of the way some of the field pointers are defined. I might need a hand there.

Also, a question that arose early on, because I tried with the SSE version first and failed, was whether there are any problems doing inline asm from multiple threads. My guess is that each core has one set of SSE registers and that we should be fine as long as we don't use more than one thread per core and I make all pointers private by removing their static keywords.

Finally, making it optional will result in a lot more messy ifdefs than we have already but I guess we don't really have a choice there.

the "max" operation is needed only for output, its not vital and could be left out

But it's quite relevant for tuning the timescales correctly, no?


My main question at this point is whether I should go ahead and attempt a "final" implementation or whether this still requires discussion / profiling. My approach would be to modify each function as one git commit, thereby allowing us to easily revert the changes in case we feel that a given parallelisation is more overhead than it's worth (I'm thinking, for instance, of VOLUME to VOLUMEPLUSRAND loops), although I guess that in general anything that takes more than ~0.1 s to complete should be parallelised.

As the common lattice size is now quite respectable, I could even imagine pushing the writing of gauge configurations out into a separate, short-lived thread which would do I/O while we get on with computing (as long as we're not about to do a reversibility check). I don't know whether the context switches inherent in this would lead to an overall performance loss, though; it's certainly possible.
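As a rough sketch of that idea, assuming a detached pthread and a pre-copied snapshot of the gauge field; the job structure and the writer are hypothetical, not existing tmLQCD code:

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical job descriptor: a snapshot of the gauge field copied before the
 * thread is launched, so the HMC can keep evolving the live field. */
typedef struct {
  const double *snapshot;
  const char *filename;
} io_job;

static void *io_worker(void *arg) {
  io_job *job = (io_job *) arg;
  /* placeholder for the actual LIME writer */
  printf("writing %s in the background\n", job->filename);
  return NULL;
}

/* Launch a detached, short-lived thread that does the I/O; the caller
 * returns immediately and continues computing. */
void write_gauge_async_sketch(io_job *job) {
  pthread_t tid;
  pthread_attr_t attr;
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
  pthread_create(&tid, &attr, io_worker, job);
  pthread_attr_destroy(&attr);
}
```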

kostrzewa commented 12 years ago

I have done some profiling today (gprof and some Scalasca) which indicates that the thread management overhead is really minimal (0.5% of runtime or less). One important point that has sprung up is the threading of the I/O operations. On my machine, on a dinky 8^3 x 16 lattice, almost 15% of the time is spent writing gauge fields (LIME) and waiting for this operation to finish...

kostrzewa commented 12 years ago

Ok, I've forged ahead anyhow and have managed to implement an OpenMP version of the first Hopping_Matrix (halfspinor, sse), which shows just how much #ifdef boilerplate is added to these routines: fcd2d97ca31184da535e28007b90c9c129d7dc32 . Is this acceptable? I've managed to preserve the form of the non-OMP code (except for the removal of two _prefetch_su3(U) calls, which I think were not doing anything because they are promptly replaced by _prefetch_su3(U+predist)).
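To give a flavour of the boilerplate in question (an illustration only, not the commit itself; the macro names and the loop body are placeholders):

```c
/* Every hand-optimised kernel ends up with guards like these around the loop
 * header and the thread-private temporaries. */
void kernel_sketch(double *out, const double *in, int volume) {
  int ix;
#ifdef OMP
#pragma omp parallel for private(ix)
#endif
  for (ix = 0; ix < volume; ix++) {
#if (defined SSE2 || defined SSE3)
    /* inline-assembly path: pointers used here must not be static,
     * so that each thread keeps its own copies on its stack */
#endif
    out[ix] = in[ix];   /* placeholder body */
  }
}
```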

If all of this is OK and I can continue at the current pace I should have a testable version by the end of next week.

kostrzewa commented 12 years ago

I'm trying to test the BG/P version of the multithreaded code (with mpidimension=3, mpirun -mode SMP, OMPNumThreads 4), but I don't think my job will ever run... there are HCH02* jobs dating back to the beginning of February. @urbach If this doesn't run soon, would you be able to do a test to confirm that my changes didn't break anything?

kostrzewa commented 12 years ago

Ok, I'm at least getting interactive jobs, I guess I can do the testing myself.

urbach commented 12 years ago

Hey Bartek, sorry, I don't have time to follow all this really closely, but please go ahead! Use llrun on Jugene to try your implementation. If you need a configure example, you can find it here: /homea/hch02/hch026/bglhome/head/bgp4dim

kostrzewa commented 12 years ago

Don't worry about it, Carsten, we all know how time comes and goes. I might need some help at some point with the BG/[P,L] routines, though, because I seem to have broken them with OpenMP.

kostrzewa commented 12 years ago

@etmc I'm trying to fix the 2nd Hopping_Matrix (BGL, halfspinor) for OpenMP, but I'm failing to spot what could be wrong with it. I've tried removing the restrict keywords just from the gauge fields, then from all fields, but this doesn't change anything.

What I see happening is that in the first step the det derivative has NO force contribution (as though the spinors were all zero), and then the CG does not converge, indicating some bug somewhere in the hopping matrix. I've excluded all other parallelised parts, because number 6 (BGL, NO halfspinor) works.

If someone has the time, I would really appreciate a read through the 2nd Hopping_Matrix in my omp-final branch. Thanks.

kostrzewa commented 12 years ago

I did some benchmarks of the 6th Hopping_Matrix on Jugene. The results are on our wiki.

kostrzewa commented 12 years ago

Ok, I found the last bug in the 2nd Hopping_Matrix. It works now, but the hybrid version is really slow (compared to the 6th Hopping_Matrix).

In fact, I seem to be seeing that the halfspinor version is slower in general than the non-halfspinor version. Does that make sense? (Including 5.1.6 pure MPI on 1024 VNs.)

kostrzewa commented 12 years ago

Interesting development: Florian is trying to run on HLRN, and for finite temperature the threaded version seems to outperform the pure MPI version by about 20%. It is unfortunate that the shmem version has not been tried there; I would assume it would be much faster.

kostrzewa commented 12 years ago

This is the current status of the threaded code: https://github.com/kostrzewa/tmLQCD/compare/omp-final

A lot of changes, but I've been very careful to check the implementation with every commit (usually equivalent to the modification of one function or one set of related functions). I have NOT done long-statistics runs. These are necessary because I parallelise a few Kahan summations (turning them into OMP_NUM_THREADS independent Kahan summations), the results of which are added together naively. Given that OMP_NUM_THREADS should, for the foreseeable future, be a number < 100, I think this should have a negligible effect on rounding errors.
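A sketch of the summation pattern described here, assuming a plain array of doubles; the real code sums spinor scalar products, but the structure is the same:

```c
#include <omp.h>
#include <stdlib.h>

/* Each thread runs its own compensated (Kahan) sum over its chunk of the loop;
 * the per-thread partial sums are then added together naively. */
double kahan_sum_sketch(const double *x, int n) {
  int nthreads = omp_get_max_threads();
  double *partial = calloc(nthreads, sizeof(double));
  double total = 0.0;
  int i, t;

#pragma omp parallel private(i)
  {
    int tid = omp_get_thread_num();
    double sum = 0.0, comp = 0.0;
#pragma omp for
    for (i = 0; i < n; i++) {
      double y = x[i] - comp;        /* compensated step */
      double tmp = sum + y;
      comp = (tmp - sum) - y;        /* recover the low-order bits */
      sum = tmp;
    }
    partial[tid] = sum;
  }

  for (t = 0; t < nthreads; t++)     /* naive combination of the partial sums */
    total += partial[t];

  free(partial);
  return total;
}
```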

As far as I can tell I've parallelised all available VOLUME loops, but I could have missed some loops which are delimited by N (although I grep'ed for those too). I did not do the shmem version of the Hopping_Matrix yet, although I hope to at least write that soon. I also didn't touch the blocked versions because I currently don't really understand them.

Performance seems good on Intel, excellent on the Hannover shmem machine if the correct number of threads is chosen, and quite disappointing on JuGene, although that was to be expected.

I will write a guide regarding the correct compiler settings and possible optimisation pathways. I will also write a report (probably due mid- to end-May) about the strategy and the implementation details.

One major area of possible improvement is the choice of scheduler for the loops. This, however, is highly machine dependent and I don't think I have the time to test all possible combinations right now.
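As a sketch, the knob in question is the schedule clause on each loop; which variant wins depends on the machine, and the chunk size of 64 sites is just an arbitrary example value:

```c
/* Alternatives to try per loop: schedule(static), schedule(static, 64),
 * schedule(dynamic, 64), schedule(guided).  The loop itself is a placeholder. */
void scale_field_sketch(double *out, const double *in, double c, int n) {
  int ix;
#pragma omp parallel for schedule(static) private(ix)
  for (ix = 0; ix < n; ix++) {
    out[ix] = c * in[ix];
  }
}
```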

kostrzewa commented 12 years ago

As a typical example of the 80:20 problem, I have left many of the routines in linalg unparallelised and am now doing the incredibly repetitive labour of going through them to increase the parallelisation factor. Again, I'm doing this in such a way that each function corresponds to one descriptive commit, so that they can be reverted if we find that any of them have more overhead than speed-up. Hopefully I'll be done by tomorrow, but there's still testing to be done!

urbach commented 11 years ago

I guess we can close this now...!