urbach closed this issue 11 years ago
We should have a longish Skype conference / f2f meeting to plan this, because we will need to change a lot of stuff and make some decisions regarding the placement of parallel sections etc.
yes, I agree, this topic needs discussion. But we have to push it a bit...
Hmm, at least without half spinors or gauge copies, it took me all of 5 minutes to write a working threaded version of Hopping_Matrix (3 lines of code...). And without SSE, of course...
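For reference, the naive approach amounts to little more than a parallel-for pragma over the site loop. A minimal sketch (a simplified stand-in, not the actual tmLQCD `Hopping_Matrix` code):

```c
#include <assert.h>

#define VOLUME 1024  /* stand-in for the lattice volume */

/* Simplified stand-in for Hopping_Matrix: the real routine loops over
 * all (even or odd) sites and accumulates neighbour contributions into
 * the output spinor.  The key point is that iterations are independent,
 * so a single pragma over the site loop suffices. */
static void hopping_matrix_sketch(double *out, const double *in) {
#pragma omp parallel for
  for (int ix = 0; ix < VOLUME; ix++) {
    /* each thread writes a disjoint part of the output field */
    out[ix] = 2.0 * in[ix];
  }
}
```

Without `-fopenmp` the pragma is simply ignored and the loop runs serially, which is why the change is so unobtrusive.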
Although it doesn't work in a mixed mpi/openmp setup... I wonder why though!
Actually, it was just very slow because of the massive overhead (2 processes, 4 threads/process)
c2a8345ca875f17483c7edf6381c7745888f798b is the test implementation, --disable-halfspinor, --disable-sse*, compile with gcc necessary (or change -fopenmp to -openmp in configure.in to compile with icc)
I must be doing something wrong because the performance is truly abysmal. Or maybe the overhead of creating and destroying threads this often is simply that large. Alternatively, I could be enticing a lot of cache misses because this implementation happily writes to the same output spinor, albeit at different locations in memory. I don't know how OpenMP handles this internally and what sort of impact on performance it would have. It might make sense to implement a reduction and work on local data only except for the boundaries.
Interestingly this seems to be fairly machine-dependent. Testing 10 iterations of sample-hmc0, 8^3 x 16:
I see why it appeared to be so slow this afternoon. When run in hybrid mode (2 mpi with 2 threads each), it's about a factor of 10 slower. (maybe this is just true when the N mpi processes share the same memory controller, it could differ when the mpi processes are on physically different machines)
Out of interest, to get some sort of preliminary picture, I will check these numbers on the Zeuthen workgroup server, a machine with 8 cores and 16 SMT threads.
I would cautiously say that I have good news to report regarding a future OpenMP integration. Locally, my test executable built from the current omp branch https://github.com/kostrzewa/tmLQCD/compare/omp, with parallelisation added to a number of functions, achieves performance parity with the MPI build on my machine. I wasn't able to test larger degrees of parallelisation because the workgroup servers are under heavy load. My guess, though, is that OpenMP should outperform MPI at some point.
Of course, my approach was the most naive possible and adding OpenMP to the SSE / BG code will be more difficult, mainly for indexing reasons, but I think we have a good skeleton to build on. Of course, I still need to answer the question regarding hybrid performance... In addition, I have not made this optional currently.
Further, one could imagine doing something with update_gauge and update_momenta too, although the "max" reduction will be problematic because OpenMP does not support this natively.
More good news! On my laptop, a dual-core machine with 4 hardware SMT threads, I roughly get performance parity with hybrid MPI / OpenMP. The trick is not to exceed the number of hardware SMT threads, because otherwise the context switching slows things down by a factor of 10 or so.
It remains to be seen how this scales on, say, a 12 core machine capable of running 24 threads. Or, for that matter, what sort of impact it will have on the BG/Q.
And another, final update for today:
As the WGS, an 8-core, 16-SMT machine, has freed up, I was able to run some tests there. MPI with 8 processes leads the pack, but 4MPI_2OMP is only 2-3% slower (4_4 is 5-7% slower). As I chose too small a lattice to begin with, I was not able to test 16 MPI processes. Of course, these tests completely disregard true communication overheads, as they are run on one machine. Another caveat is that so far I've only tested hmc0.
> Further, one could imagine doing something with update_gauge and update_momenta too, although the "max" reduction will be problematic because OpenMP does not support this natively.
Indeed, this would have to be done manually, keeping track of NUM_THREADS individual maxima and determining the maximum between them after the parallel operation.
sounds all very good...
the "max" operation is needed only for output, it's not vital and could be left out
Yes, I was surprised at how simple it was (this is the first time I've used OpenMP; I've usually only used glib-/p-threads, in applications that were much less linear than tmLQCD).
But there's a bit of subtlety when it comes to making this work for all versions of the hopping matrix because of the way some of the field pointers are defined. I might need a hand there.
Also, a question that arose early on, because I tried with the SSE version first and failed, was whether there are any problems doing inline asm from multiple threads. My guess is that each core has one set of SSE registers and that we should be fine as long as we don't use more than one thread per core and I make all pointers private by removing their static keywords.
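To illustrate the `static`-pointer concern (a simplified, hypothetical example, not the actual SSE code): a static local lives in one shared memory location and races under concurrent calls, while an automatic variable is thread-private by construction.

```c
#include <assert.h>

/* A static local is shared by all threads, so concurrent calls from
 * several threads race on it.  An automatic (stack) variable is
 * private to each thread by construction.  (Hypothetical example,
 * not the actual SSE code.) */
static int racy_lookup(const int *field, int i) {
  static const int *p;       /* shared by all threads: unsafe under OpenMP */
  p = field + i;
  return *p;
}

static int safe_lookup(const int *field, int i) {
  const int *p = field + i;  /* per-thread stack copy: safe */
  return *p;
}
```

Both functions behave identically single-threaded; only the `racy_lookup` variant can misbehave once multiple threads call it concurrently.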
Finally, making it optional will result in a lot more messy ifdefs than we have already but I guess we don't really have a choice there.
> the "max" operation is needed only for output, it's not vital and could be left out
But it's quite relevant for tuning the timescales correctly, no?
My main question at this point is whether I should go ahead and attempt a "final" implementation or whether this still requires discussion / profiling. My approach would be to modify each function as one git commit, thereby allowing us to easily revert the changes in case we feel that a given parallelisation is more overhead than it's worth (I'm thinking, for instance, about VOLUME to VOLUMEPLUSRAND loops), although I guess that in general anything that takes more than ~0.1s to complete should be parallelised.
As the common lattice size is now quite respectable, I could even imagine pushing the writing of gauge configurations out into a separate, short-lived thread which would do I/O while we can get on with computing. (as long as we're not about to do a reversibility check) I don't know whether the context switches inherent in this would lead to an overall performance loss though. It's certainly possible.
I have done some profiling today (gprof and some scalasca) which indicates that the thread management overhead is really minimal (0.5% of runtime or less). One important point that has come up is the threading of the I/O operations. On my machine, on a dinky 8^3x16 lattice, almost 15% of the time is spent writing gauge fields (LIME) and waiting for this operation to finish...
Ok, I've forged ahead anyhow and have managed to implement an OpenMP version of the first Hopping_Matrix (halfspinor, SSE), which shows just how much #ifdef boilerplate is added to these routines: fcd2d97ca31184da535e28007b90c9c129d7dc32 . Is this acceptable? I've managed to preserve the form of the non-OMP code (except for the removal of two _prefetch_su3(U) calls, which I think were not doing anything because they are promptly replaced by _prefetch_su3(U+predist)).
If all of this is OK and I can continue at the current pace I should have a testable version by the end of next week.
Trying to test the BGP version of the multithreaded code (doing mpidimension=3, mpirun -mode SMP, OMPNumThreads 4) but I don't think my job will ever run... there are HCH02* jobs dating back to the beginning of February. @urbach If this doesn't run soon, would you be able to do a test to confirm that my changes didn't break anything?
Ok, I'm at least getting interactive jobs, I guess I can do the testing myself.
Hey Bartek, sorry, I don't have time to follow all this really closely, but please go ahead! Use llrun on jugene to try your implementation. If you need a configure example, you can find it here: /homea/hch02/hch026/bglhome/head/bgp4dim
Don't worry about it Carsten, we all know how time comes and goes. I might need some help at some point with the BG/[P,L] routines though because I seem to have broken them with OpenMP.
@etmc I'm trying to fix the 2nd Hopping Matrix (BGL, halfspinor) for OpenMP but I'm failing to spot what could be wrong with it. I've tried removing the restrict keywords just from the gauge fields, then from all fields but this doesn't change anything.
What I see happening is that in the first step the det derivative has NO force contribution (as though the spinors were all zero) and then the CG does not converge, indicating some bug somewhere in the hopping matrix. I've excluded all other parallelized parts because number 6 works. (BGL, NO halfspinor)
If someone has the time, I would really appreciate a read through the 2nd Hopping_Matrix in my omp-final branch. Thanks.
I did some benchmarks of the 6th Hopping_Matrix on Jugene. Results on our wiki
Ok, I found the last bug in the 2nd Hopping_Matrix. It works now but the hybrid version is really slow (compared to the 6th hopping matrix)
In fact, I seem to be seeing that the halfspinor version is slower in general than the non-halfspinor version. Does that make sense? (Including 5.1.6 pure MPI on 1024 VNs.)
Interesting development: Florian is trying to run on HLRN and for finite-temperature the threaded version seems to outperform the pure mpi version by about 20%. It is unfortunate that the shmem version has not been tried there. I would assume it would be much faster.
This is the current status of the threaded code: https://github.com/kostrzewa/tmLQCD/compare/omp-final
A lot of changes, but I've been very careful to check the implementation with every commit (usually equivalent to the modification of one function or one set of related functions). I have NOT done long-statistics runs. These are necessary because I parallelise a few Kahan summations (turning them into OMP_NUM_THREADS independent Kahan summations), the results of which are added together naively. Given that OMP_NUM_THREADS should, for the foreseeable future, be a number < 100, I think this should have a negligible effect on rounding errors.
As far as I can tell I've parallelised all available VOLUME loops, but I could have missed some loops which are delimited by N (although I grep'ed for those too). I did not do the shmem version of the Hopping_Matrix yet, although I hope to at least write that soon. I also didn't touch the blocked versions because I currently don't really understand them.
Performance seems good on Intel, excellent on the Hannover shmem machine if the correct number of threads is chosen and quite disappointing on JuGene, although that was to be expected.
I will write a guide regarding the correct compiler settings and possible optimisation pathways. I will also write a report (probably due mid- to end-May) about the strategy and the implementation details.
One major area of possible improvement is the choice of scheduler for the loops. This, however, is highly machine dependent and I don't think I have the time to test all possible combinations right now.
As a typical example of the 80:20 problem I have left many of the routines in linalg unparallelized and am now doing the incredibly repetitive labor of going through them to increase the parallelisation factor. Again, I'm doing this in such a way that each function corresponds to one descriptive commit so that they can be reverted if we find that any of them have more overhead than speed-up. Hopefully I'll be done by tomorrow but there's still testing to be done!
I guess we can close this now...!
we need to develop a threaded version of the code, in particular for the BG/Q