etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

Interleaved communication / computation #77

Open kostrzewa opened 12 years ago

kostrzewa commented 12 years ago

Given the current situation, I'm in the process of getting a bit more familiar with the MPI parallelisation of tmLQCD, and with the exchange routines in particular. One thing that strikes me, and please correct me if I'm wrong, is that we implement non-blocking communication but don't really make use of the possibility of interleaving communication and computation, because the Waitalls make the xchange routines as a whole effectively blocking. Was any profiling ever done to see how much time is spent idling in those Waitalls?
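
(For what it's worth, a minimal way to answer that question would be to bracket the Waitall with timers and reduce the worst case over ranks. This is purely a sketch; halo_wait_time, waitall_timed and report_wait_time are made-up names, not existing tmLQCD routines.)

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical instrumentation: accumulate the time spent blocked in
 * MPI_Waitall inside an exchange routine. 'nreq' and 'requests' stand
 * for whatever request array the routine has posted. */
static double halo_wait_time = 0.0;

void waitall_timed(int nreq, MPI_Request *requests) {
  double t0 = MPI_Wtime();
  MPI_Waitall(nreq, requests, MPI_STATUSES_IGNORE);
  halo_wait_time += MPI_Wtime() - t0;
}

void report_wait_time(MPI_Comm comm) {
  double max_wait = 0.0;
  int rank;
  MPI_Comm_rank(comm, &rank);
  /* the slowest rank sets the pace, so report the maximum */
  MPI_Reduce(&halo_wait_time, &max_wait, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
  if (rank == 0) printf("max time idling in Waitall: %e s\n", max_wait);
}
```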

urbach commented 12 years ago

Interleaving communication and computation would require rewriting the geometry in tmLQCD. For a long time most MPI implementations didn't support interleaving.

The non-blocking communication is written for the Blue Gene, where it helps because you can communicate in all directions at once, using all the physical connections; that is where the speedup comes from. Moreover, the 4th direction can be done while waiting for the others.

kostrzewa commented 12 years ago

I could imagine there would be some speedup from the current scheme even on machines with less elaborate networks, no? The communication can be scheduled in batches rather than the hardware having to deal with an unpredictable flurry of individual requests.

Indeed, I'm aware that one would also have to rewrite many other parts of the code, because every computation would have to be done in two steps (see the sketch at the end of this comment):

1. do the part that only touches local (bulk) data while the messages are in flight;
2. do the part that needs the halo once the exchange has completed.

Depending on how much time is spent in those Waitalls, interleaving could bring a serious performance benefit on any type of hardware, but especially on machines with "slow" networks. At the moment, though, I wouldn't even know how to approach implementing this in tmLQCD short of rewriting pretty much everything in one way or another.
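
To make the two-step idea concrete, here is a generic sketch of the pattern for a single direction; compute_interior/compute_boundary and the buffer layout are placeholders, not tmLQCD's actual data structures:

```c
#include <mpi.h>

/* placeholders for the computation on sites that do not / do
 * depend on halo data */
static void compute_interior(void) { /* bulk sites, local data only */ }
static void compute_boundary(void) { /* sites next to the boundary  */ }

/* Sketch of overlapping a halo exchange with computation in one
 * direction; nb_up/nb_dn are the neighbour ranks. */
void apply_operator_overlapped(double *send, double *recv, int halo_len,
                               int nb_up, int nb_dn, MPI_Comm comm) {
  MPI_Request req[4];

  /* post the exchange */
  MPI_Irecv(recv,            halo_len, MPI_DOUBLE, nb_dn, 0, comm, &req[0]);
  MPI_Irecv(recv + halo_len, halo_len, MPI_DOUBLE, nb_up, 1, comm, &req[1]);
  MPI_Isend(send,            halo_len, MPI_DOUBLE, nb_up, 0, comm, &req[2]);
  MPI_Isend(send + halo_len, halo_len, MPI_DOUBLE, nb_dn, 1, comm, &req[3]);

  /* step 1: interior sites need no halo, so they can be done now */
  compute_interior();

  /* step 2: wait for the halo, then finish the boundary sites */
  MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
  compute_boundary();
}
```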

urbach commented 12 years ago

I would really suggest concentrating on other, much more important things for now. But I thought about this quite a bit some time ago and know how one could do it.

kostrzewa commented 12 years ago

No, of course. I just wanted to write down what I noticed while trying to understand the parallelisation.

kostrzewa commented 12 years ago

Ok, I'm getting there, almost ready to help figure out what's going wrong in your additions for xchange_deri.

Just so I don't forget, I'll jot this down here:

The xchange routines could be made fully non-blocking by taking a slight performance hit and carrying out two-hop communication for the edge exchange, i.e. talking to the diagonal neighbours directly. It may even turn out to be faster, because we would avoid the multiple Waitalls that currently resolve the dependencies, and the hardware would be free to schedule the communication in whatever way is most efficient at a given moment, taking any available path to deliver the required edge field.

It would require promoting the g_nbD[up,dn] objects to arrays, or simply adding a few more such as g_nb_t_up_x_dn (although the array is probably the more forward-looking implementation in case we ever end up with bigger loops in the action).
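
Just to illustrate what such additional neighbours could look like: assuming the usual 4D Cartesian communicator (tmLQCD sets one up during MPI initialisation, and the lattice is periodic in all directions), a diagonal "edge" rank can be computed from shifted grid coordinates. The function name and interface below are purely illustrative; a table of these ranks, filled once at initialisation, would play the role of the proposed array.

```c
#include <mpi.h>

/* Illustrative only: rank of the neighbour one step up in direction
 * dir_a and one step down in direction dir_b of a 4D Cartesian grid.
 * cart_comm is assumed to come from MPI_Cart_create with periodic
 * boundaries, so MPI_Cart_rank wraps the shifted coordinates. */
int edge_neighbour_rank(MPI_Comm cart_comm, int dir_a, int dir_b) {
  int dims[4], periods[4], coords[4], edge_rank;

  MPI_Cart_get(cart_comm, 4, dims, periods, coords);
  coords[dir_a] += 1;   /* one hop up in dir_a   */
  coords[dir_b] -= 1;   /* one hop down in dir_b */
  MPI_Cart_rank(cart_comm, coords, &edge_rank);
  return edge_rank;
}
```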

urbach commented 12 years ago

I thought about this option of directly communicating the "edges". I didn't do it yet because, e.g. on the BG, such non-next-neighbour communication is way slower than next-neighbour communication.

BTW: I am not sure that I find the index-independent implementation clearer and easier...

kostrzewa commented 12 years ago

> I thought about this option of directly communicating the "edges". I didn't do it yet because, e.g. on the BG, such non-next-neighbour communication is way slower than next-neighbour communication.

But technically we do non-next-neighbour communication anyway, just manually, by pulling in the boundary of a next-neighbour. In addition, and this depends on how smart the MPI scheduler really is, we waste time doing more Waitalls than necessary, during which the machine could already be doing all that non-next-neighbour communication. I mean, getting an edge should be guaranteed to be a two-hop communication, right? (Because of the mapping of the lattice onto the network topology.)
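
Schematically (this is not tmLQCD's actual code, and all buffer names and extents are made up), the "pull in the boundary of a next-neighbour" pattern looks like the following; the point is only that the second hop cannot be posted before the first Waitall completes, which is the dependency discussed above.

```c
#include <mpi.h>

/* Two-hop edge exchange sketch: the t-x edge is obtained by first
 * receiving the x-face halo and then forwarding its t-boundary
 * (edge_send_part points into x_face_recv) to the t-neighbour. */
void edge_exchange_two_hop(double *x_face_send, double *x_face_recv,
                           int face_len,
                           double *edge_send_part, double *edge_recv,
                           int edge_len,
                           int nb_x_up, int nb_x_dn,
                           int nb_t_up, int nb_t_dn, MPI_Comm comm) {
  MPI_Request req[2];

  /* hop 1: ordinary face exchange in the x direction */
  MPI_Irecv(x_face_recv, face_len, MPI_DOUBLE, nb_x_dn, 10, comm, &req[0]);
  MPI_Isend(x_face_send, face_len, MPI_DOUBLE, nb_x_up, 10, comm, &req[1]);
  MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* must finish before hop 2 */

  /* hop 2: forward part of the freshly received face in the t direction */
  MPI_Irecv(edge_recv,      edge_len, MPI_DOUBLE, nb_t_dn, 11, comm, &req[0]);
  MPI_Isend(edge_send_part, edge_len, MPI_DOUBLE, nb_t_up, 11, comm, &req[1]);
  MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}
```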

urbach commented 12 years ago

Keep in mind that for every xchange_deri (or xchange_gauge) the Dirac operator is applied thousands of times, and with it xchange_field...