JeffersonLab / qphix

QCD for Intel Xeon Phi and Xeon processors
http://jeffersonlab.github.io/qphix/

Separate parallelization logic from Dslash classes #97

Open | martin-ueding opened this issue 6 years ago

martin-ueding commented 6 years ago

All four Dslash classes contain very similar parallelization and communication logic: various #pragma omp directives and hundreds of lines that only do array and thread index calculations. This is completely independent of the actual physical Dirac operator (which is perhaps a better name for the Dslash classes). The merge of devel into the hacklatt-strongscale branch showed that identical changes were made for Wilson and clover, for both Dslash and achimbdpsi, so this code should live somewhere else.

One of the reasons the hacklatt-strongscale branch was not merged four months ago was supposedly that it does not improve performance in all situations, right? So what we really need is the ability to simply swap the messaging model between the old queues and the hacklatt-strongscale model, perhaps by exchanging the concrete implementation behind an interface (abstract base class).
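A minimal sketch of what such an interface could look like (the names `CommsStrategy`, `startFaceExchange`, and `finishFaceExchange` are hypothetical, not existing QPhiX API):

```cpp
// Hypothetical interface; not QPhiX's actual class or method names.
class CommsStrategy {
public:
  virtual ~CommsStrategy() = default;

  // Post the non-blocking sends/receives for the halo faces.
  virtual void startFaceExchange() = 0;

  // Block until the faces needed for the boundary terms have arrived.
  virtual void finishFaceExchange() = 0;
};

// The old queue-based model and the hacklatt-strongscale model become two
// concrete implementations that can be swapped without touching the kernels.
class QueueComms : public CommsStrategy {
public:
  void startFaceExchange() override { /* post per-face message queues */ }
  void finishFaceExchange() override { /* drain queues in arrival order */ }
};

class StrongScaleComms : public CommsStrategy {
public:
  void startFaceExchange() override { /* post paired forward/backward faces */ }
  void finishFaceExchange() override { /* complete directions in a fixed order */ }
};
```

A Dslash object would then hold a pointer to a `CommsStrategy` and never touch the message-passing details itself, so switching between the two comms models becomes a construction-time choice rather than a code change in every operator.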

This refactoring would make it much easier to port the TM Wilson and TM clover operators to the new communication model. Right now the quick fix would be to redo the same changes in four more methods:

  1. TM Wilson Dslash
  2. TM Wilson achimbdpsi
  3. TM clover Dslash
  4. TM clover achimbdpsi

Since this is a major change, we should land all other feature branches before we do so to avoid painful merges.

kostrzewa commented 6 years ago

As far as I can tell, the strongscale branch is faster in a few situations but slower in many others. Also, at least in my tests, there were lots of deadlocks, so it's certainly not ready to replace the current comms model. Splitting forward and backward face completion is probably a good idea in any case, though.

The other problem is that the dslash and the communication code get intertwined in complicated ways when you want to ensure full overlap of computation and communication. Requiring support both for receive queues (which are great on many machines, as far as I can tell) and for having a single thread or some threads explicitly progress the comms (by spinning on MPI_Wait) makes the abstraction even harder to come up with (though not impossible).
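For illustration only, a rough sketch (not QPhiX code; `overlapped_body` and `compute_body_sites` are made-up names) of the progress-thread pattern: one thread spins on the outstanding requests to drive MPI progress while the others compute the interior, which is exactly the kind of coupling that makes a clean abstraction hard:

```cpp
#include <mpi.h>
#include <omp.h>

// Illustrative only: one thread drives MPI progress by spinning on the
// outstanding face requests while the other threads compute interior sites.
// Assumes the MPI library was initialized with a suitable thread level.
void overlapped_body(MPI_Request *face_requests, int nreq)
{
#pragma omp parallel
  {
    if (omp_get_thread_num() == 0) {
      // Comms-progress thread: poll until all face messages have completed
      // (spinning on MPI_Test*, or equivalently blocking in MPI_Waitall).
      int done = 0;
      while (!done) {
        MPI_Testall(nreq, face_requests, &done, MPI_STATUSES_IGNORE);
      }
    } else {
      // Remaining threads compute the interior sites that need no halo data.
      // compute_body_sites(omp_get_thread_num() - 1, omp_get_num_threads() - 1);
    }
#pragma omp barrier
    // Every face has arrived here; all threads can now do the boundary sites.
  }
}
```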

Another difficulty is that performance improvements will almost certainly require moving thread dispatch further up in the hierarchy. This in turn intertwines thread and MPI barriers, which is another aspect which needs to be taken care of.

In my offload_comms branch, I've moved thread dispatch up to the operator level (outside of dslash and dslashAChiMinusBDPsi). In one situation I was able to improve performance by more than 30% at the far end of the strong-scaling window on KNL+OPA (Marconi A2). However, in some other situations I get (mild) performance regressions. I also still have unpredictable crashes, probably because I need another MPI barrier in the operator.
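As an illustration of why the barriers get intertwined, here is a rough sketch with hypothetical names (not the actual offload_comms code) of dispatch at the operator level: the OpenMP parallel region is opened once around both kernel applications, so thread barriers separate the kernels, and any MPI synchronization has to be funneled through one thread and followed by another thread barrier:

```cpp
#include <mpi.h>
#include <omp.h>

// Hypothetical operator-level dispatch; the kernel calls are placeholders.
void wilson_operator(/* spinor and gauge field arguments elided */)
{
#pragma omp parallel
  {
    // dslash_body_and_faces(...);      // first kernel, per-thread work

#pragma omp barrier  // all threads must finish before shared buffers are reused

#pragma omp master
    {
      // Rank-level synchronization, e.g. before reusing comms buffers that
      // are shared between the two kernel applications.
      MPI_Barrier(MPI_COMM_WORLD);
    }
#pragma omp barrier  // omp master has no implied barrier, so add one explicitly

    // achimbdpsi_body_and_faces(...);  // second kernel
  }
}
```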

bjoo commented 6 years ago

Hi Bartek, Martin. It has been a while since I looked at the comms strategies. The queue-based approach was added by @tkurth and was very useful on fabrics where messages could arrive in any order. A downside was that binary reproducibility could no longer be guaranteed, since the ordering of the arithmetic on the corners was no longer fixed. The other aspect was that we lost a lot of performance by processing only one face at a time.

So the strongscale (or, more precisely, the nesap-hacklatt-strongscale) branch went back to the pre-message-queue approach. In this instance there was a strict ordering on the comms directions, and we could check that the forward and backward faces had both arrived in a given direction. I then tried to split the face processing so that half of the threads did the forward face and the other half did the back face. Dealing with faces pairwise also eliminated races on the corners. I don't recall getting many deadlocks, but there was a particular set of races when updating some direction, with multiple threads potentially trying to access the same vector, e.g. when the X-dir was only 1 SOA long.
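For concreteness, a rough sketch of the pairwise scheme described above (not the actual branch code; `apply_face` is a hypothetical stand-in for the real face kernels): directions are completed in a fixed order, both faces of a direction are waited on together, and the thread team is split so that one half processes the forward face and the other half the backward face:

```cpp
#include <mpi.h>
#include <omp.h>

// Illustrative only: the request arrays hold the posted receives per dimension.
void process_faces_pairwise(MPI_Request recv_fwd[4], MPI_Request recv_bwd[4])
{
  for (int dim = 0; dim < 4; ++dim) {  // strict direction order: X, Y, Z, T
    // Wait until both faces of this direction have arrived.
    MPI_Wait(&recv_fwd[dim], MPI_STATUS_IGNORE);
    MPI_Wait(&recv_bwd[dim], MPI_STATUS_IGNORE);

#pragma omp parallel
    {
      const int tid = omp_get_thread_num();
      const int nthreads = omp_get_num_threads();
      if (tid < nthreads / 2) {
        // First half of the team applies the forward-face contribution.
        // apply_face(dim, /*forward=*/true, tid, nthreads / 2);
      } else {
        // Second half applies the backward-face contribution.
        // apply_face(dim, /*forward=*/false, tid - nthreads / 2,
        //            nthreads - nthreads / 2);
      }
    }
    // Completing directions pairwise and in a fixed order keeps the corner
    // arithmetic deterministic, which restores binary reproducibility.
  }
}
```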

I would think that this 'strongscale' approach is perhaps less optimal on networks where there is a higher chance of messages arriving out of send order.

I would support redesigning this, as also suggested by Martin in a Slack chat. I am mostly done with my QPhiX hacking to help my MG code, and most new optimization for that will likely appear in the mg-proto package at the coarser levels, so refactors of QPhiX will not step on my toes. I am en route home from a meeting. Let me check that all my mg_mods are checked in and merged with devel. I may be able to do this between flights or by noon tomorrow (EDT).

Thanks very much for all your continued work with QPhiX and with best wishes, Balint

Dr Balint Joo, Scientific Computing Group, Jefferson Lab, Tel: +1 757 269 5339 (Office)
