JeffersonLab / qphix

QCD for Intel Xeon Phi and Xeon processors
http://jeffersonlab.github.io/qphix/

Merging strong scaling branch #91

Open azrael417 opened 7 years ago

azrael417 commented 7 years ago

The current devel branch still uses receive queues. We found that they do not perform optimally in the strong scaling limit. We should merge the strong scaling branch modifications into devel. The issue is that this merge cannot be done automatically because the interface changed a lot after restructuring. Also, the strong scaling branch has a coarse-grained OpenMP model, which makes the whole thing trickier. Once that is done, I could play around with background progression of comms using OMP tasks or similar tricks.

martin-ueding commented 7 years ago

I presume the changes are mostly in the former DyzPlus and DPsiPlus functions? Those have not really changed much, one should be able to resolve the merge conflicts by mapping the changes in DyzPlus to Dyz and DPsiPlus to DPsi.

At the moment I'd like to focus on writing for my thesis. Perhaps I can take a look at this merge eventually, but please go ahead yourself if you want it in quickly.

martin-ueding commented 7 years ago

Which branch is the correct one? I guess the first one?

bjoo commented 7 years ago

Hi Martin, It would be jlab/nesap_hacklatt_strongscale

(this is actually a merge of optimize-strongscale, which undoes the receive queues and does in-order message reception, bringing back binary reproducibility). It then splits threads working over forward and backward faces in each direction.

Best, B

On Jul 18, 2017, at 5:00 AM, Martin Ueding notifications@github.com wrote:

Which branch is the correct one? I guess the middle one?

• jlab/feature_receive_queues (already merged into devel)
• jlab/nesap_hacklatt_strongscale (Early 2017)
• jlab/optimize-strongscale (Mid 2015)


Dr Balint Joo High Performance Computational Scientist Jefferson Lab 12000 Jefferson Ave, Suite 3, MS 12B2, Room F217, Newport News, VA 23606, USA Tel: +1-757-269-5339, Fax: +1-757-269-5427 email: bjoo@jlab.org

martin-ueding commented 6 years ago

I am now merging this branch into devel, and it is really frustrating. In the meantime a lot of things have changed:

Doing a git merge devel turned out to be a waste of time. The merge algorithm would usually take the latest change in a given line; since we ran clang-format on the whole codebase, that sometimes was older code which should have been replaced. Even -X ignore-all-whitespace did not make it go through well enough. So I just merged in the files without conflicts, and now I am applying all the changes manually to the new codebase again. This is tedious and also error-prone. But I fear there is/was no other way to do it, because the branches have diverged so far.

It still does not compile, and even when it does, the tests will probably fail. I'll keep working on it.

martin-ueding commented 6 years ago

Also, these changes only benefit the Wilson and Wilson clover case. The twisted mass cases are not changed, therefore they will not see any improvement. Of course one could now try to do the same changes there as well; it would be just another four functions to change ({TM Wilson, TM clover} × {dslash, achimbdpsi}). The code is extremely similar, so I would say that refactoring to separate the parallelization logic from the physical operator would make things much easier down the road, though it would take some up-front work.

martin-ueding commented 6 years ago

For some reason, the Wilson Dslash does not work any more. Perhaps something with the is_plus handling is wrong, I don't know. Other than that, it will be a fast-forward merge into devel now.

For today I'm done, perhaps I can finish this in the next couple of days.

martin-ueding commented 6 years ago

Did the tests on the nesap_hacklatt_strongscale branch go through in March? Or was there something not working? It does not build on Travis CI because it was forked from devel before the CI setup was included there. So I would have to set up an Autotools build again to test it.

martin-ueding commented 6 years ago

The tests now work on my machine™. I made a mistake when manually merging, namely nesting a parallel section within another parallel section. The Travis CI test on branch manual-hacklatt-strongscale also succeeds.

One needs to test whether these still give the same performance that they did in March. I have merged this manually, and since the unit tests go through, it should at least give the correct result. However, I might have introduced a mistake with the OpenMP directives such that performance is not as expected. Could you please verify that the strong scaling is still improved and then merge manual-hacklatt-strongscale into devel?

Could we in the future please try to avoid these long-running feature branches and try to get features merged into devel within a few weeks? If it is an experimental feature, we can still merge it in but guard it with a feature switch (#ifdef perhaps). This way it will be easier to refactor the code and to ensure that cool features do not end in a hassle when merging.

kostrzewa commented 6 years ago

@martin-ueding

Could we in the future please try to avoid these long-running feature branches and try to get features merged into devel within a few weeks?

This simply doesn't work in science in my experience... Work like this is done, often somewhat carelessly, in short spurts of a few days of activity (at hackathons, for example) and then necessarily lingers for months or even a year (conferences, teaching duties, physics work...). The prudent thing to do is to never make cosmetic changes to the whole code-base (such as reformatting all code) unless one is absolutely sure that all feature branches have been merged, as annoying as this may seem. Of course, in our case we also introduced a metric ton of new features into QPhiX...

Excellent work on merging this, it must have been extremely challenging. Now we still need to check whether it is functionally correct in all circumstances, and check performance. In particular, it would be important to make these features optional (if possible) if they affect performance on few nodes substantially. Most of the time, however, we do work in the strong scaling limit with local lattices of the order of 8^3*16 or so...

@azrael417 I would be interested to help in looking into background progression using tasks once this has landed.

martin-ueding commented 6 years ago

This simply doesn't work in science in my experience...

My mindset with software development is usually that we are a software “company” and that we want to have the best product for our end-users. However, our product is really the papers that we write with the data that we incidentally generate with our software. So I need to make sure that devel is rather conservative, such that these long-running branches are less of a problem.

The prudent thing to do is to never make cosmetic changes […] as annoying as this may seem.

I am almost always falling into the boy-scouts pattern (leave the site cleaner than you found it) and not into the spec-ops pattern (make the needed changes and get out), therefore I rewrite comments, fix the formatting using clang-format or improve on something else. I'll try to keep the branches more focused on a single issue.

it must have been extremely challenging

Mostly irritating, because git got thoroughly confused: the strong-scale changes were made in two copies of the functions (DPsiPlus and DPsiMinus), but on devel there was only one copy of these functions. So the merge copied back all the code that I had removed, and then there were merge conflicts reaching over function scopes. git mergetool with gvimdiff (or vimdiff) helps there because you get four windows (common base, ours, theirs, merge result).

These changes are probably worthwhile to also have in the twisted mass implementations. One could change the two Dslash classes as well, but I wouldn't feel too good about that from a software engineering point of view. It seems that the parallelization is independent of the actual Dslash class in use. I have not thought enough about this, but all those team/thread/lattice index calculations should be factored out of the Dslash classes. Perhaps one has a virtual base class for the Dslash and then only specializes minor parts? That would mean virtual function calls, which might not be a performance problem; we are calling the kernels with a function pointer and it seems to be innocuous. Or we could introduce the CRTP here in order to avoid the virtual calls.

bjoo commented 6 years ago

Here is my thought on this.

In general I was trying to follow GitFlow: a very stable master, a slowly changing devel, and feature branches for new work.

However, I think one particular source of pain is that devel underwent a major (very positive) upheaval thanks to Martin’s and Bartosz’s efforts and has changed a lot structurally (with all the elimination of duplication). I would hope that this level of disruption, though beneficial, is not going to be the status quo over the long term, and once the structure settles down these merges will not be so painful.

That having been said, there is likely to be a lot of experimentation, for example in comms strategies. So one question is how best to host these experiments, and how invasive they are in the main code body. For example, I could imagine having several implementations of comm.h which we could select with CMake; however, they may have different interface code which would interfere with things like DPsiPlus/Minus. This can be sufficiently different code to have in different branches.

I think since these experiments are the projects of the individual authors, they should in principle be responsible for keeping them up to date with the current devel branch as needed unless another factor comes into play (e.g. need for expediency, or if someone really wants to test a different branch).

Ultimately we can follow the ‘trusted lieutenants’ model to merge features onto the devel branch, but at the moment our group is small, so I think we can all be trusted lieutenants, especially if our branches pass the tests on Travis before merging. So this is all fine as long as the interfaces (e.g. to Chroma) don’t change, or change only in a minor way, e.g. by adding extra optional parameters. If the interfaces change a lot we should have a discussion on Slack as to how that affects everyone.

Does this seem a reasonable attitude?

Best, B

PS: I don’t worry about cosmetics very much. Best, B


martin-ueding commented 6 years ago

This can be sufficiently different code to have in different branches.

That is exactly what I would like to avoid. Basically this would abuse git as a feature selection tool. Refactoring becomes impossible because one would have to refactor both branches at the same time. There would be tons of merge conflicts and I doubt that one could easily switch those branches at will. And since one would like to keep both around, those would be eternal feature branches. When there is a new feature on devel, we would have to merge that into both communication strategy branches. Also it would not be clear which one we deem worthy for release on master.

Also it does not scale. Say we want another aspect implemented in two different ways; then there are four branches if one wants every combination. One project that I worked on had “compile branches” because some compiler flags were hard-coded in the Makefile and some intrinsic functions were hard-coded in the code. You had to merge something in just to compile it on BG/Q. This did not scale either, because every branch needed an additional BG/Q “compile branch”.

Instead I would suggest a modular architecture with a clearly defined interface boundary. The communication interface is owned by the Dslash classes because we don't want any downward dependency arrows: the Dslash class must not care about the details of the communication, it sits above them. So we define the interface at the level of the Dslash classes and then provide various communication implementations, like so:

[diagram: comms1]

This would be a clean object-oriented design pattern. If we don't want to use dynamic polymorphism (virtual functions), we can implement this with CRTP, but that is a detail.

https://martinfowler.com/bliki/BranchByAbstraction.html

If the interfaces change a lot we should have a discussion on Slack as to how that affects everyone.

Definitely! Ideally we would have very regular communication in order to prevent working on the same aspect in different branches for weeks/months.

I think since these experiments are the projects of the individual authors, they should in principle be responsible for keeping them up to date with the current devel branch […]

Both parties can keep up with devel, and the first one to merge will have an easy time. But the second party to merge will have a hard time, especially if a lot has changed.

This is exactly what has happened with the hacklatt-strongscale branch. So I fear that this will happen again: devel changes in some way, and somebody with a feature in an isolated branch has a hard time getting it synced up with the new devel. And then it would be the “fault” of the person who changed devel in a significant way.

I would like feature branches to run only for a few days. If a new feature is not ready yet, it can still be merged into devel and simply not be used. There is no harm in having additional functions around that are not used yet, and one can already refactor since all the code is there. In case of a change in implementation details (like with the comms), I would first abstract the old implementation into the above design pattern and then add the new implementation. The old version stays around, the new version can be implemented at any pace yet still be merged into devel every day. And once that is done, adding a third implementation is really easy.

I don’t worry about cosmetics very much.

Unfortunately, for me it takes a lot of willpower to not brush up the cosmetics as I go. But I'll try to not tweak too much here and there and concentrate on the task at hand.

kostrzewa commented 6 years ago

@martin-ueding @azrael417 I've begun testing this branch on Marconi A2 (the KNL+OmniPath machine in Italy), and unfortunately my jobs lock up and get killed because they exceed the allotted wallclock time. Was the strongscale branch tested thoroughly for deadlocks?

kostrzewa commented 6 years ago

I've started experimenting with (optionally!) offloading comms to a single or multiple (not yet implemented) cores / threads. Current progress on this, based on devel, for the Wilson clover dslash is in the offload_comms branch.

On the aforementioned Marconi A2, if the measurements can be trusted, I get about 20% speed-up at the far end of the strong-scaling window (up to fluctuations which are unfortunately still a major problem on this machine) and a similar speed-up in some parts of the strong scaling window on a Broadwell-based machine with the same network. Combined with lower OpenMP overheads from coarse-grained threading, this might substantially improve our strong-scaling potential.

It would be important to figure out, however, if I've encountered deadlocks or some race conditions in the strongscale branch, as mentioned in the comment above.

bjoo commented 6 years ago

Hi Bartosz,

There was the nesap-hacklatt-strongscale branch which ran on Cori, but that is over a year old... It tried to split threads to send faces. I recall the X direction was tricky because for an Xh=4 lattice you could get races, with some threads writing one end of a vector while others tried to write the other end. Strong scaling definitely needs more attention. I would strongly encourage your efforts in this direction.

Thanks and best wishes, B

Dr Balint Joo, Scientific Computing Group, Jefferson Lab,  Tel: +1 757 269 5339 (Office)

kostrzewa commented 6 years ago

Hi Balint,

yes, Martin merged the nesap-hacklatt-strongscale branch with devel a while ago and I tested this yesterday at scale. Unfortunately I encountered some lockups and thus wanted to ask if there were some known race conditions. My local lattice was much larger than Xh=4, so it's unlikely that it was this particular race.

I'll take a closer look and will try to figure out what went wrong. I agree that strong scaling is a top priority and given the results presented by the Tsukuba/Tokyo groups on the Oakforest-PACS machine, it seems that it should be possible to get much better performance with the right strategy.

QPhiX node-level performance is really quite good with the right setup: I'm able to get well in excess of 300 GFlop/s (ppn=1, sz=2, omp_num_threads=136, balanced kmp_affinity, 68-core KNL) even for small lattices (below 16^4), and it would likely be even better with reduced OMP overheads. It should thus be possible to sustain 200 Gflop/s per node even on large machine partitions, as long as overlapping comms can be made to work.

martin-ueding commented 6 years ago

Perhaps it would be worthwhile to compile the original version from March. It is possible that I introduced that deadlock while merging. The older version likely uses GNU Autotools and might not have all the needed AVX512 kernels available.

martin-ueding commented 6 years ago

My merged version (manual-hacklatt-strongscale) compiles on Travis CI and can also run the tests with 2 MPI ranks using OpenMPI. If it is a threading deadlock (and not an MPI deadlock), one could perhaps gain some insight by compiling it with -fsanitize=thread?

martin-ueding commented 6 years ago

I have thought about having multiple communication strategies implemented in parallel. It seems to make more sense to generalize the Dslash operators (perhaps DiracParts is a better name for them, given the A and A_inv terms?). Then one can define communication strategies that use those functions. This is what I have in mind right now:

[diagram: screenshot_20171107_115658]

The IndexMagic would absorb all those nifty index computations that are scattered through the code and are replicated in various places.