E3SM-Project / HOMMEXX

Clone of ACME for the CMDV-SE project to convert HOMME to C++

MPI merge messages #174

Closed: ambrad closed this 6 years ago

ambrad commented 6 years ago

See commit messages for details.

Still need to restore some debug stuff that I temporarily altered for brevity.

Won't have this ready until ~Mon.

Before merging:

ambrad commented 6 years ago

Need to check whether pack/unpack kernels benefit from using m_i2ec, particularly on KNL.

bartgol commented 6 years ago

I don't have time to review it now, and I'm off tomorrow. I may not be able to review it till Monday. If urgent, and if @mfdeakin-sandia approves, go ahead and merge.

ambrad commented 6 years ago

That's fine. I meant urgent only relative to my own assessment that it could wait. Mon or Tues will be fine.

ambrad commented 6 years ago

Note to self when I pick this up again: For diff checking, I need to remember that the original baselines might be the buggy ones (with diffs at the 1e-14 normalized level).

ambrad commented 6 years ago

Measurements on GPU:

ne 5, 2 ranks, 2 GPUs, 3 interlaced measurements

master:
```
be recv waitall                      2        2 3.168000e+03   2.188486e+00     1.096 (     1      0)     1.093 (     0      0)
be recv waitall                      2        2 3.168000e+03   2.176116e+00     1.089 (     1      0)     1.087 (     0      0)
be recv waitall                      2        2 3.168000e+03   2.171314e+00     1.091 (     0      0)     1.080 (     1      0)
```

branch:
```
be recv waitall                      2        2 3.168000e+03   3.133976e-01     0.181 (     0      0)     0.132 (     1      0)
be recv waitall                      2        2 3.168000e+03   2.724059e-01     0.137 (     0      0)     0.135 (     1      0)
be recv waitall                      2        2 3.168000e+03   2.795641e-01     0.147 (     0      0)     0.132 (     1      0)
```

On GPU, using i2ec in the pack kernel, where it could make sense (to make writes more contiguous), doesn't help; it actually hurts slightly because of a bit more integer arithmetic. That's fine -- pack is already fast -- but it was worth trying and measuring. Bottom line: the new (send pid, recv pid) method speeds up on-node MPI costs by ~5x and is neutral w.r.t. pack/unpack kernel performance.

Now there are exactly 2 tags, 1000 and 2000. Thus, there is no threat of the tag scheme aliasing.

ambrad commented 6 years ago

Now that there are not that many messages, it might make sense to have a loop containing MPI_Waitany and an unpack kernel call for just that message, and similarly for pack and send.
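
A minimal sketch of what the receive side of that could look like, assuming one receive request and one contiguous buffer segment per comm pair; unpack_segment is a hypothetical stand-in for the real unpack kernel, not HOMMEXX's API:

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical stand-in for the unpack kernel applied to one comm pair's
// contiguous segment of the receive buffer.
void unpack_segment (int comm_pair) { /* launch unpack kernel for this segment */ }

// Unpack each monolithic message as soon as it arrives, rather than waiting
// for all of them with a single MPI_Waitall.
void recv_and_unpack (std::vector<MPI_Request>& recv_reqs) {
  const int n = static_cast<int>(recv_reqs.size());
  for (int remaining = n; remaining > 0; --remaining) {
    int idx;
    MPI_Waitany(n, recv_reqs.data(), &idx, MPI_STATUS_IGNORE);
    // The unpack for this message overlaps with the remaining receives.
    unpack_segment(idx);
  }
}
```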

bartgol commented 6 years ago

I don't understand what the point of the i2ec structure is. Also, does the speedup come from turning team policies into flat range policies?

ambrad commented 6 years ago

Previously, a send request was created for each (element, info) pair such that info.sharing == ConnectionSharing::SHARED, and the same for receives.

Consider, for example, using two ranks total. Then this procedure gives roughly 6*ne sends and 6*ne receives between those two ranks per round, where ne is the cubed-sphere ne parameter. (This is a very rough estimate that is sensitive to how the mesh decomposition is done; in any case, it is likely always at least proportional to ne.)

In fact, one wants exactly one send and one receive per round. The one exception is if one is trying to overlap computation and communication between a single comm pair. (One can still overlap among different comm pairs with monolithic messages.)
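
As a concrete sketch of "one send and one receive per comm pair per round", assuming the pack kernel has already written each neighbor's data into a contiguous segment of a single double-precision send buffer, and that send and receive counts per neighbor match; names are illustrative, not HOMMEXX's:

```cpp
#include <mpi.h>
#include <vector>

// One monolithic Irecv and one monolithic Isend per neighboring rank.
// offset has size nneighbors+1; neighbor i's segment is [offset[i], offset[i+1]).
void exchange_round (const std::vector<int>& neighbor_ranks,
                     const std::vector<int>& offset,
                     std::vector<double>& send_buf,
                     std::vector<double>& recv_buf,
                     int tag, MPI_Comm comm,
                     std::vector<MPI_Request>& reqs) {
  const int nn = static_cast<int>(neighbor_ranks.size());
  reqs.resize(2*nn);
  for (int i = 0; i < nn; ++i) {
    const int count = offset[i+1] - offset[i];
    MPI_Irecv(recv_buf.data() + offset[i], count, MPI_DOUBLE,
              neighbor_ranks[i], tag, comm, &reqs[i]);
    MPI_Isend(send_buf.data() + offset[i], count, MPI_DOUBLE,
              neighbor_ranks[i], tag, comm, &reqs[nn + i]);
  }
}
```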

Another problem with the original approach is that MPI tags are needed to distinguish the multiple messages between the same two ranks in the same round. The tagging scheme can alias if the number of lids times NUM_CONNECTIONS is more than 1000. This can happen if ne is increased sufficiently for fixed number of ranks. Monolithic messages solve this problem; we now need just two tags: one for the DSS-type exchange, and one for the min/max-type exchange, since the two can overlap.
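
For concreteness, a hedged illustration of the tag bookkeeping (the two constants follow the "1000 and 2000" mentioned above; the per-connection formula in the comment is only a guess consistent with the aliasing condition described here, not necessarily the original code):

```cpp
// With monolithic messages there is at most one message per (exchange type,
// comm pair) in flight per round, so a single fixed tag per exchange type
// suffices, and the two exchange types can overlap without aliasing.
constexpr int tag_dss    = 1000; // DSS-type exchange
constexpr int tag_minmax = 2000; // min/max-type exchange

// By contrast, a hypothetical per-connection tag of the form
//   tag = base + lid*NUM_CONNECTIONS + iconn
// spills out of its base's range once lid*NUM_CONNECTIONS exceeds the
// spacing between bases (1000 here), which is the aliasing hazard above.
```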

Now, the natural sequence k = ie*NUM_CONNECTIONS + iconn does not order the (element, info) pairs so that pairs with the same info.remote_pid are contiguous. That means it can't be used to lay out the send and receive buffers so that the one monolithic message between a comm pair occupies one contiguous section of the buffer. i2ec maps this sequence to one in which all (element, info) pairs having the same remote_pid are contiguous, providing that capability, as sketched below.
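
Here is a sketch of how such a map could be built: a permutation that groups the natural index k = ie*NUM_CONNECTIONS + iconn by remote_pid. This is an illustration of the idea, not necessarily how m_i2ec is actually constructed; the distinction between on-node and off-node connections is omitted, and remote_pid_of_k is a hypothetical flat array.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Build an i2ec-like map: natural index k -> position in a buffer ordering
// in which all (element, connection) pairs with the same remote_pid are
// contiguous, so each comm pair's data lands in one contiguous section.
std::vector<int> build_i2ec (const std::vector<int>& remote_pid_of_k) {
  std::vector<int> perm(remote_pid_of_k.size());
  std::iota(perm.begin(), perm.end(), 0);
  // Stable sort keeps the original k order within each remote_pid group.
  std::stable_sort(perm.begin(), perm.end(),
                   [&] (int a, int b) {
                     return remote_pid_of_k[a] < remote_pid_of_k[b];
                   });
  // perm[pos] = k; invert it so a kernel indexed by k knows where to write.
  std::vector<int> i2ec(perm.size());
  for (int pos = 0; pos < static_cast<int>(perm.size()); ++pos)
    i2ec[perm[pos]] = pos;
  return i2ec;
}
```

With this inverse map, a pack kernel that naturally iterates over k can write to position i2ec[k], and each comm pair's data ends up in one contiguous slice of the send buffer.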

Nothing was done to the kernels themselves; the speedup is purely in reducing the number of messages from many to just one send and one receive per comm pair per round.

Next, I'm going to see if we can use pack/MPI_Start and Waitany/unpack at the level of each message to overlap un/packing and comm.
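
A send-side sketch of that idea, complementing the receive-side loop above, assuming persistent send requests (MPI_Send_init) created once per comm pair at setup; pack_segment is again a hypothetical stand-in for the real pack kernel:

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical stand-in for the pack kernel applied to one comm pair's
// contiguous segment of the send buffer.
void pack_segment (int comm_pair) { /* launch pack kernel for this segment */ }

// Pack one comm pair's segment and start just that message, so packing of
// the next segment overlaps with communication of the previous one.
void pack_and_start (std::vector<MPI_Request>& persistent_send_reqs) {
  const int n = static_cast<int>(persistent_send_reqs.size());
  for (int i = 0; i < n; ++i) {
    pack_segment(i);
    // On GPU, the pack kernel for this segment must have completed (e.g.,
    // via a fence) before the message is started.
    MPI_Start(&persistent_send_reqs[i]);
  }
}
```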

bartgol commented 6 years ago

Ah, now I understand, thanks. And it makes sense. I didn't think of this when I wrote the class's logic. I guess I had the 1-element-per-rank ratio stuck in my mind...