ambrad closed this 6 years ago
Need to check whether pack/unpack kernels benefit from using m_i2ec, particularly on KNL.
I don't have time to review it now, and I'm off tomorrow. I may not be able to review it till Monday. If urgent, and if @mfdeakin-sandia approves, go ahead and merge.
That's fine. I meant urgent relative to my assessment that it could wait. Mon or Tues will be fine.
Note to self when I pick this up again: For diff checking, I need to remember that the original baselines might be the buggy ones (with diffs at the 1e-14 normalized level).
Measurements on GPU:
ne 5, 2 ranks, 2 GPUs, 3 interlaced measurements
master:
be recv waitall 2 2 3.168000e+03 2.188486e+00 1.096 ( 1 0) 1.093 ( 0 0)
be recv waitall 2 2 3.168000e+03 2.176116e+00 1.089 ( 1 0) 1.087 ( 0 0)
be recv waitall 2 2 3.168000e+03 2.171314e+00 1.091 ( 0 0) 1.080 ( 1 0)
branch:
be recv waitall 2 2 3.168000e+03 3.133976e-01 0.181 ( 0 0) 0.132 ( 1 0)
be recv waitall 2 2 3.168000e+03 2.724059e-01 0.137 ( 0 0) 0.135 ( 1 0)
be recv waitall 2 2 3.168000e+03 2.795641e-01 0.147 ( 0 0) 0.132 ( 1 0)
On GPU, using i2ec in the pack kernel, where it could make sense (to make writes more contiguous), doesn't help (actually hurts slightly b/c of a bit more integer arithmetic). That's fine -- pack is already fast -- but it was worth trying and measuring. Bottom line is the new (send pid, recv pid) method reduces on-node MPI costs by ~5x and is neutral w.r.t. pack/unpack kernel performance.
Now there are exactly 2 tags, 1000 and 2000. Thus, there is no threat of the tag scheme aliasing.
Now that there are not that many messages, it might make sense to have a loop containing MPI_Waitany and unpack kernel for just that message. Similar for pack and send.
I don't understand the point of the i2ec structure. Also, does the speedup come from turning team policies into flat range policies?
Previously, a send request was created for each (element, info) pair such that `info.sharing == ConnectionSharing::SHARED`, and the same for receives. Consider, for example, using two ranks total. Then this procedure gives roughly `6*ne` sends and `6*ne` receives between those two ranks per round, where `ne` is the cubed-sphere `ne` parameter. (This is a very rough estimate that is sensitive to how the mesh decomp is done; in any case, it is likely always at least proportional to `ne`.)
In fact, one wants exactly one send and one receive per round. The one exception is if one is trying to overlap computation and communication between a single comm pair. (One can still overlap among different comm pairs with monolithic messages.)
Another problem with the original approach is that MPI tags are needed to distinguish the multiple messages between the same two ranks in the same round. The tagging scheme can alias if the number of `lid`s times `NUM_CONNECTIONS` is more than 1000. This can happen if `ne` is increased sufficiently for a fixed number of ranks. Monolithic messages solve this problem: we now need just two tags, one for the DSS-type exchange and one for the min/max-type exchange, since the two can overlap.
Now, the sequence `k = ie*NUM_CONNECTIONS + iconn` does not index the (element, info) pairs such that `info.remote_pid` is contiguous. This means one can't use it to set up the send and receive buffers so that the one monolithic message between a communication pair ends up in one section of the buffer. `i2ec` maps this sequence to one in which all (element, info) pairs having the same `remote_pid` are contiguous, providing this capability.
Nothing was done to the kernels themselves; the speedup is purely in reducing the number of messages from many to just one send and one receive per comm pair per round.
Next, I'm going to see if we can use pack/MPI_Start and Waitany/unpack at the level of each message to overlap un/packing and comm.
Ah, now I understand, thanks. And it makes sense. I didn't think of this when I wrote the class' logic. I guess I had the 1elem/rank ratio stuck in my mind...
See commit messages for details.
Still need to restore some debug stuff that I temporarily altered for brevity.
Won't have this ready until ~Mon.
Before merging:
[x] Use i2ec in kernels? (No diff on GPU and KNL.)
[x] Overlap pack/send, unpack/waitany? (Can't. DSS receives can't be overlapped b/c sums must be in deterministic order. This receive dominates min/max send and recv and DSS send, so there's little point in overlapping these.)
[x] Expand variable names: os -> offset, i2ec -> buf_idx_to_elem_comm_pair_idx. I'll do this after everything else.