MPI_Isend in remote sender can be very delayed

cwpearson commented 4 years ago

A profile of exchange-strong 60 60 60 on one 2-socket P9 node with 4 V100s, using the DomainKernel and Staged methods.

[Image: Screenshot_20200922_144056]

The host-to-host send starts long after the pack/d2h copy, even though that copy completes shortly after it is initiated. The send should be started as early as possible for two reasons: 1) to hide its latency, and 2) to overlap with CUDA work as much as possible.

Possible solution: start the stateful sender polling process earlier and run it more often, suspending it when there is no work to do. Do we want to give the receivers the opportunity to start early too? Probably not, as that might delay one of our sends.
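A rough sketch of what "poll the stateful sender early and often, suspend when idle" could look like. This is not the code in the repository: `StagedSender`, `SendState`, and `poll_senders` are hypothetical names, and the three callbacks are stand-ins for the real CUDA and MPI calls (something like `cudaEventQuery`, `MPI_Isend`, and `MPI_Test`), so that the sketch runs without a GPU or an MPI launcher.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical states for one staged remote send (names are illustrative).
enum class SendState { WaitD2H, WaitSend, Done };

struct StagedSender {
  std::function<bool()> d2h_done;   // stand-in for a non-blocking "is the d2h copy finished?" query
  std::function<void()> post_send;  // stand-in for posting the host-to-host send (e.g. MPI_Isend)
  std::function<bool()> send_done;  // stand-in for a non-blocking completion test (e.g. MPI_Test)
  SendState state = SendState::WaitD2H;

  // Advance by at most one state without blocking. Returns true if the state changed.
  bool poll() {
    assert(d2h_done && post_send && send_done);  // all callbacks must be set
    switch (state) {
      case SendState::WaitD2H:
        if (!d2h_done()) return false;
        post_send();  // post the send the moment the staging copy is done
        state = SendState::WaitSend;
        return true;
      case SendState::WaitSend:
        if (!send_done()) return false;
        state = SendState::Done;
        return true;
      case SendState::Done:
        return false;
    }
    return false;
  }
};

// Poll every sender once. Returns true if any sender made progress;
// a caller can suspend or yield when this returns false.
inline bool poll_senders(std::vector<StagedSender>& senders) {
  bool progressed = false;
  for (auto& s : senders) progressed = s.poll() || progressed;
  return progressed;
}
```

The point of the design is that `poll_senders` is cheap enough to call from many places in the exchange, so the send is posted on the first poll after the copy completes rather than at a fixed point late in the exchange.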

cwpearson commented 4 years ago

A similar run on Summit with 2 nodes and 3 ranks per node, with xyz=120. Here we can see that each RemoteRecv is followed by the MPI send.

cwpearson commented 4 years ago

The polling logic has been reworked to favor sends before anything else. The 2-node/3-rank case now looks like the above.
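For illustration, one possible shape of a "sends before anything else" polling pass. This is only a sketch under assumptions, not necessarily how the rework is implemented: `Poll` and `poll_once` are hypothetical names, and each pollable item is modeled as a callback that returns true when it made progress.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// A pollable step: returns true if it made progress (hypothetical interface).
using Poll = std::function<bool()>;

// One pass of a send-first scheduler. Senders are always polled first; the
// other work (receivers, unpacks, ...) is only polled on a pass in which no
// sender could advance. Returns true if anything made progress.
inline bool poll_once(const std::vector<Poll>& senders,
                      const std::vector<Poll>& others) {
  bool send_progress = false;
  for (const auto& s : senders) send_progress = s() || send_progress;
  if (send_progress) return true;  // go around again for sends first

  bool other_progress = false;
  for (const auto& o : others) other_progress = o() || other_progress;
  return other_progress;
}
```

Calling `poll_once` in a loop until it returns false drains every sender step that is ready before any receiver step runs, at the cost of receivers starting slightly later.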

cwpearson/stencil #27