Closed cwpearson closed 4 years ago
A similar run on Summit with 2 nodes and 3 ranks per node, with xyz=120. Here we can see that each RemoteRecv is followed by the MPI send.
The polling logic is reworked to favor sends before anything else. Now the 2-node/3-rank case looks like the above.
A profile of
exchange-strong 60 60 60
on one 2-socket P9 node with 4 V100s, using the DomainKernel and Staged methods.
The host-to-host send starts long after the pack/d2h copy, even though that copy completes shortly after it is initiated. The send should be started as early as possible for two reasons: 1) to hide this latency, and 2) to overlap with CUDA as much as possible.
Possible solution: start the stateful sender polling process earlier and more often, and suspend it when there is no work to do. Do we want to give the receivers the opportunity to start early too? Probably not, as that might delay one of our sends.
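A rough sketch of the send-first polling idea, for discussion only. This is not the library's actual code: `Sender`, `Receiver`, `PollStats`, and `poll_until_done` are hypothetical names, and the CUDA/MPI completion checks are replaced by simple poll counters so the example runs without either. It assumes each sender is a small state machine (pack/d2h, then host-to-host send, then done) that is polled before any receiver on every pass and skipped once it has no work left.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical staged sender: pack/d2h copy, then host-to-host send.
enum class SendState { D2H, H2H, Done };

struct Sender {
    int id;
    int d2h_polls_left;  // simulated polls until the pack/d2h copy completes
    SendState state;

    Sender(int id_, int polls) : id(id_), d2h_polls_left(polls), state(SendState::D2H) {}

    bool pending() const { return state != SendState::Done; }

    void poll(std::vector<std::string>& log) {
        if (state == SendState::D2H && --d2h_polls_left <= 0) {
            // Real code would query a CUDA event/stream here, then post the MPI send.
            state = SendState::H2H;
            log.push_back("send" + std::to_string(id) + ": h2h started");
        } else if (state == SendState::H2H) {
            // Real code would test the MPI request here.
            state = SendState::Done;
            log.push_back("send" + std::to_string(id) + ": done");
        }
    }
};

// Hypothetical receiver that needs a fixed number of polls to finish.
struct Receiver {
    int id;
    int polls_left;

    Receiver(int id_, int polls) : id(id_), polls_left(polls) {}

    bool pending() const { return polls_left > 0; }

    void poll(std::vector<std::string>& log) {
        if (--polls_left == 0) {
            log.push_back("recv" + std::to_string(id) + ": done");
        }
    }
};

struct PollStats {
    int sender_polls = 0;
    int receiver_polls = 0;
};

// Each pass polls every pending sender before any receiver.
// Senders with no work left are skipped (suspended), so they cost nothing.
PollStats poll_until_done(std::vector<Sender>& senders,
                          std::vector<Receiver>& receivers,
                          std::vector<std::string>& log) {
    PollStats stats;
    bool work = true;
    while (work) {
        work = false;
        for (auto& s : senders) {
            if (!s.pending()) continue;  // suspended: nothing to send
            s.poll(log);
            ++stats.sender_polls;
            work = work || s.pending();
        }
        for (auto& r : receivers) {
            if (!r.pending()) continue;
            r.poll(log);
            ++stats.receiver_polls;
            work = work || r.pending();
        }
    }
    return stats;
}
```

With one sender whose simulated d2h copy takes two polls and one receiver that takes four, the sender is polled three times and then skipped, while the receiver continues to be polled until it finishes. Polling receivers only after all pending senders in each pass reflects the concern above that starting receives early might delay a send.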