In the current design, each MPI rank can have multiple compute regions. This means that, when transferring halos between regions, a rank may send another rank several messages that the receiver cannot tell apart.
Currently, this is handled by sending each transfer kind in a separate wave, with a barrier between waves. This requires multiple MPI barriers, which may slow down communication.
The other option is to overlap all messages, but then each message must be unique so that it can be matched to the right receive.
To make the messages unique, the tag must completely specify the communication: it has to encode dstGPU, direction, data field, and transfer kind.
For example, if we supported 8 GPUs and 64 data fields, the tag fields would be:
dstGPU: 3 bits (0-7)
dataIdx: 6 bits (0-63)
direction: 1 bit (pos/neg)

3D transfer kinds:
+x face, -x face
+y face, -y face
+z face, -z face
+x/+z edge, +x/-z edge
+x/+y edge, +x/-y edge
-x/+z edge, -x/-z edge
-x/+y edge, -x/-y edge
+y/+z edge, +y/-z edge
-y/+z edge, -y/-z edge
+x/+y/+z, +x/+y/-z corner
+x/-y/+z, +x/-y/-z corner
-x/+y/+z, -x/+y/-z corner
-x/-y/+z, -x/-y/-z corner

Encode the kind with one value per dimension:
0: not present
1: negative
2: positive
3: reserved
This requires 2 bits per dimension, so a 3D kind takes 6 bits.
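
A minimal sketch of this per-dimension encoding, written in Python for brevity. The names encode_kind and decode_kind, and the choice to put x in the lowest two bits, are illustrative assumptions and not part of the design described here.

```python
from itertools import product

# Per-dimension codes from the scheme above; the value 3 is reserved.
NOT_PRESENT, NEGATIVE, POSITIVE = 0, 1, 2
BITS_PER_DIM = 2


def encode_kind(direction):
    """Pack a direction vector with entries in {-1, 0, +1} into an integer.

    Assumption: dimension 0 (x) occupies the lowest two bits.
    """
    to_value = {0: NOT_PRESENT, -1: NEGATIVE, 1: POSITIVE}
    code = 0
    for dim, d in enumerate(direction):
        if d not in to_value:
            raise ValueError(f"direction entries must be -1, 0, or 1, got {d}")
        code |= to_value[d] << (BITS_PER_DIM * dim)
    return code


def decode_kind(code, ndim):
    """Inverse of encode_kind for an ndim-dimensional stencil."""
    to_dir = {NOT_PRESENT: 0, NEGATIVE: -1, POSITIVE: 1}
    direction = []
    for dim in range(ndim):
        value = (code >> (BITS_PER_DIM * dim)) & 0b11
        if value not in to_dir:
            raise ValueError("reserved value 3 found in kind code")
        direction.append(to_dir[value])
    return tuple(direction)


# All 3D transfer kinds: every direction except (0, 0, 0).
kinds_3d = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]
codes_3d = {encode_kind(d) for d in kinds_3d}
```

For 3D this yields 26 distinct codes (6 faces, 12 edges, 8 corners), each of which fits in 6 bits.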
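Assuming a 32-bit tag, dstGPU, dataIdx, and direction use 10 bits, so the 22 bits left would support stencils of up to 11 dimensions. The usable width may be smaller in practice, since tags must be non-negative and MPI only guarantees that MPI_TAG_UB is at least 32767.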
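
One way the full tag could be packed is sketched below, again in Python. The field order, the names pack_tag, unpack_tag, and max_dims, and the tag_ub parameter are assumptions for illustration. tag_ub stands in for the largest tag the MPI library accepts, which a real program would query from MPI instead of hard-coding.

```python
# Field widths from the example above: 8 GPUs and 64 data fields.
GPU_BITS, DATA_BITS, DIR_BITS = 3, 6, 1
KIND_BITS_PER_DIM = 2

# Assumed layout, lowest bits first: dstGPU, dataIdx, direction, kind.
DATA_SHIFT = GPU_BITS
DIR_SHIFT = DATA_SHIFT + DATA_BITS
KIND_SHIFT = DIR_SHIFT + DIR_BITS  # 10 bits are used before the kind


def max_dims(tag_bits):
    """Number of stencil dimensions whose kind fits in the remaining bits."""
    return max(0, tag_bits - KIND_SHIFT) // KIND_BITS_PER_DIM


def pack_tag(dst_gpu, data_idx, direction, kind, ndim=3, tag_ub=2**31 - 1):
    """Combine the fields into one non-negative integer tag.

    tag_ub is a stand-in for the largest tag the MPI library accepts; the
    default assumes a 32-bit signed int and is not queried from MPI here.
    """
    fields = (
        ("dst_gpu", dst_gpu, GPU_BITS),
        ("data_idx", data_idx, DATA_BITS),
        ("direction", direction, DIR_BITS),
        ("kind", kind, KIND_BITS_PER_DIM * ndim),
    )
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    tag = (
        dst_gpu
        | (data_idx << DATA_SHIFT)
        | (direction << DIR_SHIFT)
        | (kind << KIND_SHIFT)
    )
    if tag > tag_ub:
        raise ValueError(f"tag {tag} exceeds the upper bound {tag_ub}")
    return tag


def unpack_tag(tag, ndim=3):
    """Recover (dst_gpu, data_idx, direction, kind) from a packed tag."""
    dst_gpu = tag & ((1 << GPU_BITS) - 1)
    data_idx = (tag >> DATA_SHIFT) & ((1 << DATA_BITS) - 1)
    direction = (tag >> DIR_SHIFT) & ((1 << DIR_BITS) - 1)
    kind = (tag >> KIND_SHIFT) & ((1 << (KIND_BITS_PER_DIM * ndim)) - 1)
    return dst_gpu, data_idx, direction, kind
```

With a 32-bit budget, max_dims(32) gives 11, matching the estimate above; excluding the sign bit, max_dims(31) gives 10. The 3D layout needs 16 bits in total, so passing tag_ub=32767 makes pack_tag reject any tag whose top kind bit is set.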