The final step of creating an RDMA connection is a call to connect_endpoint, which ensures that the local node is ready to send and receive data on the connection. To make sure that the other side of the connection had also finished this call, the verbs code used a barrier. The libfabric code was missing this barrier, which caused a race condition at the beginning of an SST test: one node successfully completed an RDMA write, yet the other side failed to see the updated value.
The fix is to have the nodes creating the connection synchronize at the end of connection setup.
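
The ordering the fix relies on can be illustrated with a small, self-contained simulation. This is only a sketch, not the actual patch: the two nodes are modeled as threads, and Endpoint, remote_write, TwoPartyBarrier, and connect_and_write are hypothetical names invented for the illustration. In the real code the synchronization would presumably go over the out-of-band channel between the two nodes rather than through shared memory.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-in for one side of an RDMA connection.
struct Endpoint {
    std::atomic<bool> ready{false};
    std::atomic<int> buffer{0};

    // Models connect_endpoint: afterwards the local side can send and receive.
    void connect_endpoint() { ready.store(true); }

    // Models a write arriving from the peer. If this endpoint is not ready,
    // the update is lost, which is the race described above.
    bool remote_write(int value) {
        if (!ready.load()) return false;
        buffer.store(value);
        return true;
    }
};

// Minimal single-use barrier for exactly two participants.
class TwoPartyBarrier {
    std::mutex m;
    std::condition_variable cv;
    int arrived = 0;

public:
    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(m);
        ++arrived;
        if (arrived == 2) {
            cv.notify_all();
        } else {
            cv.wait(lock, [this] { return arrived == 2; });
        }
    }
};

// Runs two simulated nodes. Each connects its endpoint, synchronizes with
// its peer, and only then writes to the peer. Returns true if both writes
// were observed on the receiving side.
bool connect_and_write() {
    Endpoint a, b;
    TwoPartyBarrier barrier;
    bool a_ok = false, b_ok = false;

    auto node = [&barrier](Endpoint& local, Endpoint& remote, int value, bool& ok) {
        local.connect_endpoint();
        barrier.arrive_and_wait();  // the synchronization step the fix adds
        ok = remote.remote_write(value);
    };

    std::thread t0(node, std::ref(a), std::ref(b), 1, std::ref(a_ok));
    std::thread t1(node, std::ref(b), std::ref(a), 2, std::ref(b_ok));
    t0.join();
    t1.join();

    return a_ok && b_ok && a.buffer.load() == 2 && b.buffer.load() == 1;
}
```

Without the arrive_and_wait call, one thread could reach remote_write before its peer has run connect_endpoint, and the write would be dropped in this model. With the barrier, neither node writes until both endpoints are ready.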