Derecho-Project / derecho

The main code repository for the Derecho project.
BSD 3-Clause "New" or "Revised" License

LibFabric is missing the sync after creating an RDMA connection #146

Closed (ellerre closed this issue 5 years ago)

ellerre commented 5 years ago

The final step of creating an RDMA connection is a call to connect_endpoint, which ensures that the local node is ready to send and receive data on the connection. To make sure that the other side of the connection has also finished this function, we used to have a barrier in the verbs code. That barrier was missed in the libfabric code, which caused a race condition at the beginning of an SST test: one node successfully finished an RDMA write operation, yet the other side never saw the update. The fix is simply to have the nodes creating the connection synchronize at the end of connection setup.
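
To illustrate the race and the fix, here is a minimal toy sketch in plain C++. It is not Derecho or libfabric code. Two threads stand in for the two nodes. `Endpoint`, `connect_endpoint_stub`, `remote_write`, and `TwoPartySync` are hypothetical names invented for this example, and the "RDMA write" is modeled as a store that is silently dropped if the peer's endpoint is not ready yet.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical two-party barrier standing in for the sync step
// (an assumption for illustration, not Derecho's actual mechanism).
class TwoPartySync {
public:
    // Blocks the caller until both parties have arrived.
    void arrive() {
        std::unique_lock<std::mutex> lock(mutex_);
        ++arrived_;
        if (arrived_ == 2) {
            cv_.notify_all();
        } else {
            cv_.wait(lock, [this] { return arrived_ == 2; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int arrived_ = 0;
};

// Toy model of one side of the connection.
struct Endpoint {
    std::atomic<bool> ready{false};  // set once "connect_endpoint" has finished
    std::atomic<int> sst_cell{0};    // memory the peer writes into
};

// Hypothetical stand-in for connect_endpoint: marks the local side ready.
void connect_endpoint_stub(Endpoint& local) { local.ready.store(true); }

// Toy "RDMA write": completes from the writer's point of view either way,
// but the update is lost if the remote endpoint is not ready yet.
void remote_write(Endpoint& remote, int value) {
    if (remote.ready.load()) {
        remote.sst_cell.store(value);
    }
}

// Runs both "nodes" with the sync in place; returns true if each side
// observed the other's write.
bool run_handshake_demo() {
    Endpoint a, b;
    TwoPartySync sync;

    std::thread node_a([&] {
        connect_endpoint_stub(a);
        sync.arrive();       // wait until node B has also finished connecting
        remote_write(b, 1);  // safe: B is known to be ready
    });
    std::thread node_b([&] {
        // Simulate a slower node; without the sync, A's write would be lost.
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        connect_endpoint_stub(b);
        sync.arrive();
        remote_write(a, 2);
    });

    node_a.join();
    node_b.join();
    return a.sst_cell.load() == 2 && b.sst_cell.load() == 1;
}
```

Calling `run_handshake_demo()` from a `main` should return `true`. Removing the two `sync.arrive()` calls reintroduces the race: node A may write before node B is ready, so B never sees the update, which is the symptom described above.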