IntelligentSoftwareSystems / Galois

Galois: C++ library for multi-core and multi-node parallelization
http://iss.ices.utexas.edu/?p=projects/galois

Question about the MPI network layer (is message order guaranteed?) #397

Open Steamgjk opened 2 years ago

Steamgjk commented 2 years ago

Hi, Galois Staff.

I am currently studying the network layer of Galois (mainly NetworkBuffered.cpp and NetworkIOMPI.cpp).

I notice that, in async mode, Galois calls MPI_Isend and MPI_Irecv for communication, and messages are given different tags (based on galois::runtime::evilPhase).

I understand that messages are sent in ascending order of tag (evilPhase).

But I found (see the link below) that MPI does not guarantee that messages are received in the order they were sent if they carry different tags: https://stackoverflow.com/questions/20909648/is-there-a-fifo-per-process-fifo-guarantee-on-mpi-isend

That means when MPI_Isend sends two messages with tag1 and tag2 respectively, where tag1 < tag2 (say tag1=1, tag2=2), the other node may receive the tag2 message before the tag1 message, because in https://github.com/IntelligentSoftwareSystems/Galois/blob/59d5aa54b4a7b6594660bb1b18b83ae6875b954c/libdist/src/NetworkIOMPI.cpp#L147

the probe matches any tag. So the tag2 message may be received first, put into the done queue, and copied by the workerThread into recvBuffer.data:
https://github.com/IntelligentSoftwareSystems/Galois/blob/59d5aa54b4a7b6594660bb1b18b83ae6875b954c/libdist/src/NetworkBuffered.cpp#L414-L425

Later, syncNetRecv calls receiveTagged with tag=1:

https://github.com/IntelligentSoftwareSystems/Galois/blob/59d5aa54b4a7b6594660bb1b18b83ae6875b954c/libgluon/include/galois/graphs/GluonSubstrate.h#L2490-L2504

But the message at the head of the queue has tag=2: https://github.com/IntelligentSoftwareSystems/Galois/blob/59d5aa54b4a7b6594660bb1b18b83ae6875b954c/libdist/src/NetworkBuffered.cpp#L476

https://github.com/IntelligentSoftwareSystems/Galois/blob/59d5aa54b4a7b6594660bb1b18b83ae6875b954c/libdist/src/NetworkBuffered.cpp#L77-L78

In that case, we can never match the tag and retrieve the message with tag=1 (because the data queue is a deque rather than a priority queue). If such reordering really occurs and the message with the larger tag (tag=2) is received first, it would cause head-of-line blocking. I am not sure whether this can actually happen, so I hope to get some explanation from the staff.
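To make the concern concrete, here is a deliberately simplified sketch of the pattern I am describing (this is not the actual Galois code; TaggedMessage, recvQueue, and tryReceiveTagged are made-up names):

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

// Simplified model of the receive path as I understand it:
// the worker thread appends (tag, payload) entries in arrival order,
// and a tagged receive only ever inspects the front of the FIFO.
struct TaggedMessage {
  uint32_t tag;
  std::vector<uint8_t> data;
};

std::deque<TaggedMessage> recvQueue; // filled by the worker thread in probe order

std::optional<TaggedMessage> tryReceiveTagged(uint32_t wantedTag) {
  if (!recvQueue.empty() && recvQueue.front().tag == wantedTag) {
    TaggedMessage m = std::move(recvQueue.front());
    recvQueue.pop_front();
    return m;
  }
  // If the front entry carries tag=2 while we keep asking for tag=1,
  // this always returns empty and we never make progress.
  return std::nullopt;
}
```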

Steamgjk commented 2 years ago

BTW: what is the intuition behind the assert statement? Why are you sure that data.back().data.size() will never equal the right-hand side?

https://github.com/IntelligentSoftwareSystems/Galois/blob/59d5aa54b4a7b6594660bb1b18b83ae6875b954c/libdist/src/NetworkBuffered.cpp#L226-L229

roshandathathri commented 2 years ago

NetworkInterface does not guarantee in-order delivery (even for the same tag). Gluon(-Async) ensures that all messages are eventually received. You can read the Gluon-Async paper for more details.

The assert in line 226 is just asserting that a message with all 0s is NOT sent/received.

Steamgjk commented 2 years ago

Hi, @roshandathathri
Thanks for your explanation. I am a bit surprised by "no in-order delivery EVEN for the SAME tag". Does that mean the answer here is wrong? https://stackoverflow.com/questions/20909648/is-there-a-fifo-per-process-fifo-guarantee-on-mpi-isend

"In MPI, you are guaranteed that all messages on the same communicator/tag/rank combo will be received in the same order they were sent."

roshandathathri commented 2 years ago

No, MPI guarantees in-order delivery for the same tag. NetworkInterface doesn't, because it can buffer messages and return smaller, later messages before larger, earlier messages are received.
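For illustration, here is a minimal standalone MPI sketch (not taken from Galois) of that effect: both messages use the same tag, so MPI matches the receives in send order, but the smaller, later message can complete first, and a layer that hands back buffers as soon as they complete will return it out of order.

```cpp
#include <mpi.h>
#include <vector>

// Minimal sketch (not Galois code): same communicator, source, and tag, so MPI
// matches the receives in send order, yet completion order can differ.
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  const int tag = 0;

  if (rank == 0) {
    std::vector<char> big(1 << 26, 'a'), small(16, 'b');
    MPI_Request r[2];
    MPI_Isend(big.data(), (int)big.size(), MPI_CHAR, 1, tag, MPI_COMM_WORLD, &r[0]);
    MPI_Isend(small.data(), (int)small.size(), MPI_CHAR, 1, tag, MPI_COMM_WORLD, &r[1]);
    MPI_Waitall(2, r, MPI_STATUSES_IGNORE);
  } else if (rank == 1) {
    std::vector<char> big(1 << 26), small(16);
    MPI_Request r[2];
    // Posted in order: r[0] matches the earlier (big) message, r[1] the later (small) one.
    MPI_Irecv(big.data(), (int)big.size(), MPI_CHAR, 0, tag, MPI_COMM_WORLD, &r[0]);
    MPI_Irecv(small.data(), (int)small.size(), MPI_CHAR, 0, tag, MPI_COMM_WORLD, &r[1]);
    int done0 = 0, done1 = 0;
    while (!done0 || !done1) {
      if (!done0) MPI_Test(&r[0], &done0, MPI_STATUS_IGNORE);
      if (!done1) MPI_Test(&r[1], &done1, MPI_STATUS_IGNORE);
      // r[1] (small, sent later) may complete before r[0] (big, sent earlier),
      // so a buffering layer that returns completed buffers immediately
      // delivers the later message first.
    }
  }
  MPI_Finalize();
  return 0;
}
```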

Steamgjk commented 2 years ago

> No, MPI guarantees in-order delivery for the same tag. NetworkInterface doesn't, because it can buffer messages and return smaller, later messages before larger, earlier messages are received.

Cool, @roshandathathri. We are on the same page regarding "in-order delivery for the same tag".

Then it comes back to my original question about the implementation. Let me rephrase it more briefly. (Please correct me if I am wrong.)

The receiver side uses one workerThread with a while loop, which keeps doing Send, Probe, and Receive with MPI.

When something is probed, it calls MPI_Irecv to get the message and puts it into a FIFO queue (a std::deque). Incoming messages are then dequeued in FIFO order when other threads call receiveTagged with a tag parameter.

My concern is that the tag parameter is monotonically increasing (following the evilPhase variable), so the application must receive tag=1 first and then tag=2.

But if the message with tag=2 is probed and enqueued earlier, it stays at the front of the queue. Whenever receiveTagged is called with tag=1, it checks rq.hasData, finds that the front message's tag is not 1, and returns false. In that case, will it get stuck there (head-of-line blocking)?

roshandathathri commented 2 years ago

Yes, the application is expected to check for both tags and NOT block/wait on one tag.
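To illustrate that pattern, here is a minimal sketch that reuses the made-up front-only tryReceiveTagged helper from the earlier sketch (again, not the Gluon code): poll the current phase's tag, but also drain the other tag, so a message that arrives early for the next phase never sits at the head of the queue and stalls the current phase.

```cpp
#include <cstdint>
#include <deque>
#include <iostream>
#include <optional>

// Same simplified front-only FIFO as in the earlier sketch (not the Galois API).
struct TaggedMessage { uint32_t tag; int payload; };
std::deque<TaggedMessage> recvQueue;

std::optional<TaggedMessage> tryReceiveTagged(uint32_t wantedTag) {
  if (!recvQueue.empty() && recvQueue.front().tag == wantedTag) {
    TaggedMessage m = recvQueue.front();
    recvQueue.pop_front();
    return m;
  }
  return std::nullopt;
}

int main() {
  // Suppose a tag=2 message was buffered ahead of the tag=1 message.
  recvQueue = {{2, 42}, {1, 7}};

  std::deque<TaggedMessage> nextPhase; // stash for messages that arrive early
  bool gotTag1 = false;
  while (!gotTag1) {
    // Poll the current phase's tag, but never block on it exclusively ...
    if (auto m = tryReceiveTagged(1)) {
      std::cout << "phase 1 message: " << m->payload << "\n";
      gotTag1 = true;
    }
    // ... and also drain the other tag so it cannot clog the queue head.
    if (auto m = tryReceiveTagged(2)) {
      nextPhase.push_back(*m);
    }
  }
  std::cout << "stashed for phase 2: " << nextPhase.size() << "\n";
  return 0;
}
```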