Derecho-Project / derecho

The main code repository for the Derecho project.
BSD 3-Clause "New" or "Revised" License

Unsynchronized group startups can hang #213

Closed · etremel closed 3 years ago

etremel commented 3 years ago

While testing my code for #211 (which fixes #205), I discovered another bug that is present in the master branch (without any of my changes). It seems like, despite all my previous efforts to handle large numbers of joins correctly (for example, in fixing #126), it's still possible for a view change to get stuck when a big herd of nodes all start up and try to join a group at the same time, rather than joining one at a time. This only happens when the group has flexible-sized shards and can install a first view with a smaller number of members; an unsynchronized startup works fine if all of the nodes are necessary for the first view to be adequate.
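For concreteness, here is a minimal sketch of the "adequate first view" idea, assuming (only for illustration) a minimum shard size of 1; this is not the test program's actual provisioning code:

```cpp
#include <iostream>

// Hypothetical, simplified adequacy check: a view is "adequate" when every shard
// can be given at least its minimum number of members, so with flexible shard
// sizes the first view can be installed before all 8 nodes have joined.
bool is_adequate(int num_members, int num_shards, int min_per_shard) {
    return num_members >= num_shards * min_per_shard;
}

int main() {
    std::cout << std::boolalpha;
    // view_change_test 2 4: 2 shards with a max of 4 members each; the minimum of
    // 1 member per shard here is an assumption, not the test program's setting.
    std::cout << is_adequate(3, 2, 1) << "\n";  // true: View 1 can form with only 3 of the 8 nodes
    std::cout << is_adequate(1, 2, 1) << "\n";  // false: not enough members for both shards yet
}
```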

This is how I can reproduce the problem:

  1. Prepare 8 nodes to run view_change_test with 2 shards and a max size of 4, i.e. view_change_test 2 4
  2. Start all 8 nodes at once
  3. All the nodes will successfully install views 1, 2, and 3, but two of them will then erroneously detect a new change proposal from the leader and wedge View 3. The leader will not continue a view change to View 4, and these nodes will remain stuck.
  4. Meanwhile, the nodes in a different shard from the two stuck ones will continue sending updates in View 3, unaware that the other shard is wedged (since the leader did not actually propose any new changes after View 3).

The spurious "new change proposal" that the stuck nodes detect seems to be the last proposal that was included in View 3, namely the proposal to add the slowest-starting of the 8 nodes. I'm guessing that they somehow failed to increment their num_acked or num_installed counters when they installed View 3, even though the other nodes did it correctly.

Once I finish testing and merging #211, I'll track down and fix this bug, but I'm going to leave it for now since it's unrelated to my current changes.

etremel commented 3 years ago

Update: After implementing the fix to the non-leader nodes' view-change trigger in e5b7ae2a525c75ebaadb46b11ae35a4bed43c496, I have identified another reason why nodes can get stuck after a rapid sequence of View changes. The problematic behavior looks like this:

  1. The 8 nodes all start up, and nodes 0, 5, and 2 create View 1. Then nodes 3 and 6 join in View 2.
  2. Node 6 installs View 2, then starts its first ordered_send. The ordered_send completes a call to multicast_group->send, which creates message 0 in the SSTMC outgoing buffer, and returns to the client.
  3. Nodes 1, 7, and 4 join, which starts the transition to View 3, so Node 6 wedges its view before message 0 can be sent.
  4. Node 6 goes through the view-change process, which includes "Waiting for pending SST sends to finish" (this message is printed from the terminate_epoch method), but SSTMC message 0 is not sent.
  5. View 3 is installed, and the other nodes in Node 6's shard begin sending messages. The SSTMC delivery trigger delivers these messages to node 6, but message 0 is never delivered. Thus, node 6's main thread stays blocked on the QueryResults::get() method for its first ordered_send (since message 0 was never sent, a reply map was never delivered to the QueryResults), and it never sends any more messages.
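As a self-contained analogue of step 5 (this is plain C++, not Derecho code): the reply for an ordered_send is only fulfilled once the message is actually sent and its replies are delivered, so a message that is created but never sent leaves the caller blocked forever on get().

```cpp
#include <chrono>
#include <future>
#include <iostream>

int main() {
    // Stand-in for the QueryResults of Node 6's first ordered_send: the promise
    // would be fulfilled once message 0 is actually sent and its replies arrive.
    std::promise<int> reply_promise;
    std::future<int> reply = reply_promise.get_future();

    // Message 0 sits in the outgoing buffer but is never sent, so nothing ever
    // calls reply_promise.set_value(...). The equivalent of QueryResults::get()
    // would therefore block forever, just like Node 6's main thread:
    // reply.get();

    // Using a timeout instead, so the hang is observable without reproducing it:
    if (reply.wait_for(std::chrono::seconds(1)) == std::future_status::timeout) {
        std::cout << "still waiting for a reply that will never arrive\n";
    }
}
```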

The part that seems to be going wrong is that the "Waiting for pending SST sends to finish" step doesn't send the message that a previous call to MulticastGroup::send() created. I didn't write the SSTMC send logic, though, so I don't have a clear idea of why this might be happening. The wait is implemented by the method check_pending_sst_sends(), which simply returns the value of pending_sst_sends[subgroup_num]. For some reason, though, the send() method sets pending_sst_sends to false even though it doesn't wait for the message to actually be sent before returning; it simply increments committed_sst_index, which will later cause sst_send_trigger() to send the message. Is this behavior expected? How does the send() method guarantee that the message it created will definitely get sent? Something must be violating an invariant that send() expects to hold, since send() returns "success" to the caller for message 0, but message 0 never actually gets sent.
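To make the suspected mismatch concrete, here is a simplified, self-contained model; the names mirror the issue text, not the real MulticastGroup internals. send() merely queues the message by bumping a committed index and then clears the pending flag, so a wait that only looks at the flag can conclude there is nothing left to send while the queued message has still not gone out:

```cpp
#include <atomic>
#include <iostream>

// Simplified stand-ins for the members discussed above (illustrative only; the
// real code indexes pending_sst_sends by subgroup, which is omitted here).
std::atomic<bool> pending_sst_sends{false};
std::atomic<int> committed_sst_index{-1};  // last message handed off by send()
std::atomic<int> sent_sst_index{-1};       // last message actually pushed out by sst_send_trigger()

void send() {
    pending_sst_sends = true;
    // ... fill the next slot in the outgoing buffer ...
    committed_sst_index++;      // message 0 is now "committed" but not yet sent
    pending_sst_sends = false;  // cleared even though the message has not gone out
}

bool check_pending_sst_sends() {
    // Looks only at the flag; it cannot see that committed_sst_index (0) is still
    // ahead of sent_sst_index (-1), i.e. that message 0 is still waiting to be sent.
    return pending_sst_sends;
}

int main() {
    send();
    // terminate_epoch's "Waiting for pending SST sends to finish" step would pass
    // here even though sst_send_trigger() has not yet sent message 0:
    std::cout << "pending=" << check_pending_sst_sends()
              << " committed=" << committed_sst_index
              << " sent=" << sent_sst_index << "\n";
}
```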

KenBirman commented 3 years ago

This would seem to be an issue with @sagarjha's code path to reissue the multicast send (message 0 from node 6).

The way the protocol is supposed to work in this case is as follows:

  1. Everything is as you described through step 3.
  2. But this implies that the send for message 0 will "fail": message 0 cannot be sent in view 2. As a result, message 0 remains available to send, but needs to be "resent" in view 3 or a future view. In effect, the trigger for sending it fires twice: once trying to send it in view 2 (which fails because the group is wedged), after which the message becomes sendable "again" in view 3, at which point the send actually does go through. Or, for that matter, the group could wedge again (perhaps due to a failure), in which case the message didn't achieve stability, was trimmed, and will be auto-resent in view 3.

The logic needs to preserve the sender order used by node 6. Instead of message 0, imagine that this was message 1000 and that the application logic has already prepared messages 1001 through 2000: message 1000 needs to be resent without losing its place in the sequence, and this must work whether the messages are sent via SMC or RDMC.
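A minimal sketch of that invariant (independent of the actual Derecho implementation): messages that did not get sent, or did not achieve stability, in the old view stay at the front of the sender's queue and are resent first in the new view, so message 1000 still precedes 1001 through 2000.

```cpp
#include <deque>
#include <iostream>

// Illustrative only: a per-sender FIFO of messages that have been handed to
// send() but are not yet stable in the current view.
struct SenderQueue {
    std::deque<int> unstable;  // message ids, in the order the application issued them

    void on_view_change() {
        // Nothing is dropped or reordered: everything still unstable in the old
        // view is resent from the front of the queue once the new view is installed.
        for (int id : unstable) {
            std::cout << "resending message " << id << " in the new view\n";
        }
    }
};

int main() {
    SenderQueue node6;
    node6.unstable = {1000, 1001, 1002};  // 1000 never went out before the wedge
    node6.on_view_change();               // 1000 is resent before 1001 and 1002
}
```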

sagarjha commented 3 years ago

Edward and I met today for 4 hours to work on this. There was a bug in the SST part, introduced after we added asynchronous sending. That turned out to be an easy fix, but then we ran into a problem with the RPC function logic, which took a long time to diagnose. Edward is going to fix the bug himself and test things out, and we will meet again if needed.

etremel commented 3 years ago

All bugs related to this issue were fixed in #216.