Update: After implementing the fix to the non-leader nodes' view-change trigger in e5b7ae2a525c75ebaadb46b11ae35a4bed43c496, I have identified another reason why nodes can get stuck after a rapid sequence of View changes. The problematic behavior looks like this:
1. The client (node 6) calls `multicast_group->send()`, which creates message 0 in the SSTMC outgoing buffer and returns to the client.
2. A view change then occurs, and its "Waiting for pending SST sends to finish" step completes without message 0 ever being sent.
3. The client then gets stuck in the `QueryResults::get()` method for its first `ordered_send` (since message 0 was never sent, a reply map was never delivered to the `QueryResults`), and it never sends any more messages.

The part that seems to be going wrong is that the "Waiting for pending SST sends to finish" step doesn't send the message that a previous call to `MulticastGroup::send()` created. I didn't write the SSTMC send logic, though, so I don't have a clear idea why this might be happening. The wait is implemented by the method `check_pending_sst_sends()`, which simply returns the value of `pending_sst_sends[subgroup_num]`, but for some reason the `send()` method sets `pending_sst_sends` to false even though it doesn't wait until the message is sent before returning; it simply increments `committed_sst_index`, which will later cause `sst_send_trigger()` to send the message. Is this behavior expected? How does the `send()` method guarantee that the message it created will definitely get sent? Something must be violating an invariant that `send()` expects to hold true, since it returns "success" to the caller for message 0 but message 0 never actually gets sent.
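To make the suspected gap concrete, here is a minimal sketch of how I read the interaction described above. The names (`pending_sst_sends`, `committed_sst_index`, `check_pending_sst_sends`, `sst_send_trigger`) come from the code under discussion, but the bodies are simplified stand-ins I wrote to illustrate the question, not the actual implementation.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Simplified per-subgroup send state; fields mirror the names discussed
// above, but this is an illustrative model, not the real data structures.
struct SubgroupSendState {
    std::atomic<bool> pending_sst_sends{false};
    int64_t committed_sst_index = -1;  // last message handed off for sending
    int64_t real_sst_index = -1;       // last message actually written to the SST
};

constexpr std::size_t kNumSubgroups = 4;  // arbitrary, for the sketch
std::array<SubgroupSendState, kNumSubgroups> send_state;

// What send() appears to do: it schedules the message and reports success,
// but clears pending_sst_sends before the message has actually gone out.
bool send(std::size_t subgroup_num) {
    auto& s = send_state[subgroup_num];
    ++s.committed_sst_index;       // sst_send_trigger() will pick this up later
    s.pending_sst_sends = false;   // cleared even though nothing was sent yet
    return true;                   // caller sees "success" for message 0
}

// What the view-change cleanup waits on ("Waiting for pending SST sends to
// finish"): it only checks the flag, not whether the trigger has caught up.
bool check_pending_sst_sends(std::size_t subgroup_num) {
    return send_state[subgroup_num].pending_sst_sends;
}

// The predicate-driven trigger that performs the real send, some time later.
void sst_send_trigger(std::size_t subgroup_num) {
    auto& s = send_state[subgroup_num];
    while (s.real_sst_index < s.committed_sst_index) {
        ++s.real_sst_index;
        // ...copy the message into the SST multicast slots here...
    }
}
```

In this model, if the view change calls `check_pending_sst_sends()` after `send()` returns but before `sst_send_trigger()` has caught up to `committed_sst_index`, the wait sees nothing pending and the cleanup proceeds without ever sending message 0, which matches the symptom above.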
This would seem to be an issue with @sagarjha's code path to reissue the multicast send (message 0 from node 6).
The way the protocol is supposed to work in this case is as follows:
The logic needs to preserve the sender order used by node 6. Instead of message 0, imagine that this was message 1000 and that the application logic has already prepared messages 1001 through 2000. Message 1000 needs to be resent without losing its place in that sequence, and this must work whether the message is sent by SMC or RDMC.
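As a hypothetical sketch of what that requirement means (the queue and names here are my own illustration, not the actual resend path): the reissued message must keep its original per-sender index and go back ahead of everything queued after it.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Illustrative model of one sender's outgoing messages.
struct PendingMessage {
    uint64_t sender_index;     // per-sender sequence number, fixed when created
    std::vector<char> payload;
};

struct SenderQueue {
    std::deque<PendingMessage> outgoing;  // e.g. already holds 1001..2000

    // Reissue a message that was created but never sent before the view
    // change. It keeps its original index (1000) and goes to the front, so
    // delivery order at the receivers is still 1000, 1001, ..., 2000. The
    // transport underneath (SMC or RDMC) must not renumber it.
    void reissue(PendingMessage lost) {
        outgoing.push_front(std::move(lost));
    }
};
```

The point is only the ordering constraint: if the resend path gives message 1000 a fresh index, or lets 1001 go out first, the per-sender FIFO guarantee is broken at every receiver.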
Edward and I met today for 4 hours to work on this. There was a bug in the SST part that was introduced after we added asynchronous sending. That turned out to be an easy fix, but then we ran into a problem with the RPC function logic, which took a long time to diagnose. Edward is going to fix the bug himself and test things out, and we will meet again if needed.
All bugs related to this issue were fixed in #216.
While testing my code for #211 (which fixes #205), I discovered another bug that is present in the master branch (without any of my changes). It seems like, despite all my previous efforts to handle large numbers of joins correctly (for example, in fixing #126), it's still possible for a view change to get stuck when a big herd of nodes all start up and try to join a group at the same time, rather than joining one at a time. This only happens when the group has flexible-sized shards and can install a first view with a smaller number of members; an unsynchronized startup works fine if all of the nodes are necessary for the first view to be adequate.
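For readers unfamiliar with the flexible-shard case, here is a generic sketch (hypothetical names, not Derecho's actual allocator API) of the kind of policy involved: each shard declares a minimum and maximum size, and a first view is adequate as soon as the joined nodes can cover every shard's minimum.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical shard-sizing policy, for illustration only.
struct ShardPolicy {
    std::size_t min_nodes;  // shard must have at least this many members
    std::size_t max_nodes;  // shard will take at most this many members
};

// With flexible shards (min_nodes < max_nodes), the first view can be
// installed before the whole herd has joined, so the remaining joins arrive
// as later view changes; that is the window where this bug shows up.
bool first_view_is_adequate(const std::vector<ShardPolicy>& shards,
                            std::size_t num_joined_nodes) {
    std::size_t required = 0;
    for (const auto& shard : shards) {
        required += shard.min_nodes;
    }
    return num_joined_nodes >= required;
}
```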
This is how I can reproduce the problem: start 8 nodes at approximately the same time, each running `view_change_test 2 4`.
The spurious "new change proposal" that the stuck nodes detect seems to be the last proposal that was included in View 3, namely the proposal to add the slowest-starting of the 8 nodes. I'm guessing that they somehow failed to increment their `num_acked` or `num_installed` counters when they installed View 3, even though the other nodes did it correctly.
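To make that guess concrete, here is a simplified sketch of the bookkeeping I suspect is misfiring. The counter names are the ones mentioned above; the comparison logic is my hypothesis about the failure mode, not the actual view-management code.

```cpp
#include <cstdint>

// Simplified, hypothetical model of a node's view-change bookkeeping.
struct ViewChangeCounters {
    uint64_t num_changes = 0;    // membership-change proposals seen so far
    uint64_t num_acked = 0;      // proposals this node has acknowledged
    uint64_t num_installed = 0;  // proposals consumed by views already installed
};

// A node thinks there is a "new change proposal" when it has seen more
// proposals than it has acknowledged.
bool has_new_change_proposal(const ViewChangeCounters& c) {
    return c.num_changes > c.num_acked;
}

// Hypothesized bug: installing View 3 should have advanced num_acked and
// num_installed past the proposal that added the slowest-starting node. If a
// stuck node skips that update, has_new_change_proposal() keeps returning
// true for a proposal the rest of the group already applied, so the node
// keeps detecting a spurious "new change proposal" that never resolves.
void on_view_installed(ViewChangeCounters& c, uint64_t proposals_in_new_view) {
    c.num_installed += proposals_in_new_view;
    c.num_acked = c.num_changes;  // the step the stuck nodes seem to miss
}
```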
Once I finish testing and merging #211, I'll track down and fix this bug, but I'm going to leave it for now since it's unrelated to my current changes.