Derecho-Project / derecho

The main code repository for the Derecho project.
BSD 3-Clause "New" or "Revised" License

Fix crash in total restart caused by issue #252 #257

Closed: etremel closed this 10 months ago

etremel commented 10 months ago

This is a fix for the bug identified in #252. A node that thinks it is not in total-restart mode but gets a TOTAL_RESTART response from the leader previously crashed, because it attempted to send a null curr_view to the leader. It now recovers by creating an empty curr_view with VID -1. The restart leader will ignore this view, since it has an "older" VID than the leader's current view, and will treat the new node as a very out-of-date node that was not in the last known view. Thus the new node can't contribute to achieving a restart quorum (which it shouldn't), but it can be added as a new member as soon as the restart finishes.
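To illustrate the recovery path described above, here is a minimal sketch. It is not the code from this PR: `View`, `view_to_send`, `leader_should_adopt`, and `counts_toward_quorum` are hypothetical names invented for this example, and the real Derecho view type has many more fields.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical, stripped-down stand-in for a view (not Derecho's real class).
struct View {
    int32_t vid;                    // view ID; -1 marks the empty placeholder
    std::vector<uint32_t> members;  // node IDs that belong to this view
};

// Joining-node side: if the leader answered TOTAL_RESTART but this node has
// no logged view, substitute an empty view with VID -1 so that nothing
// downstream dereferences a null pointer.
std::unique_ptr<View> view_to_send(std::unique_ptr<View> logged_view) {
    if(!logged_view) {
        logged_view = std::make_unique<View>(View{-1, {}});
    }
    assert(logged_view != nullptr);  // postcondition: never return null
    return logged_view;
}

// Leader side: a received view replaces the leader's view only if it is newer.
// A VID of -1 is older than any real view, so the placeholder is ignored.
bool leader_should_adopt(const View& leader_view, const View& received) {
    return received.vid > leader_view.vid;
}

// Leader side: only members of the last known view count toward the restart
// quorum, so a node that sent the placeholder does not count.
bool counts_toward_quorum(const View& last_known_view, uint32_t node_id) {
    const auto& m = last_known_view.members;
    return std::find(m.begin(), m.end(), node_id) != m.end();
}
```

In this sketch, a node with no logged view still sends something well-formed, and the leader's ordinary "is this view newer?" comparison is enough to discard it without a special case.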

While fixing this, I realized that the restart leader would also crash when the non-restarting node crashed, because it didn't properly handle a node failing to send its view (it assumed that any node that successfully connected would at least be able to send its view). I added better error-handling code to RestartLeaderState so that a joining node's failure can't make the leader crash.
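The kind of error handling described above can be sketched as follows. This is an assumption-laden illustration, not the actual RestartLeaderState code: `LeaderSketch`, `handle_join`, and the `ViewReceiver` callback are hypothetical, and the sketch assumes that a failed receive surfaces as a `std::runtime_error`.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <set>
#include <stdexcept>

// Hypothetical, stripped-down view: only the ID matters for this sketch.
struct View {
    int32_t vid;
};

// Hypothetical callback that reads a view from the joining node's socket.
// Assumption: it throws std::runtime_error if the node fails mid-transfer.
using ViewReceiver = std::function<View(uint32_t)>;

// Hypothetical stand-in for the leader's restart bookkeeping.
struct LeaderSketch {
    std::set<uint32_t> rejoined_nodes;  // nodes that completed the handshake
    int32_t latest_vid = -1;            // newest view ID seen so far

    // Returns true if the node's view was received and recorded. A failure
    // while receiving is treated as "this node did not rejoin" rather than
    // being allowed to propagate and take down the leader.
    bool handle_join(uint32_t node_id, const ViewReceiver& receive) {
        assert(receive);  // an empty callback is a programming error
        try {
            View v = receive(node_id);
            if(v.vid > latest_vid) {
                latest_vid = v.vid;
            }
            rejoined_nodes.insert(node_id);
            return true;
        } catch(const std::runtime_error&) {
            rejoined_nodes.erase(node_id);
            return false;
        }
    }
};
```

The design choice shown here is to make a connected-but-silent node indistinguishable from a node that never connected, so the leader's quorum logic needs no extra state for half-finished joins.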