Derecho-Project / derecho

The main code repository for the Derecho project.
BSD 3-Clause "New" or "Revised" License

During total restart, new nodes crash if they try to join #252

Closed by etremel 10 months ago

etremel commented 1 year ago

I noticed this bug in the restart-from-logs procedure, and I intend to fix it when I get the chance. I'm writing it down here for documentation purposes.

If the Derecho group leader starts the total-restart process (i.e. starts up and notices there is logged state on disk), it will expect every other node that contacts it to be another restarting node, with its own logged state to recover and synchronize with the leader. However, I neglected to consider the possibility that a new, non-restarting node happens to attempt to join the group while the leader is still in total-restart mode. If this happens, the leader will tell the new node to send its logged View, and the new node will promptly crash because it has no logged View to send. Specifically, when the new node gets the response JoinResponseCode::TOTAL_RESTART from the leader in ViewManager's receive_initial_view(), it will attempt to serialize and send curr_view to the leader, but curr_view is a null pointer because the node has not yet received the initial view and had no logged View on disk to load into curr_view when it started up.
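To make the failure condition concrete, here is a minimal sketch of the joining node's state at the point where it receives TOTAL_RESTART. The View struct and the JoiningNode class are simplified, hypothetical stand-ins and not Derecho's actual ViewManager code; only the name curr_view is taken from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical, simplified stand-in for Derecho's View (not the real class).
struct View {
    int32_t vid;
    std::vector<uint32_t> members;
};

// Hypothetical model of a node that is about to contact the leader.
class JoiningNode {
    // Remains null unless a logged View was found on disk at startup.
    std::unique_ptr<View> curr_view;

public:
    // logged_view is null for a brand-new node with no state on disk.
    explicit JoiningNode(const View* logged_view) {
        if(logged_view != nullptr) {
            curr_view = std::make_unique<View>(*logged_view);
        }
    }

    // True only if answering TOTAL_RESTART by sending *curr_view is safe.
    bool can_send_logged_view() const {
        return curr_view != nullptr;
    }

    // The step that crashes in the bug: it dereferences curr_view.
    // The assert documents the precondition that a new node violates.
    const View& logged_view_to_send() const {
        assert(curr_view != nullptr);
        return *curr_view;
    }
};
```

A restarting node constructed with a logged View satisfies the precondition, while a new node constructed with a null pointer does not, which is the case the current code never checks for.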

I can probably handle this by having the new node send an "empty" View (one that exists but has no members) to the leader as a way to indicate that it is not a restarting node, and then having the leader defer all non-restarting nodes until after the restart process has finished (similar to how we defer external-client requests until after the group has started up).
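A rough sketch of the leader side of that idea might look like the following. Everything here (RestartLeader, JoinAction, handle_view, finish_restart) is a hypothetical name invented for illustration and is not part of Derecho's API; the sketch only assumes that an "empty" View is one with no members.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical, simplified stand-in for Derecho's View (not the real class).
struct View {
    int32_t vid;
    std::vector<uint32_t> members;
};

// Hypothetical outcomes for a node that contacts the leader.
enum class JoinAction {
    RESTART_PARTICIPANT,  // node sent a logged View and takes part in recovery
    DEFERRED,             // node sent an empty View and must wait
    ADMITTED              // restart is over, so the normal join path applies
};

// Hypothetical model of the leader while it is in total-restart mode.
class RestartLeader {
    bool in_total_restart = true;
    std::vector<uint32_t> restarting_nodes;
    std::deque<uint32_t> deferred_joins;

public:
    // Classify a node by the View it sent in response to TOTAL_RESTART.
    JoinAction handle_view(uint32_t node_id, const View& sent_view) {
        if(!in_total_restart) {
            return JoinAction::ADMITTED;
        }
        if(sent_view.members.empty()) {
            // An empty View marks a new, non-restarting node: queue it.
            deferred_joins.push_back(node_id);
            return JoinAction::DEFERRED;
        }
        restarting_nodes.push_back(node_id);
        return JoinAction::RESTART_PARTICIPANT;
    }

    // End the restart and hand back the deferred nodes in arrival order.
    std::vector<uint32_t> finish_restart() {
        assert(in_total_restart);
        in_total_restart = false;
        std::vector<uint32_t> ready(deferred_joins.begin(), deferred_joins.end());
        deferred_joins.clear();
        return ready;
    }
};
```

In this sketch the deferred nodes are returned in the order they arrived once the restart finishes, so the caller could feed them into the ordinary join procedure; whether the real fix should preserve that order is an open design choice.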