I noticed this bug in the restart-from-logs procedure, and I intend to fix it when I get the chance. I'm writing it down here for documentation purposes.
If the Derecho group leader starts the total-restart process (i.e. starts up and notices there is logged state on disk), it will expect every other node that contacts it to be another restarting node, with its own logged state to recover and synchronize with the leader. However, I neglected to consider the possibility that a new, non-restarting node attempts to join the group while the leader is still in total-restart mode. If this happens, the leader will tell the new node to send its logged View, and the new node will promptly crash because it has no logged View to send. Specifically, when the new node gets the response JoinResponseCode::TOTAL_RESTART from the leader in ViewManager's receive_initial_view(), it will attempt to serialize and send curr_view to the leader. But curr_view is a null pointer at that point: the node has not yet received the initial view, and it had no logged View on disk to load into curr_view when it started up.
I can probably handle this by having the new node send an "empty" View (one that exists but has no members) to the leader as a way to indicate that it is not a restarting node, and then having the leader defer all non-restarting nodes until after the restart process has finished (similar to how we defer external-client requests until after the group has started up).
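A rough sketch of that idea, using toy types rather than Derecho's actual classes. The `View` struct, `view_to_send()`, and `RestartLeader` here are all hypothetical stand-ins invented for illustration; the real fix would live in ViewManager and the restart leader logic and would likely look different.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for Derecho's View; only the member list matters here.
struct View {
    std::vector<uint32_t> members;
};

// Joiner side (hypothetical helper): what to send when the leader replies
// TOTAL_RESTART. A node with no logged View sends an empty View instead of
// dereferencing a null curr_view.
View view_to_send(const std::unique_ptr<View>& curr_view) {
    if(!curr_view) {
        return View{};  // empty View signals "I am not a restarting node"
    }
    return *curr_view;
}

// Leader side (hypothetical class): sorts joiners while in total-restart mode.
class RestartLeader {
public:
    // Returns true if the node was accepted as a restarting node,
    // false if it was deferred until the restart finishes.
    bool handle_join(uint32_t node_id, const View& sent_view) {
        if(sent_view.members.empty()) {
            deferred.push_back(node_id);
            return false;
        }
        restarting.push_back(node_id);
        return true;
    }

    // Called once the restart completes; hands back the deferred nodes
    // so they can go through the normal join path.
    std::vector<uint32_t> finish_restart() {
        std::vector<uint32_t> ready;
        ready.swap(deferred);
        assert(deferred.empty());
        return ready;
    }

    std::vector<uint32_t> restarting;
    std::vector<uint32_t> deferred;
};
```

The point of the empty View is that the joiner always has something valid to serialize, and the leader can tell the two kinds of node apart without a new response code. Whether an empty View or a separate flag is the cleaner signal is still open.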