Derecho-Project / derecho

The main code repository for the Derecho project.
BSD 3-Clause "New" or "Revised" License

Improvements to configuration and restart logic #269

Closed etremel closed 5 months ago

etremel commented 5 months ago

This branch makes two changes aimed at making it easier for nodes to rejoin a group after they crash and restart:

  1. Renames the leader_ip and leader_gms_port config options to contact_ip and contact_port, and removes the leader_external_port config option. This reflects existing behavior: both internal and external clients only need to know the IP and port of any one group member, not necessarily the leader, in order to start up and connect.
  2. Changes the recover-from-logs procedure so that a restarting node first attempts a "normal" join at the node specified by contact_ip (waiting for half of the configured restart_timeout) before examining the restart_leaders list and deciding whether it should act as a restart leader. This allows the node configured as the restart leader to restart on its own and rejoin a group that is still running, instead of attempting to act as a restart leader every time it starts up. It also matches the existing behavior for restarting nodes that are not configured as the restart leader: when they contact the "restart leader" and that leader responds with JoinResponse::OK instead of JoinResponse::TOTAL_RESTART, they do a normal join instead of a total restart.
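
The decision order described in item 2 can be sketched roughly as follows. This is only an illustration, not the code in this branch: RestartConfig, RestartAction, ContactFn, and decide_restart_action are hypothetical names, the local JoinResponse enum merely mirrors the two values mentioned above, and treating "my IP appears in restart_leaders" as the condition for acting as restart leader is a simplifying assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Illustrative stand-in for the two responses named in the PR description.
enum class JoinResponse { OK, TOTAL_RESTART };

// Hypothetical outcome of the startup decision.
enum class RestartAction { NormalJoin, JoinTotalRestart, ActAsRestartLeader, ContactRestartLeader };

// Hypothetical view of the relevant config options.
struct RestartConfig {
    std::string contact_ip;
    std::string my_ip;
    std::vector<std::string> restart_leaders;
    std::chrono::milliseconds restart_timeout;
};

// Hypothetical callback: tries to reach a node for at most `wait`;
// returns no value if the node did not answer in time.
using ContactFn = std::function<std::optional<JoinResponse>(
        const std::string& ip, std::chrono::milliseconds wait)>;

RestartAction decide_restart_action(const RestartConfig& cfg, const ContactFn& try_contact) {
    assert(cfg.restart_timeout.count() >= 0);

    // Step 1: attempt a normal join at contact_ip, waiting half of restart_timeout.
    const auto response = try_contact(cfg.contact_ip, cfg.restart_timeout / 2);
    if(response == JoinResponse::OK) {
        return RestartAction::NormalJoin;  // the group is still running
    }
    if(response == JoinResponse::TOTAL_RESTART) {
        // Assumption: the contact node is itself coordinating a total restart.
        return RestartAction::JoinTotalRestart;
    }

    // Step 2: only now examine restart_leaders (assumed rule: membership in the list).
    const bool is_leader = std::find(cfg.restart_leaders.begin(), cfg.restart_leaders.end(),
                                     cfg.my_ip) != cfg.restart_leaders.end();
    return is_leader ? RestartAction::ActAsRestartLeader
                     : RestartAction::ContactRestartLeader;
}
```

With this ordering, a configured restart leader that crashes alone gets JoinResponse::OK from the running group in step 1 and never reaches the restart-leader branch.
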

songweijia commented 5 months ago

I tested this version and it works.