Derecho-Project / derecho

The main code repository for the Derecho project.
BSD 3-Clause "New" or "Revised" License
187 stars 47 forks source link

Improve TCP error handling and external client handling #243

Closed etremel closed 2 years ago

etremel commented 2 years ago

As Ken reported in an e-mail, an external client that attempts to contact the leader while a group is starting up (awaiting its first adequate view) will cause the group to crash. This happens for two reasons:

  1. The await_first_view() method assumes any client that connects to the leader is another replica node starting up, not an external client, so it does not properly handle external client requests the way process_new_sockets() would during normal operation. This means it will either fail or hang when it attempts to read joiner_gms_port and the other ports from the client socket, because external clients don't send this list of ports.
  2. The await_first_view() method doesn't catch TCP exceptions that socket operations might throw when it is doing the initial "handshake" of exchanging version codes and JoinRequest/JoinResponses, even though it properly catches exceptions later when attempting to send each node the initial View. This means that any client (replica node or external client) that opens a TCP socket but then immediately fails would cause the leader to crash during the "handshake" exchange.

To fix the first issue, I added the proper check for join_request.is_external and put any external clients that attempt to join during startup in a pending queue until the initial view commits. To fix the second issue, I added try-catch blocks to all the socket operations involved in handling a join request, both in await_first_view() and in the corresponding code in receive_join(). If a socket operation fails during the initial handshake process, the leader simply drops the failed connection and ignores the join request.

I also added a few more logging statements to the code that runs the initial startup sequence, to make it clearer what the system is waiting for when it appears to be stuck. They're at the "info" level so users will see them even with the default configuration that sets logging to info-only.

While testing these changes, I also improved external_notification_test and added a has_external_client() method to ExternalClientCallback (to help the server avoid a node_removed_from_group_exception if it tries to send a notification before the client is ready).