cowsql / raft

Asynchronous C implementation of the Raft consensus protocol
https://raft.readthedocs.io

Add read handler to uv_send to detect remote socket close faster #188

Open · NorbertHeusser opened 4 months ago

NorbertHeusser commented 4 months ago

While using the library in a Docker Swarm or Kubernetes environment we ran into an issue where a simple three-node cluster stayed in leader election for about 20 minutes. The problem is caused by the way the raft library uses TCP streams, combined with the way Docker Swarm or Kubernetes handle the network sockets of a service that gets restarted.

The raft library currently uses two different TCP streams between each pair of nodes in the cluster. Each node sends data over the TCP stream it opened as a client, and reads data over the TCP stream it accepted as a server from the other node. As a consequence, all TCP streams in the raft library are used uni-directionally.

If a service container is restarted after a crash in a Docker Swarm or Kubernetes environment, the TCP endpoints of the crashed container are closed and a new service container with a new private IP gets spawned, potentially on a different node. Each remaining node will detect the remote close on the stream it accepted as a server on the next read. But on the client stream used for sending, it can still add messages to the TCP output buffer without noticing the remote close; the close is only detected once the output buffer fills up.

In a very simple test case with a three-node cluster, we kill the current leader A and, once a new leader B is elected, kill B as well. The remaining third node C is now stuck with remotely closed outgoing connections to the killed nodes A and B, and A is stuck with a remotely closed outgoing connection to B. At that point the only traffic on those outgoing connections is leader election messages, so the output buffers fill very slowly and the cluster stays stuck in leader election for about 20 minutes.
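
Not part of the original report, but as a minimal illustration of the asymmetry described above: on a plain POSIX socket, an orderly remote close (FIN) is visible immediately to a read, while writes keep succeeding into the kernel's send buffer. The helper name `peer_has_closed` is purely illustrative and not part of the raft library.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: check a send-only TCP socket for a remote close
 * without blocking and without consuming any data. */
static bool peer_has_closed(int fd)
{
    char buf[1];
    ssize_t n = recv(fd, buf, sizeof buf, MSG_DONTWAIT | MSG_PEEK);
    if (n == 0) {
        return true;  /* recv() returning 0 means the peer sent FIN */
    }
    if (n > 0) {
        return false; /* unexpected data, but the peer is still alive */
    }
    /* EAGAIN/EWOULDBLOCK: no data and no FIN yet; anything else
     * (e.g. ECONNRESET) also means the connection is gone. */
    return errno != EAGAIN && errno != EWOULDBLOCK;
}
```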

For the current fix we simply call uv_read_start() on the outgoing TCP streams as well. Since we don't expect any incoming data on these streams, any invocation of the read callback is treated as a remote close, and we close the TCP stream and socket. The existing reconnect code then handles the socket close and reconnects to the remote node.
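
A minimal libuv sketch of that idea, assuming an outgoing stream that is otherwise only written to; `close_and_reconnect()` stands in for the library's existing reconnect path, and none of the names below are the actual identifiers used in uv_send.c:

```c
#include <uv.h>

void close_and_reconnect(uv_stream_t *stream); /* hypothetical, provided elsewhere */

static void out_alloc_cb(uv_handle_t *handle, size_t suggested, uv_buf_t *buf)
{
    /* No real data is ever expected, so a tiny static buffer is enough. */
    static char scratch[64];
    (void)handle;
    (void)suggested;
    buf->base = scratch;
    buf->len = sizeof scratch;
}

static void out_read_cb(uv_stream_t *stream, ssize_t nread, const uv_buf_t *buf)
{
    (void)buf;
    if (nread == 0) {
        return; /* nothing happened; libuv may invoke the callback with 0 */
    }
    /* Any data or any error (UV_EOF, UV_ECONNRESET, ...) on the outgoing
     * stream is treated as a remote close: tear it down and let the
     * existing reconnect logic open a fresh connection. */
    close_and_reconnect(stream);
}

int watch_outgoing_stream(uv_stream_t *stream)
{
    return uv_read_start(stream, out_alloc_cb, out_read_cb);
}
```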

NorbertHeusser commented 4 months ago

A more sophisticated solution would be to replace the two uni-directional TCP streams between each node pair with a single bidirectional TCP stream. For example, each node could open a connection to all nodes with a node_id higher than its own (see the sketch below). This would, however, require bigger changes to uv_send.c and uv_recv.c, as the TCP stream would then be used in both directions. Even a rolling-upgrade migration path would be possible: first add a real read handler to uv_send.c, and in a second step change the connect behavior.
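
A sketch of the connect rule for the single-stream variant, assuming the `raft_id` type from raft.h; `should_connect` is an illustrative helper, not existing code:

```c
#include <stdbool.h>
#include <raft.h>

/* With one bidirectional stream per node pair, only one side may dial.
 * Here each node dials every peer with a higher id, so exactly one
 * connection exists between any two nodes.  Any rule works as long as it
 * is applied consistently across the cluster. */
static bool should_connect(raft_id self, raft_id peer)
{
    return self < peer;
}
```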

NorbertHeusser commented 4 months ago

Fixed a lint formatting issue using clang-format.

freeekanayaka commented 4 months ago

Thanks, this looks good to me. I believe another option would be to set a TCP keepalive heartbeat, but the solution in this branch seems more lightweight.
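
For reference, the keepalive alternative could look roughly like this with libuv's `uv_tcp_keepalive()`; the 30-second delay is an arbitrary example value:

```c
#include <uv.h>

/* Enable SO_KEEPALIVE on an outgoing connection with a 30 s idle delay.
 * Probe interval and count stay at the OS defaults, so detecting a dead
 * peer can still take a few additional probes. */
int enable_keepalive(uv_tcp_t *tcp)
{
    return uv_tcp_keepalive(tcp, 1, 30);
}
```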

freeekanayaka commented 4 months ago

Some of the downstream tests are failing.

I've fixed one test (incus), that was failing for an unrelated reason. But the other two failing tests (cowsql and dqlite) might actually be related to the change in this PR.

Please would you rebase this branch on top of the current cowsql/raft main branch?

That should clear at least the incus test failure. Then we'll probably need to look at the other two failures before merging this.

NorbertHeusser commented 4 months ago

Rebased the branch. Let's see what the results of the tests are. The incus test is green now, but the other two are still failing. Anything I can do?

freeekanayaka commented 4 months ago

> Rebased the branch. Let's see what the results of the tests are. The incus test is green now, but the other two are still failing. Anything I can do?

Thanks. As I suspected, the failure that was definitely unrelated to this PR is now fixed, but the other two failures are still there. See:

https://github.com/cowsql/raft/actions/runs/9693679205/job/26749672239?pr=188

It's probably asking too much that you take a look at those failures, so I might do it myself, but beware that it will likely take me a while before I get into it.

NorbertHeusser commented 2 months ago

I tried to understand the problem with the failing downstream tests, but due to my lack of knowledge of the Go language and of the tests themselves I was not able to make any progress.

freeekanayaka commented 2 months ago

> I tried to understand the problem with the failing downstream tests, but due to my lack of knowledge of the Go language and of the tests themselves I was not able to make any progress.

Thanks a lot for having tried. I'm afraid I haven't been able to tackle this yet.