eBay / NuRaft

C++ implementation of Raft core logic as a replication library
Apache License 2.0
1.02k stars 241 forks source link

RPC should set connection timeout. #411

Open JackyWoo opened 1 year ago

JackyWoo commented 1 year ago

In my project thread was stuck by NuRaft auto forwarding RPC connect method.

How to reproduce?

1.Thread node n1, n2, n3 and let's assume that n3 is leader. 2.Make network partition separating n3 with n1, n2 by iptables.

  1. wait new leader
  2. send request to n1

PS: you can increase OS option net.ipv4.tcp_syn_retries to increase system max connect timeout.

details see https://github.com/JDRaftKeeper/RaftKeeper/issues/42

Solution

I think we should add connection timeout.

greensky00 commented 1 year ago

There is already a timeout logic that calls when_done with failure, so as to make peer.cxx close the RPC client (hence the socket and connection). Do you mean this logic did not work as explained? https://github.com/eBay/NuRaft/blob/728644d0e1daa5d918c569e8f5a9486c47b64a0d/src/asio_service.cxx#L1028-L1040

JackyWoo commented 1 year ago

As my understanding the timer here is set to wait other thread connect and send data and when time is up it will retry send.

And I find socket connect has no timout:

https://github.com/eBay/NuRaft/blob/728644d0e1daa5d918c569e8f5a9486c47b64a0d/src/asio_service.cxx#L1231-L1248

I am not familar with NuRaft RPC and please correct me if I am wrong.

greensky00 commented 1 year ago

Ok got it, the existing timeout logic does not work for auto forwarding as there is no retry for auto forwarding. Need more time to think about it in detail, to unify it with existing timeout logic. Thanks.