eBay / NuRaft

C++ implementation of Raft core logic as a replication library
Apache License 2.0
993 stars 235 forks source link

Misleading append_entries result code #513

Closed amichon-kalray closed 2 months ago

amichon-kalray commented 2 months ago

Hello,

I ran into an issue with NuRaft v2.1.0 where after calling append_entries in auto forwarding mode, my callback is called with accepted=1 and result_code=0 but I cannot find my data.

Reproducing

Here's how I can easily reproduce: Parameters:

I start 5 servers. For tests purposes, I run 2 servers in a network namespace netns1 and 3 servers in another one netns2. The two netns are connected with a veth interface. I use netem to add some delay between the two netns.

To reproduce this bug, the leader must be one of the 2 servers in netns1. I call append_entries with some data on the follower in netns1. Then I bring down the veth interface between the two network namespaces after waiting some time shorter than the configured netem delay.

I'm expecting to see an error in the cmd_result I get in my callback for append_results on the follower. But it's not the case. Then I can not find my data anywhere on any server.

Analysis

I dug into the code and here's my understanding: 1) The follower forwards a client_request to the leader in netns1. 2) Then the leader sends the request to every follower in handle_cli_request. It marks the request as accepted. 3) The leader's leadership expires. It cancels all pending requests in drop_all_pending_commit_elems 4) The leader sets the result code in the response object as CANCELLED. 5) The leader crafts the response message for the client_request to send back to the follower. The result code is not part of the encoded response message, so only accepted is set to 1. 6) The follower receives and decodes the response message. It creates a resp_msg object, sets accepted to 1 but uses the default result_code in the constructor, i.e. cmd_result_code::OK.

Here are the logs of the leader: logs.txt

Proposed solution

Here's a patch that I made that seems to solve the issue (note that I'm using version v2.1.0 so I didn't see the patch d195630b): patch.txt

greensky00 commented 2 months ago

Hi @amichon-kalray thanks for bringing it up.

Just want to make sure that you will not have this issue with the latest commit, as it has been fixed in this commit, right?

We are about to release a new version, so please use it once the tag is created. In eBay, we are using the commit that is much newer than v2.1.0. Thanks.

amichon-kalray commented 2 months ago

Unfortunately, commit d195630 doesn't fix the issue because the result code is not included in the serialized response message.

greensky00 commented 2 months ago

Now I get your point. But modifying the response message format is not feasible, as it will break backward compatibility of existing systems.

I'd rather put the result code (4-byte integer) into ctx area, then it can deliver the result code without changing the response format (hence not breaking the compatibility). Please let me know if you want me to implement this fix. Thanks.