OpenCyphal-Garage / cyraft

Raft algorithm (for pycyphal)

Complete with test _unittest_raft_leader_changes() #11

Closed Parsifal22 closed 1 month ago

Parsifal22 commented 2 months ago

In node.py: expanded the log output for some values to make the code easier to analyze.

In raft_log_replication.py: started writing a test that verifies log replication happens correctly when leadership changes.

In raft_node.py: added a couple of lines to close the node after each test, so that pytest does not crash after the run.

New issue: after the leader changes, the new leader is not aware of the latest log index of the neighboring nodes and begins to rewrite their logs, even though its own log completely matches theirs.

Parsifal22 commented 2 months ago

Test case: there are 3 nodes in the cluster: 41, 42, 43. At first the leader, node 41, receives 3 entries and replicates them to the followers. After that, node 42 becomes the leader, and the logs should be preserved on all nodes.

Problem: each leader has its own variable self._next_index, which holds the current next log index for each follower. If a follower's index matches the leader's own, the leader sends a heartbeat; if not, it keeps sending log entries to the follower until the indices are equal: [screenshot]
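A minimal self-contained sketch of that decision, for readers following along; the names _log and _next_index mirror the description above, but everything else here is illustrative, not cyraft's actual code:

```python
# Sketch of the heartbeat-vs-replicate decision described above.
from dataclasses import dataclass, field


@dataclass
class LeaderSketch:
    _log: list = field(default_factory=list)         # replicated log entries
    _next_index: dict = field(default_factory=dict)  # follower id -> next index to send

    def tick(self, follower_id: int) -> str:
        next_index = self._next_index.get(follower_id, len(self._log))
        if next_index >= len(self._log):
            # Follower is caught up: an empty AppendEntries acts as a heartbeat.
            return "heartbeat"
        # Follower is behind: ship the missing suffix of the log.
        missing = self._log[next_index:]
        self._next_index[follower_id] = len(self._log)
        return f"replicate {len(missing)} entries"


leader = LeaderSketch(_log=["a", "b", "c"], _next_index={42: 3, 43: 1})
print(leader.tick(42))  # heartbeat (follower 42 matches the log length)
print(leader.tick(43))  # replicate 2 entries (follower 43 is behind)
```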

How it looks in the logs: [screenshot]

When leadership changes to node 42, its self._next_index variable, which should hold the log state of the other nodes, is empty, so it starts rewriting their logs from scratch. How it looks in the logs: [screenshot]

Since this prevents it from sending heartbeats, the other followers start putting themselves forward as candidates until node 41 becomes the leader again, since its log is up to date and it can safely send heartbeats.
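For context, the Raft paper's remedy for exactly this situation (§5.3) is that a freshly elected leader reinitializes next_index for every peer optimistically to its own last log index + 1, and walks it back per follower only when that follower rejects an AppendEntries. A minimal sketch of that rule, not cyraft's current code:

```python
# Standard Raft behaviour on winning an election (illustrative names).
def on_election_won(log: list, peer_ids: list) -> dict:
    # Optimistically assume every follower already has the whole log;
    # a follower whose log actually differs will reject AppendEntries,
    # and its next_index is then decremented one step at a time.
    last_log_index = len(log)  # 1-based index of the last entry
    return {peer: last_log_index + 1 for peer in peer_ids}


next_index = on_election_won(log=["a", "b", "c"], peer_ids=[41, 43])
print(next_index)  # {41: 4, 43: 4} -- start just past the new leader's log
```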

Parsifal22 commented 2 months ago

Fixed these small issues:

- Removed unnecessary blank lines between lines of code
- Replaced the call self._change_state(RaftState.FOLLOWER) with self._reset_election_timeout() in cyraft/node.py
- Reimplemented the _unittest_raft_node_heartbeat() test in tests/raft_node.py

maksimdrachov commented 2 months ago

These commits related to CI shouldn't be here :)

Parsifal22 commented 2 months ago

New errors related to log replication have appeared. The error shows up in the _unittest_raft_leader_changes() test in raft_log_replication.py. Case description: we have 4 nodes (41, 42, 43, 44). At the beginning the leader is node 41, which receives and replicates 3 log entries. After that, it disappears from the cluster: [screenshot]

Node 42 then becomes the new leader and begins rewriting the log state for all nodes as described above, except node 41, since it is currently absent from the cluster and its next-index value stays at 1: [screenshot]

Then we append a new log entry from node 42 and overwrite the third entry. When the third entry is overwritten, that value is automatically removed from the list along with all subsequent ones (in our case, only the fourth), as shown in this code: [screenshot]
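This truncation matches Raft's conflict rule: an existing entry that disagrees with an incoming one (same index, different term) is deleted together with everything after it. A self-contained sketch of that rule, with illustrative names and 0-based indexing rather than cyraft's actual log layout:

```python
# Sketch of the conflict/truncation rule from Raft §5.3.
from dataclasses import dataclass


@dataclass
class Entry:
    term: int
    value: str


def append_entries(log: list, prev_log_index: int, entries: list) -> None:
    for offset, incoming in enumerate(entries):
        index = prev_log_index + 1 + offset
        if index < len(log) and log[index].term != incoming.term:
            del log[index:]  # drop the conflicting entry and everything after it
        if index >= len(log):
            log.append(incoming)


log = [Entry(1, "a"), Entry(1, "b"), Entry(1, "c"), Entry(1, "d")]
append_entries(log, prev_log_index=1, entries=[Entry(2, "c'")])
print([(e.term, e.value) for e in log])  # the third and fourth entries are gone, c' appended
```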

The self._next_index array exists so that each node knows the next log index of every other node. Node 41's next index was 1, and after the entries are removed it ends up at -1. This creates a conflict in AppendEntries, since the prev_log_index value we send is self._next_index[remote_node_index] - 1, which gives -2, while the data type of prev_log_index is uint64. In the end, we get this error: [screenshot]
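The arithmetic is easy to reproduce: in plain Python the subtraction simply goes negative, but since the field is declared uint64 in the DSDL definition, the negative value fails at serialization. Illustrative numbers only, and the clamp at the end is one possible mitigation, not necessarily the project's actual fix:

```python
# Reproducing the failure mode with the numbers from the description.
next_index_41 = -1                  # node 41's next index after the truncation
prev_log_index = next_index_41 - 1  # == -2, but the DSDL field is uint64
print(prev_log_index)               # -2: cannot be encoded as uint64

# One defensive option: never let the value leave the uint64 range.
prev_log_index = max(next_index_41 - 1, 0)
print(prev_log_index)               # 0
```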

Parsifal22 commented 2 months ago

In this commit I updated the term.

maksimdrachov commented 1 month ago

Before requesting a review: