dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/
MIT License

Not seeing expected improvement in throughput of RaftCluster.ReplicateAsync method when cluster minority is inaccessible #233

Closed · LarsWithCA closed this issue 2 months ago

LarsWithCA commented 2 months ago

This plot shows the timing (in ms) of RaftCluster.ReplicateAsync; at the vertical green line, 1 node out of a cluster of 6 is disconnected (Linux ARM + .NET 6 + DotNext.Net.Cluster 4.14.1):

[plot of RaftCluster.ReplicateAsync timings; one node disconnected at the green line]

From the change log of DotNext.Net.Cluster 4.15.0:

Raft performance: improved throughput of IRaftCluster.ReplicateAsync method when cluster minority is not accessible (faulty node). Now the leader waits for replication from majority of nodes instead of all nodes

This made us hope that we would no longer see these kinds of longer timings in the case of an inaccessible cluster minority. However, we see a pretty similar plot; at the green line, 1 node out of a cluster of 6 is disconnected (Linux ARM + .NET 8 + DotNext.Net.Cluster 5.3.0):

[plot of RaftCluster.ReplicateAsync timings; one node disconnected at the green line]

Did we have wrong expectations, or are we doing something wrong?

sakno commented 2 months ago

(in ms) of RaftCluster.ReplicateAsync

It is a measurement of latency, not throughput. By calling ForceReplicationAsync on every write, you choose better latency at the cost of lower throughput. With ForceReplicationAsync you don't need to wait for the next heartbeat round; that's why latency improves. However, if you want good throughput, you need to accumulate as many uncommitted log entries as possible and wait for replication. This can be done without an explicit call to ForceReplicationAsync. If you have just one writer, prefer latency over throughput.
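
A minimal sketch of the two approaches, assuming the IRaftCluster members mentioned in this thread (ReplicateAsync, ForceReplicationAsync, and the WAL's AppendAsync, assumed here to be reachable through the cluster's AuditTrail property); the generic entry parameter and the class name are illustrative only:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using DotNext.Net.Cluster.Consensus.Raft;

internal static class ReplicationSketch
{
    // Latency-first (single writer): replicate every write immediately and
    // wait for the majority instead of waiting for the next heartbeat round.
    public static async Task WriteAsync<TEntry>(IRaftCluster cluster, TEntry entry, CancellationToken token)
        where TEntry : notnull, IRaftLogEntry
        => await cluster.ReplicateAsync(entry, token);

    // Throughput-first: accumulate uncommitted entries in the local WAL and
    // trigger a single replication round for the whole batch (or simply let
    // the regular heartbeat round pick them up).
    public static async Task WriteBatchAsync<TEntry>(IRaftCluster cluster, IEnumerable<TEntry> batch, CancellationToken token)
        where TEntry : notnull, IRaftLogEntry
    {
        foreach (var entry in batch)
            await cluster.AuditTrail.AppendAsync(entry, token); // local append only, not yet committed

        await cluster.ForceReplicationAsync(token); // one replication round for the whole batch
    }
}
```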

LarsWithCA commented 2 months ago

Hi @sakno

I can see that we messed up the terms a bit 😊

From the wording in the changelog:

Now the leader waits for replication from majority of nodes instead of all nodes.

we were hoping to also see an improvement in latency (in the case of an inaccessible cluster minority). Are you saying that is not the case? Or are you saying we should just call ForceReplicationAsync instead?

We have a single writer in our system (and we do prefer latency over throughput).

sakno commented 2 months ago

we were hoping to also see an improvement on latency

Not exactly; it's an improvement in throughput from the client's perspective. However, the underlying state machine commits changes earlier. A small recap:

  1. ForceReplicationAsync resumes when all nodes have replicated (or some of them are detected as unavailable)
  2. AppendAsync from the WAL completes before ForceReplicationAsync does, because only a majority is needed to mark log entries as committed

sakno commented 2 months ago

If you want to improve latency in the presence of unavailable nodes, you can use the following techniques (a configuration sketch follows the list):

  1. Reduce the connect timeout
  2. Enable automatic failure detection so the leader can remove unhealthy nodes from the list of cluster members
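
A configuration sketch for these two techniques, assuming the programmatic TCP transport and its RaftCluster.TcpConfiguration type; the endpoint and timeout values are placeholders, and the failure-detector settings are deliberately left to the library documentation:

```csharp
using System;
using System.Net;
using System.Threading;
using DotNext.Net.Cluster.Consensus.Raft;

// Placeholder endpoint for the local node (port chosen arbitrarily).
var config = new RaftCluster.TcpConfiguration(new IPEndPoint(IPAddress.Any, 3262))
{
    // Technique 1: fail fast on unreachable peers so an inaccessible minority
    // does not stall the heartbeat round for the full request timeout.
    ConnectTimeout = TimeSpan.FromMilliseconds(10),

    // Upper bound for a single request/response round-trip.
    RequestTimeout = TimeSpan.FromSeconds(1),
};

// Technique 2 (automatic failure detection) is configured on the same object;
// see the dotNext documentation for the failure-detector options.

var cluster = new RaftCluster(config);
await cluster.StartAsync(CancellationToken.None);
```
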
LarsWithCA commented 2 months ago

Reduce connect timeout

Our ConnectTimeout is only 10ms. Did you mean another timeout? We have 'RequestTimeout' set to 1 second.

sakno commented 2 months ago

The leader exposes a broadcast-time counter in milliseconds. You can measure it with dotnet-counters. What's the average value of this counter with and without unavailable nodes?
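
For reference, .NET's dotnet-counters tool can attach to the running process and stream these values, e.g. `dotnet-counters monitor --process-id <pid> --counters DotNext.Net.Cluster.Consensus.Raft.Server`. The meter names here are the metric groups mentioned later in this thread; which of the two groups publishes broadcast-time is an assumption, so check both the Server and Client groups.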

LarsWithCA commented 2 months ago

Average of broadcast-time before/after disconnecting a node (Linux ARM + .NET 8 + DotNext.Net.Cluster 5.4.0):

[counter screenshot; roughly 5 ms before vs. roughly 920 ms after, per the follow-up discussion below]

sakno commented 2 months ago

How many nodes are disconnected? 920 ms with 1 disconnected node?

LarsWithCA commented 2 months ago

A single disconnected node (out of a cluster of 6 in total).

sakno commented 2 months ago

Very suspicious; the jump from 5 ms to 920 ms is not consistent with a 10 ms connection timeout.

sakno commented 2 months ago

One more way to investigate the issue: there is a response-time counter that shows the response time for each node individually. Could you share this value for each node? Every counter has a tag indicating the IP address of the node (you can replace IP addresses with anything else; it's needed just to distinguish the values).

The relevant group of metrics is DotNext.Net.Cluster.Consensus.Raft.Client, not DotNext.Net.Cluster.Consensus.Raft.Server.

LarsWithCA commented 2 months ago

Example before disconnecting 202:

[dotnext.raft.client.address=232;dotnext.raft.client.message=AppendEntries;Percentile=50] | 6.375
[dotnext.raft.client.address=202;dotnext.raft.client.message=AppendEntries;Percentile=50] | 1.296875
[dotnext.raft.client.address=149;dotnext.raft.client.message=AppendEntries;Percentile=50] | 10.375
[dotnext.raft.client.address=154;dotnext.raft.client.message=AppendEntries;Percentile=50] | 7.671875
[dotnext.raft.client.address=31;dotnext.raft.client.message=AppendEntries;Percentile=50] | 6.765625

Example after disconnecting 202:

[dotnext.raft.client.address=232;dotnext.raft.client.message=AppendEntries;Percentile=50] | 1.41796875
[dotnext.raft.client.address=202;dotnext.raft.client.message=AppendEntries;Percentile=50] | 996
[dotnext.raft.client.address=149;dotnext.raft.client.message=AppendEntries;Percentile=50] | 1.990234375
[dotnext.raft.client.address=154;dotnext.raft.client.message=AppendEntries;Percentile=50] | 2.3828125
[dotnext.raft.client.address=31;dotnext.raft.client.message=AppendEntries;Percentile=50] | 7.375

And approximately 1.5 minutes later the message for 202 changes to InstallSnapshot:

[dotnext.raft.client.address=232;dotnext.raft.client.message=AppendEntries;Percentile=50] | 7.3125
[dotnext.raft.client.address=202;dotnext.raft.client.message=InstallSnapshot;Percentile=50] | 1000
[dotnext.raft.client.address=149;dotnext.raft.client.message=AppendEntries;Percentile=50] | 4.171875
[dotnext.raft.client.address=154;dotnext.raft.client.message=AppendEntries;Percentile=50] | 4.484375
[dotnext.raft.client.address=31;dotnext.raft.client.message=AppendEntries;Percentile=50] | 4.6328125

sakno commented 2 months ago

~It seems like ConnectTimeout is not equal to 10ms, it's equal to 1000ms (as for RequestTimeout). How do you set ConnectTimeout in the code?~

Omg, I found the root cause. It's a trivial, one-line fix.

LarsWithCA commented 2 months ago

Awesome! :)

sakno commented 2 months ago

Could you check the develop branch? It contains both fixes: the incorrect usage of ConnectTimeout and the accidental snapshot installation.

sakno commented 2 months ago

Also, the upcoming release introduces a new WaitForLeadershipAsync method that waits for the local node to be elected as the leader of the cluster and returns a leadership token. It is very convenient if your code relies on the LeaderChanged event to determine which node allows writes.
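
A sketch of that pattern; the exact signature of WaitForLeadershipAsync is assumed from the description above (it returns a leadership token once the local node becomes leader), and PerformWritesAsync stands in for application logic:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using DotNext.Net.Cluster.Consensus.Raft;

internal static class SingleWriterLoop
{
    public static async Task RunAsync(IRaftCluster cluster, CancellationToken stop)
    {
        while (!stop.IsCancellationRequested)
        {
            // Assumed shape: wait until this node is elected leader; the returned
            // token is expected to be cancelled when leadership is lost.
            CancellationToken leadership = await cluster.WaitForLeadershipAsync(Timeout.InfiniteTimeSpan, stop);

            // Stop writing as soon as either leadership is lost or the app shuts down.
            using var writerScope = CancellationTokenSource.CreateLinkedTokenSource(leadership, stop);
            await PerformWritesAsync(cluster, writerScope.Token); // hypothetical write loop
        }
    }

    private static Task PerformWritesAsync(IRaftCluster cluster, CancellationToken token)
        => Task.CompletedTask; // placeholder for the application's writes
}
```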

LarsWithCA commented 2 months ago

The issue seems to be fixed, i.e. the timing of RaftCluster.ReplicateAsync immediately drops back to something very low after disconnecting one node:

[plot of RaftCluster.ReplicateAsync timings after the fix]

This is great, thanks a lot! I'll keep my cluster running and get back to you tomorrow regarding the accidental snapshot installation.

LarsWithCA commented 2 months ago

@sakno the "snapshot installation" messages are also gone 👍

sakno commented 2 months ago

I'll prepare a new release today

sakno commented 1 month ago

Release 5.5.0 has been published.