Closed: LarsWithCA closed this issue 6 months ago.
...the timing (in ms) of RaftCluster.ReplicateAsync
It is a measurement of latency, not throughput. By calling ForceReplicationAsync on every write, you choose better latency at the cost of lower throughput. With ForceReplicationAsync you don't need to wait for the next heartbeat round, which is why latency improves. However, if you want good throughput, you need to accumulate as many uncommitted log entries as possible and wait for replication; this can be done without an explicit call to ForceReplicationAsync. If you have just one writer, prefer latency over throughput.
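A minimal sketch of the two strategies, assuming ReplicateAsync and ForceReplicationAsync are available on the IRaftCluster abstraction as referenced in this thread (exact return types, constraints, and the hypothetical wrapper names WriteStrategies, WriteWithLowLatencyAsync, and FlushAsync are assumptions, not the library's API):

```csharp
using System.Threading;
using System.Threading.Tasks;
using DotNext.Net.Cluster.Consensus.Raft;

static class WriteStrategies
{
    // Latency-oriented path: replicate the entry immediately instead of waiting for
    // the next heartbeat round; resumes once replication has been forced for it.
    public static async Task WriteWithLowLatencyAsync<TEntry>(
        IRaftCluster cluster, TEntry entry, CancellationToken token)
        where TEntry : notnull, IRaftLogEntry
        => await cluster.ReplicateAsync(entry, token);

    // Throughput-oriented path: let uncommitted entries accumulate in the WAL and rely
    // on the regular heartbeat to ship them in batches; call ForceReplicationAsync only
    // when an immediate flush is genuinely needed.
    public static async Task FlushAsync(IRaftCluster cluster, CancellationToken token)
        => await cluster.ForceReplicationAsync(token);
}
```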
Hi @sakno
I can see that we messed up the terms a bit 😊
From the wording in the changelog:
Now the leader waits for replication from majority of nodes instead of all nodes.
we were hoping to also see an improvement in latency (in case of an inaccessible cluster minority). Are you saying that is not the case? Or are you saying we should just call ForceReplicationAsync instead?
We have a single writer in our system (and we do prefer latency over throughput).
we were hoping to also see an improvement in latency
Not exactly, it's an improvement in throughput from the client's perspective. However, the underlying state machine commits changes earlier. A small recap:
- ForceReplicationAsync resumes when all nodes have replicated (or some of them are detected as unavailable)
- AppendAsync from the WAL is invoked before ForceReplication completes, because only a majority is needed to mark log entries as committed

If you want to improve latency in the presence of unavailable nodes, you can use the following technique:
- Reduce the connect timeout
Our ConnectTimeout is only 10 ms. Did you mean another timeout? We have RequestTimeout set to 1 second.
The leader exposes a broadcast-time counter measured in milliseconds. You can measure it with dotnet-counters. What's the average value of this counter with and without unavailable nodes?
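For reference, a typical way to watch these counters with dotnet-counters (the meter group names are the ones mentioned later in this thread; replace <PID> with the process id of the leader node, and note this assumes the counters are exported under those group names in your version):

```
dotnet-counters monitor --process-id <PID> --counters DotNext.Net.Cluster.Consensus.Raft.Server,DotNext.Net.Cluster.Consensus.Raft.Client
```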
Average of broadcast-time before/after disconnecting a node:
(Linux ARM + .NET8 + DotNext.Net.Cluster 5.4.0)
How many nodes are disconnected? 920 ms with 1 disconnected node?
A single disconnected node (out of a cluster of 6 in total).
Very suspicious: 920 ms - 5 ms doesn't give 10 ms for the connection timeout.
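Spelled out, the arithmetic behind that suspicion, using the values reported above (illustrative only):

```csharp
using System;

// Observed broadcast-time: roughly 5 ms with all nodes reachable,
// roughly 920 ms with one node disconnected.
var healthy = TimeSpan.FromMilliseconds(5);
var degraded = TimeSpan.FromMilliseconds(920);

// With a real 10 ms ConnectTimeout, the unreachable member should add at most ~10 ms,
// so the degraded value should stay near 10 ms, not 920 ms. The unexplained ~915 ms
// is instead close to the 1000 ms RequestTimeout, which hints that the connect
// timeout is not being applied (exactly what is confirmed later in this thread).
var unexplained = degraded - healthy;
Console.WriteLine($"Unexplained delay: {unexplained.TotalMilliseconds} ms");
```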
One more way to investigate the issue: there is a response-time counter that shows the response time for each node individually. Could you share this value for each node? Every counter has a tag indicating the IP address of the node (you can replace the IP addresses with anything else; it's only needed to distinguish the values).
The group of metrics is DotNext.Net.Cluster.Consensus.Raft.Client, not DotNext.Net.Cluster.Consensus.Raft.Server.
Example before disconnecting 202:
[dotnext.raft.client.address=232;dotnext.raft.client.message=AppendEntries;Percentile=50] | 6.375
[dotnext.raft.client.address=202;dotnext.raft.client.message=AppendEntries;Percentile=50] | 1.296875
[dotnext.raft.client.address=149;dotnext.raft.client.message=AppendEntries;Percentile=50] | 10.375
[dotnext.raft.client.address=154;dotnext.raft.client.message=AppendEntries;Percentile=50] | 7.671875
[dotnext.raft.client.address=31;dotnext.raft.client.message=AppendEntries;Percentile=50] | 6.765625
Example after disconnecting 202:
[dotnext.raft.client.address=232;dotnext.raft.client.message=AppendEntries;Percentile=50] | 1.41796875
[dotnext.raft.client.address=202;dotnext.raft.client.message=AppendEntries;Percentile=50] | 996
[dotnext.raft.client.address=149;dotnext.raft.client.message=AppendEntries;Percentile=50] | 1.990234375
[dotnext.raft.client.address=154;dotnext.raft.client.message=AppendEntries;Percentile=50] | 2.3828125
[dotnext.raft.client.address=31;dotnext.raft.client.message=AppendEntries;Percentile=50] | 7.375
And approx. 1.5 minutes later the message changes to InstallSnapshot for 202:
[dotnext.raft.client.address=232;dotnext.raft.client.message=AppendEntries;Percentile=50] | 7.3125
[dotnext.raft.client.address=202;dotnext.raft.client.message=InstallSnapshot;Percentile=50] | 1000
[dotnext.raft.client.address=149;dotnext.raft.client.message=AppendEntries;Percentile=50] | 4.171875
[dotnext.raft.client.address=154;dotnext.raft.client.message=AppendEntries;Percentile=50] | 4.484375
[dotnext.raft.client.address=31;dotnext.raft.client.message=AppendEntries;Percentile=50] | 4.6328125
~It seems like ConnectTimeout is not equal to 10 ms; it's equal to 1000 ms (the same as RequestTimeout). How do you set ConnectTimeout in the code?~
OMG, I found the root cause. It's a trivial one-liner fix.
Awesome! :)
Could you check the develop branch? It contains both fixes: the incorrect usage of ConnectTimeout and the accidental snapshot installation.
Also, the upcoming release introduces a new WaitForLeadershipAsync method that waits for the local node to be elected as the leader of the cluster and returns a leadership token. It is very convenient if your code currently relies on the LeaderChanged event to determine which node allows writes.
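A rough usage sketch, assuming the method takes a timeout plus a CancellationToken and returns the leadership token as a CancellationToken (the exact signature may differ in the released API; LeadershipExample and RunAsLeaderAsync are hypothetical names):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using DotNext.Net.Cluster.Consensus.Raft;

static class LeadershipExample
{
    // Hypothetical helper: waits until this node becomes leader, then performs writes.
    // The parameter list of WaitForLeadershipAsync is assumed here and may differ.
    public static async Task RunAsLeaderAsync(IRaftCluster cluster, CancellationToken stoppingToken)
    {
        // The returned token is expected to be cancelled when leadership is lost.
        CancellationToken leadershipToken =
            await cluster.WaitForLeadershipAsync(TimeSpan.FromSeconds(30), stoppingToken);

        // Combine it with the host's shutdown token so writes stop if the node is deposed.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(leadershipToken, stoppingToken);
        // ... perform writes with cts.Token ...
    }
}
```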
The issue seems to be fixed, i.e. the timing of RaftCluster.ReplicateAsync immediately goes back to something very low (after disconnecting one node).
This is great, thanks a lot! I'll keep my cluster running and get back to you tomorrow regarding the accidental snapshot installation.
@sakno the "snapshot installation" messages are also gone 👍
I'll prepare a new release today
Release 5.5.0 has been published.
This plot shows the timing (in ms) of RaftCluster.ReplicateAsync - at the vertical green line, 1 node is disconnected (out of a cluster of 6 nodes in total): (Linux ARM + .NET6 + DotNext.Net.Cluster 4.14.1)
From the changelog of DotNext.Net.Cluster 4.15.0:
Now the leader waits for replication from majority of nodes instead of all nodes.
This made us hope that we would no longer see these kinds of longer timings in case of an inaccessible cluster minority. However, we see a pretty similar plot - at the green line, 1 node is disconnected (out of a cluster of 6 nodes): (Linux ARM + .NET8 + DotNext.Net.Cluster 5.3.0)
Did we have wrong expectations, or are we doing something wrong?