copenhagenatomics / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/
MIT License

feedback/input on tuning of parameters #2

Closed freddyrios closed 2 years ago

freddyrios commented 2 years ago

To get the modified example running stably, we modified the following parameters:

Note that these parameter changes were needed for stability before any changes to the original example's write speed, that is, writing an entry once per second.

The thought behind the update to the example was simple: our entries are 1k times bigger than in the original example, so the values were increased by 1k. Does that make any sense? If not, what are good ways to reason about these parameters, and what would be some good values to try for the example?

If one pings between the devices there is sub-millisecond latency. Any ideas what could be behind needing to run with an election timeout range of 300-600 ms?

In the 3 node deployment, the nodes are seeing approx. 700 big entries / second (1k every 1.4 seconds). In reality, this is the flow of messages we expect to see:

So, 26 of the big entries + commands/application state + future flow for fault tolerance should end up well below 70 big entries per second, which is 10 times less than the rate we are seeing in the modified test. Based on this info, how would it make sense to tune the parameters for this smaller data flow (but with a larger number of nodes, which also means the leader would be sending info to many more systems)?
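For reference, the plain arithmetic behind those rates (a back-of-envelope sketch, not dotNext code):

```csharp
using System;

// Back-of-envelope check of the rates quoted above; plain arithmetic only.
double observedRate = 1000.0 / 1.4;   // 1k entries every 1.4 s => ~714 big entries/s observed
double expectedCeiling = 70.0;        // stated upper bound for the expected real flow
Console.WriteLine($"{observedRate:F0} entries/s observed, ~{observedRate / expectedCeiling:F1}x the expected ceiling");
```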

sakno commented 2 years ago

Assuming you use TCP transport:

  1. Review the format of your log entries. 8K for a single log entry is probably too large. Also use an efficient serialization/deserialization mechanism.
  2. Set TransmissionBlockSize to the approximate size of a single log entry.
  3. Set BufferSize to the approximate log entry size.
  4. Use the Incremental compaction mode. In this case, the partition size doesn't matter and has no effect on performance.
  5. Set InitialPartitionSize to recordsPerPartition × the approximate log entry size.
  6. Set MaxConcurrentReads to the expected number of cluster members.
  7. With enough RAM, set the CacheEvictionPolicy property to LogEntryCacheEvictionPolicy.OnSnapshot.
  8. Timeouts: ConnectTimeout < RequestTimeout < LowerElectionTimeout. ConnectTimeout is a small value that should be enough to establish a TCP connection.
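Taken together, the checklist above might look like the following sketch. Only the property names come from the list; the numeric values, string stand-ins for enum values, and the anonymous container objects are illustrative assumptions, not the library's actual configuration types:

```csharp
using System;

const int approxLogEntrySize = 8 * 1024;  // ~8K per entry, as in the modified example
const int recordsPerPartition = 50;

// Persistence-side settings (items 3-7); illustrative values only.
var stateOptions = new
{
    BufferSize = approxLogEntrySize,                                 // item 3
    CompactionMode = "Incremental",                                  // item 4
    InitialPartitionSize = recordsPerPartition * approxLogEntrySize, // item 5
    MaxConcurrentReads = 3,                                          // item 6: expected cluster size
    CacheEvictionPolicy = "LogEntryCacheEvictionPolicy.OnSnapshot",  // item 7
};

// Transport-side settings (items 2 and 8),
// keeping ConnectTimeout < RequestTimeout < LowerElectionTimeout.
var transportOptions = new
{
    TransmissionBlockSize = approxLogEntrySize,            // item 2
    ConnectTimeout = TimeSpan.FromMilliseconds(5),
    RequestTimeout = TimeSpan.FromMilliseconds(150),
    LowerElectionTimeout = TimeSpan.FromMilliseconds(300),
};
```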
freddyrios commented 2 years ago

Some exploration done with the example using the fast speed branch + flushoncommit + single replication x 16 entries + connection timeout of 5 ms (times recorded are the time for 1k entries to be applied to a node):

  1. 8k bytes per entry: 6.7-6.9 seconds
  2. 4k bytes per entry: 4.8-5.0 seconds
  3. 2k bytes per entry: 3.6-3.7 seconds
  4. 32K transmission & buffer: 5.8-6.3 seconds
  5. 32K transmission & buffer + 10 records per partition: 6.4-6.7 seconds
  6. 32K transmission & buffer + 10 records per partition + setting initial partition size: 6.0-6.2 seconds
  7. 32K transmission & buffer + 10 records per partition + setting initial partition size + eviction policy: 6.0-6.2 seconds
  8. 32K transmission & buffer + 50 records per partition + setting initial partition size + eviction policy: 5.0-5.1 seconds
  9. 4k bytes per entry + #7 configs (halving the numbers correspondingly or not made little difference): 3.9-4.0 seconds
  10. 2k bytes per entry + #8 configs (without changes): cluster kept giving failures
  11. 2k bytes per entry + #8 configs, halving the corresponding numbers: 3.4-3.6 seconds

Did not explore MaxConcurrentReads because for this setup it should already be set to 3.

Side question: is this type of CPU usage difference expected on the leader vs. the followers? (screenshot attached)

Regarding the errors in point 9, some of the errors were reported as fail in red with this:

fail: DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer[74028]
      Failed to process request from 169.254.241.37:40800
      System.IO.EndOfStreamException: Attempted to read past the end of the stream.
         at RaftNode.SimplePersistentState.UpdateValue(LogEntry entry) in C:\Users\fredd\source\repos\dotNext\src\examples\RaftNode\SimplePersistentState.cs:line 87
         at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.ApplyAsync(Int32 sessionId, Int64 startIndex, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\MemoryBasedStateMachine.cs:line 456

Regarding the actual entries:

sakno commented 2 years ago

That's normal CPU utilization if the loop doesn't have a delay.

freddyrios commented 2 years ago

Re CPU utilization, I just mean that the 2 followers seem to have 1 core at 100%, while the leader does not. That picture of CPU usage remains if I change the leader; it's always the leader without any super-busy core vs. the followers with a core at 100%.

So another way to phrase the question is whether it's normal that the followers only seem to be using a single core for some significant part of the work.

sakno commented 2 years ago

No, I'm not using CPU affinity. Just two things: await and ThreadPool.QueueUserWorkItem.
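For context, a minimal sketch of those two mechanisms (not the library's actual code): work queued via ThreadPool.QueueUserWorkItem and continuations after await both land on whatever thread-pool thread happens to be free, so no core is pinned explicitly.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Work item runs on some thread-pool thread chosen by the runtime.
ThreadPool.QueueUserWorkItem(static _ =>
    Console.WriteLine($"queued work on pool thread {Environment.CurrentManagedThreadId}"));

await Task.Delay(10); // the continuation resumes on any available pool thread
Console.WriteLine($"after await on thread {Environment.CurrentManagedThreadId}");
```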

freddyrios commented 2 years ago

Additional findings based on shared and individual explorations:

Conclusions: it does not currently seem possible in this version to use parallelization to achieve an order of magnitude more throughput. Having 1 bigger entry instead of multiple smaller entries is thus preferable, which for us means a unified entry with the inputs of all nodes instead of a smaller entry per node. This approach is already the one considered in the pull-based approach recommended by @sakno, so no further action is necessary for this.