copenhagenatomics / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/
MIT License

feedback/input on tuning of parameters #2

Closed freddyrios closed 2 years ago

freddyrios commented 2 years ago

To get the modified example running stably, we modified the following parameters:

Note that these parameter changes were needed for stability before any changes to the original example's write speed, that is, writing an entry once per second.

The thought behind the update to the example was simple: our entries are 1k times bigger than in the original example, so the values were increased by 1k. Does that make any sense? If not, what are good ways to reason about these parameters, and what would be some good values to try for the example?

If one pings between the devices there is sub-millisecond latency. Any ideas what could be behind needing to run with an election timeout range of 300-600 ms?

In the 3 node deployment, the nodes are seeing approx. 700 big entries / second (1k every 1.4 seconds). In reality, this is the flow of messages we expect to see:

So, 26 of the big entries + commands/application state + future flow for fault tolerance should end up well below 70 big entries per second, which is 10 times less than the rate we are seeing in the modified test. Based on this info, how would it make sense to tune the parameters for this smaller data flow (but with a larger number of nodes, which also means the leader would be sending info to many more systems)?
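For reference, the plain arithmetic behind those rates (a back-of-envelope sketch, not dotNext code):

```csharp
using System;

// Back-of-envelope check of the rates quoted above; plain arithmetic only.
double observedRate = 1000.0 / 1.4;   // 1k entries every 1.4 s => ~714 big entries/s observed
double expectedCeiling = 70.0;        // stated upper bound for the expected real flow
Console.WriteLine($"{observedRate:F0} entries/s observed, ~{observedRate / expectedCeiling:F1}x the expected ceiling");
```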

sakno commented 2 years ago

Assuming you use TCP transport:

  1. Review the format of your log entries. 8K for a single log entry is probably too large. Also use an efficient serialization/deserialization mechanism.
  2. Set TransmissionBlockSize to the approximate size of a single log entry.
  3. Set BufferSize to the approximate log entry size.
  4. Use the Incremental compaction mode. In this case, the partition size doesn't matter and has no effect on performance.
  5. Set InitialPartitionSize to recordsPerPartition × the approximate log entry size.
  6. Set MaxConcurrentReads to the expected number of cluster members.
  7. With enough RAM, set the CacheEvictionPolicy property to LogEntryCacheEvictionPolicy.OnSnapshot.
  8. Timeouts: ConnectTimeout < RequestTimeout < LowerElectionTimeout. ConnectTimeout is a small value that should be enough to establish a TCP connection.
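Taken together, the checklist above might look like the following sketch. Only the property names come from the list; the numeric values, string stand-ins for enum values, and the anonymous container objects are illustrative assumptions, not the library's actual configuration types:

```csharp
using System;

const int approxLogEntrySize = 8 * 1024;  // ~8K per entry, as in the modified example
const int recordsPerPartition = 50;

// Persistence-side settings (items 3-7); illustrative values only.
var stateOptions = new
{
    BufferSize = approxLogEntrySize,                                 // item 3
    CompactionMode = "Incremental",                                  // item 4
    InitialPartitionSize = recordsPerPartition * approxLogEntrySize, // item 5
    MaxConcurrentReads = 3,                                          // item 6: expected cluster size
    CacheEvictionPolicy = "LogEntryCacheEvictionPolicy.OnSnapshot",  // item 7
};

// Transport-side settings (items 2 and 8),
// keeping ConnectTimeout < RequestTimeout < LowerElectionTimeout.
var transportOptions = new
{
    TransmissionBlockSize = approxLogEntrySize,            // item 2
    ConnectTimeout = TimeSpan.FromMilliseconds(5),
    RequestTimeout = TimeSpan.FromMilliseconds(150),
    LowerElectionTimeout = TimeSpan.FromMilliseconds(300),
};
```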
freddyrios commented 2 years ago

Some exploration done with the example using the fast speed branch + flushoncommit + single replication x 16 entries + connection timeout of 5 ms (times recorded are the time for 1k entries to be applied to a node):

  1. 8k bytes per entry: 6.7-6.9 seconds
  2. 4k bytes per entry: 4.8-5.0 seconds
  3. 2k bytes per entry: 3.6-3.7 seconds
  4. 32K transmission & buffer: 5.8-6.3 seconds
  5. 32K transmission & buffer + 10 records per partition: 6.4-6.7 seconds
  6. 32K transmission & buffer + 10 records per partition + setting initial partition size: 6.0-6.2 seconds
  7. 32K transmission & buffer + 10 records per partition + setting initial partition size + eviction policy: 6.0-6.2 seconds
  8. 32K transmission & buffer + 50 records per partition + setting initial partition size + eviction policy: 5.0-5.1 seconds
  9. 4k bytes per entry + #7 configs (halving the numbers correspondingly or not made little difference): 3.9-4.0 seconds
  10. 2k bytes per entry + #8 configs (without changes): cluster kept giving failures
  11. 2k bytes per entry + #8 configs, halving the corresponding numbers: 3.4-3.6 seconds

Did not explore MaxConcurrentReads because for this setup it should already be set to 3.

Side question: is this type of CPU usage difference expected on the leader vs. the followers? (screenshot attached)

Regarding the errors in point 9, some of the errors were reported as fail in red with this:

fail: DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer[74028]
      Failed to process request from 169.254.241.37:40800
      System.IO.EndOfStreamException: Attempted to read past the end of the stream.
         at RaftNode.SimplePersistentState.UpdateValue(LogEntry entry) in C:\Users\fredd\source\repos\dotNext\src\examples\RaftNode\SimplePersistentState.cs:line 87
         at DotNext.Net.Cluster.Consensus.Raft.MemoryBasedStateMachine.ApplyAsync(Int32 sessionId, Int64 startIndex, CancellationToken token) in C:\Users\fredd\source\repos\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\MemoryBasedStateMachine.cs:line 456

Regarding the actual entries:

sakno commented 2 years ago

That's normal CPU utilization if the loop doesn't have a delay.

freddyrios commented 2 years ago

Re CPU utilization, I just mean that the 2 followers seem to have 1 core at 100%, while the leader does not. That picture of CPU usage remains if I change the leader; it's always the leader without any super-busy core vs. the followers with a core at 100%.

So another way to phrase the question is whether it's normal that the followers only seem to be using a single core for some significant part of the work.

sakno commented 2 years ago

No, I'm not using CPU affinity. Just two things: await and ThreadPool.QueueUserWorkItem.
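For context, a minimal sketch of those two mechanisms (not the library's actual code): work queued via ThreadPool.QueueUserWorkItem and continuations after await both land on whatever thread-pool thread happens to be free, so no core is pinned explicitly.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Work item runs on some thread-pool thread chosen by the runtime.
ThreadPool.QueueUserWorkItem(static _ =>
    Console.WriteLine($"queued work on pool thread {Environment.CurrentManagedThreadId}"));

await Task.Delay(10); // the continuation resumes on any available pool thread
Console.WriteLine($"after await on thread {Environment.CurrentManagedThreadId}");
```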

freddyrios commented 2 years ago

Additional findings based on shared and individual explorations:

Conclusions: it does not currently seem possible in this version to use parallelization to achieve an order of magnitude more throughput. Having 1 bigger entry instead of multiple smaller entries is thus preferable, which for us means a unified entry with the inputs of all nodes instead of a smaller entry per node. This approach is already the one considered in the pull-based approach recommended by @sakno, so no further action is necessary for this.