Alachisoft / NCache

NCache: Highly Scalable Distributed Cache for .NET
http://www.alachisoft.com
Apache License 2.0

[Question] Best practices for dynamic cluster scaling and load balancing #34

Open killemth opened 5 years ago

killemth commented 5 years ago

In a very high transaction volume environment, are there best practices for dynamic scaling (adding or removing cluster nodes) of the NCache cluster based on requests per second, cache event latency, or a combination of both?

We have had instances where having 10 large static nodes still caused performance issues, mostly because highly accessed items were all on the same node. We are not sure what the remedy for this issue would be, if any.

Aside from that, it would be ideal if there was a way to dynamically scale out nodes while maintaining full uptime (without incurring state transfer errors or disconnects on cache operations). In past versions, adding nodes always caused application issues: clients would disconnect and cache operations would fail.

Brad-NCache commented 5 years ago

Hi Killemth,

Thank you for your questions. I will answer them in the order in which they were asked:

Question 1: In a very high transaction volume environment, are there best practices for dynamic scaling (adding or removing cluster nodes) of the NCache cluster based on requests per second, cache event latency, or a combination of both?

Answer:

When removing nodes during periods of high traffic combined with high network latency, it is best to remove one node at a time and allow NCache to re-adjust through its self-healing cluster and data distribution mechanisms. NCache also comes with a Graceful Node Stop feature that gracefully ceases client activity on the node being stopped. This ensures that the surviving nodes are informed of the stop, which keeps the recovery logic very stable. The details of the Graceful Stop feature are given in the following link:

http://www.alachisoft.com/resources/docs/ncache/admin-guide/stop-cache.html

Question 2: We have had instances where having 10 large static nodes still caused performance issues, mostly because highly accessed items were all on the same node. We are not sure what the remedy for this issue would be, if any.

Answer:

Please note that it is recommended to keep a 4:1 ratio between the number of cache clients and the number of cache servers in a cluster for the NCache Partitioned and Partition-Replica (POR) cache topologies, as these allow linear scalability for reads as well as for writes.

For the NCache Replicated topology, however, it is recommended to keep at most 4 cache servers in the cluster, because this topology uses the Sync replication option and having more NCache servers in the cluster may have a negative impact on write performance. More information on the different NCache cluster topologies and their features can be found at the following link:

http://www.alachisoft.com/ncache/caching-topology.html

In the Partitioned and POR topologies, an intelligent distribution algorithm spreads the cache data among the NCache servers currently in the cache cluster. NCache uses hash maps for the cache keys that allow clients to map a given string cache key to an NCache server in the cluster, and the resulting data distribution between the cache servers is almost even.

However, it is possible that, by coincidence, your most frequently accessed data hashes to one server, so that you see more requests on that server even though the data as a whole is distributed evenly among the nodes. Moreover, if you see that the data is not evenly distributed (by comparing the cache count on each server), you can also review the NCache Auto Data Load Balancing feature, the details of which are given in the following link:
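To make the key-to-server mapping more concrete, here is a minimal Python sketch of hash-based distribution. It is not NCache's actual algorithm: the MD5 hash, the bucket count of 1000, the round-robin bucket assignment, and the server names are all assumptions chosen for illustration. It only shows why hashing string keys tends to give each server a similar number of items.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 1000  # assumed bucket count, for illustration only


def bucket_for(key: str) -> int:
    """Hash a string cache key to a bucket number (MD5 is an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def build_bucket_map(servers):
    """Assign buckets to servers round-robin (a hypothetical assignment rule)."""
    return {bucket: servers[bucket % len(servers)] for bucket in range(NUM_BUCKETS)}


def server_for(key: str, bucket_map) -> str:
    """A client can locate the owning server from the key alone."""
    return bucket_map[bucket_for(key)]


servers = ["server-1", "server-2", "server-3", "server-4"]  # hypothetical names
bucket_map = build_bucket_map(servers)

# Count how many of 100,000 synthetic keys land on each server.
counts = Counter(server_for(f"item:{i}", bucket_map) for i in range(100_000))
for name in servers:
    print(name, counts[name])
```

Each server should end up with roughly a quarter of the keys. Note that this says nothing about how often each key is requested, which is a separate question from how many keys a server holds.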

http://www.alachisoft.com/resources/docs/ncache/admin-guide/manage-data-load-balancing.html
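The difference between even data distribution and even request load can be illustrated with a small Python sketch. Everything here is hypothetical: the hash function, the server names, and the synthetic access counts are assumptions, and the hot keys are deliberately picked so that they land on the same server. The point is only that per-server item counts can look balanced while one server receives most of the requests.

```python
import hashlib
from collections import Counter


def server_for(key, servers):
    """Map a key to a server by hashing (MD5 and modulo are assumptions)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def load_report(access_counts, servers):
    """Return (items per server, requests per server) for a key -> hits mapping."""
    items = Counter({s: 0 for s in servers})
    requests = Counter({s: 0 for s in servers})
    for key, hits in access_counts.items():
        owner = server_for(key, servers)
        items[owner] += 1
        requests[owner] += hits
    return items, requests


servers = ["server-1", "server-2", "server-3", "server-4"]  # hypothetical names

# 10,000 keys that are each read once...
access_counts = {f"item:{i}": 1 for i in range(10_000)}

# ...except for five hot keys that all happen to hash to server-1.
hot_keys = [k for k in access_counts if server_for(k, servers) == "server-1"][:5]
for key in hot_keys:
    access_counts[key] = 20_000

items, requests = load_report(access_counts, servers)
for name in servers:
    print(name, "items:", items[name], "requests:", requests[name])
```

In this synthetic workload the item counts stay close to each other, so data load balancing would have little to move, yet server-1 serves the large majority of requests. Comparing both numbers per server helps tell a data skew apart from a request skew.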

In a Replicated cache, identical data is replicated on all the NCache servers, but NCache clients are balanced among them. This means that a given client application is connected to only one of the NCache servers. It is possible that the client applications connected to one NCache server are generating more request load than the other servers see, or that slightly more clients are connected to one server than to the others. You can review the NCache counters, verify the number of clients connected to each server, and observe whether more client requests are sent to one server than to the others.

You can also send an email to support@alachisoft.com to review this in more detail through a working session.

Question 3: Aside from that, it would be ideal if there was a way to dynamically scale out nodes, maintaining full uptime (without incurring state transfer errors or disconnects on cache operations) when scaling out. In past versions, there would always be application issues when adding nodes and it would cause clients to disconnect and cache operations to fail.

Answer:

NCache allows nodes to be added to and removed from a cluster dynamically, with 100% response and without any downtime or data loss (except in the case of a Partitioned cache), and this is common to all NCache cluster topologies. Any time a node is added or removed, there is a period in which data is re-distributed among the nodes in the current cluster configuration. This may take some time to complete, depending on the data size and the number of NCache servers in the cluster. This process is called State Transfer.

State Transfer is done in the background, which means client requests are always given the highest priority. Data is re-distributed using buckets as the units of transfer, so while the process is running there is a remote possibility of getting a "state transfer in progress" exception if you request data that is in a bucket in transit between NCache servers. Simply retrying the same request after getting this exception should work without any problems.
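The retry suggested above could look something like the following Python sketch. It does not use the NCache client API: `StateTransferInProgressError` is a hypothetical stand-in for whatever exception the client library raises, `fetch` is any callable that reads from the cache, and the attempt count and delay are arbitrary assumptions.

```python
import time


class StateTransferInProgressError(Exception):
    """Hypothetical stand-in for the client's 'state transfer in progress' error."""


def get_with_retry(fetch, key, attempts=3, delay_seconds=0.05):
    """Call fetch(key), retrying a few times if the bucket is in transit."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(key)
        except StateTransferInProgressError:
            if attempt == attempts:
                raise  # give up and let the caller handle it
            time.sleep(delay_seconds * attempt)  # short, growing pause


# Demo with a fake fetch that fails once, as if the bucket were being moved.
calls = {"count": 0}


def flaky_fetch(key):
    calls["count"] += 1
    if calls["count"] == 1:
        raise StateTransferInProgressError(key)
    return f"value-for-{key}"


result = get_with_retry(flaky_fetch, "item:42")
print(result, "after", calls["count"], "calls")
```

Keeping the number of attempts small and re-raising on the last one means that a genuine outage still surfaces to the application instead of being hidden behind endless retries.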

For more information on how NCache deals with dynamic node addition and removal as well as failover scenarios, please follow the link given below:

http://www.alachisoft.com/ncache/dynamic-clustering.html