Grokzen / redis-py-cluster

Python cluster client for the official redis cluster. Redis 3.0+.
https://redis-py-cluster.readthedocs.io/
MIT License

rediscluster.exceptions.ClusterError: TTL exhausted #506

Closed hyderaliva closed 1 year ago

hyderaliva commented 1 year ago

Hi,

We have a 6-node cluster (3 masters and their respective slaves) set up in a zig-zag manner across three servers to sustain a single node failure, running Redis v5.0.3. The complete setup is configured manually on Ubuntu host systems. We had an outage in our environment where one of the Redis cluster nodes went down and, as expected, the cluster rebuilt itself by promoting the slave node, but application performance was affected and some of the APIs failed with "ClusterError: TTL exhausted". The data size is around 15 GB distributed among the 3 master nodes, and we are using redis-py-cluster v2.1.0.

What would be the reason for this cluster error?

Thanks

Grokzen commented 1 year ago

@hyderaliva To find out what exact exception you are getting, I highly recommend that you enable exception logging to capture the log messages from all exceptions raised in the execute_command method here: https://github.com/Grokzen/redis-py-cluster/blob/2.1.0/rediscluster/client.py#L628. With that you can narrow down exactly why you are getting the error. There is a log message for each known and expected kind of error, and depending on which one you see, you can then work out why your client fails.
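For reference, a minimal sketch of turning that logging on from application code, assuming the standard `logging` module (the package logs under the `rediscluster` logger name) and placeholder node addresses:

```python
import logging

from rediscluster import RedisCluster

# Send the client's log records (including the exception messages emitted
# from execute_command) to stderr via the standard logging module.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("rediscluster").setLevel(logging.DEBUG)

startup_nodes = [{"host": "10.0.0.1", "port": "7000"}]  # placeholder address
rc = RedisCluster(startup_nodes=startup_nodes, decode_responses=True)
rc.get("some-key")
```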

So what happens in most cases is the following: if your cluster goes down, or rebuilds very quickly by promoting the slave node to a master node, the client instance should figure this out itself when it can't connect to the master node it wants, and it will try to rebuild its view of the cluster by asking the other nodes. If, however, this process is too slow for any reason, you will start to see timeouts in some of your API calls, because the client attempts a command no more than 16 times before throwing this TTL exhausted error. One reason could be that the cluster has not elected a new leader quickly enough for your clients to adjust and rebuild the cluster nodes cache properly.

In most cases, no matter what, you have to sort these kinds of errors out in your own code and decide whether to wait and retry or to fail upwards in your stack, depending on what you build and operate. This client already tries to sort it out a little bit, but it can't retry or wait forever, so it is up to the client user to decide what to do about timeouts and cluster issues within their own code. Right now exception logging is the best option for you, but in almost all cases the root cause of the problem is really within your Redis cluster and how it operates.
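As a rough illustration of handling this at the application level, here is a hypothetical retry wrapper (the function name, attempt count, and backoff values are made up for the example) that waits briefly and retries when ClusterError is raised, then re-raises so the caller can fail upwards:

```python
import time

from rediscluster import RedisCluster
from rediscluster.exceptions import ClusterError


def get_with_retry(client, key, attempts=3, delay=0.5):
    # Retry a read a few times while the cluster finishes its failover,
    # then give up and let the error propagate up the stack.
    for attempt in range(attempts):
        try:
            return client.get(key)
        except ClusterError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay * (attempt + 1))


startup_nodes = [{"host": "10.0.0.1", "port": "7000"}]  # placeholder address
rc = RedisCluster(startup_nodes=startup_nodes, decode_responses=True)
value = get_with_retry(rc, "some-key")
```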

hyderaliva commented 1 year ago

Hi @Grokzen,

Thanks for the input; as suggested, I will try again after enabling exception logging.

As mentioned earlier, we have a 3-master and 3-slave node cluster, and only the 3 master nodes are listed in the application config. During a node failure the respective slave becomes the new master, the cluster rebuilds with the new node, and all slots are covered as expected; however, some of the application endpoints still refer to the previous master and fail.

How do we prevent the application from accessing such failed Redis nodes, and do we need to add all master and slave nodes to the application config?

Thanks,

Grokzen commented 1 year ago

@hyderaliva Aha, it is that problem. One of the main problems with Redis cluster is exactly this: you must point to some startup node to access the cluster. In most cases you point to an IP, but in many environments, if an entire node goes down and you bring a new one up (as in a Docker-based environment), you get a completely new IP, and over time that drifts enough that all nodes end up with new IPs. One simple solution on the devops side is to use DNS names for each node and update each DNS record as nodes shift around. I don't know exactly how you would handle failovers with this, but I think you can start against a Redis cluster as long as you initially talk to at least one node, master or slave. This solution requires a somewhat more complex devops setup to begin with, but it solves the issue for your clients. If you use a cloud solution for a Redis cluster, you usually get this out of the box, as the managed solution keeps track of all nodes for you and maintains updated DNS records for each node that you can talk to.
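For example, a client config that references all six nodes by DNS name could look roughly like this (the hostnames are hypothetical); listing the slaves as well means discovery still works even if a master happens to be down when the client starts:

```python
from rediscluster import RedisCluster

# Startup nodes by DNS name, so replacing a node only requires a DNS
# update rather than a client config change.
startup_nodes = [
    {"host": "redis-node-1.example.internal", "port": "6379"},
    {"host": "redis-node-2.example.internal", "port": "6379"},
    {"host": "redis-node-3.example.internal", "port": "6379"},
    {"host": "redis-node-4.example.internal", "port": "6379"},
    {"host": "redis-node-5.example.internal", "port": "6379"},
    {"host": "redis-node-6.example.internal", "port": "6379"},
]
rc = RedisCluster(startup_nodes=startup_nodes, decode_responses=True)
```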

One other possible solution is to run a separate load balancer like nginx or haproxy with a virtual IP setup: you define all nodes there, round-robin each connection, and use only that endpoint as the cluster discovery mechanism. It would probably reduce the number of failed connections, and if the load balancer software is smart enough it might also rotate out nodes that are no longer reachable and always hand back a working connection to a node in the cluster.
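With that approach the client config shrinks to a single discovery endpoint, something roughly like this (the hostname is hypothetical); the client only uses the startup node to discover the cluster layout and then talks to the real node addresses directly:

```python
from rediscluster import RedisCluster

# Single load-balancer / virtual-IP endpoint used purely for cluster
# discovery; commands are then routed to the actual node addresses.
rc = RedisCluster(
    startup_nodes=[{"host": "redis-cluster-lb.example.internal", "port": "6379"}],
    decode_responses=True,
)
```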

hyderaliva commented 1 year ago

Hi @Grokzen,

Thanks for the input. A change of IP address won't be an issue for us, as we have set up the Redis cluster on host systems where the IP addresses are static, unlike Docker.

So referencing only the master nodes will be sufficient for the application to discover the entire Redis cluster, right?

Thanks