Grokzen / redis-py-cluster

Python cluster client for the official redis cluster. Redis 3.0+.
https://redis-py-cluster.readthedocs.io/
MIT License

TTL timeout exception during failover scenario #385

Closed: ofhellsfire closed this issue 4 years ago

ofhellsfire commented 4 years ago

Hey there,

A TTL timeout exception occurs during a failover scenario. It looks like the client keeps trying to connect to the failed master node.

This is possibly a duplicate of https://github.com/Grokzen/redis-py-cluster/issues/274. Please decide whether to close this one or reopen that one; I've put the same comment on that issue.

Scenario

Env

Steps to reproduce

  1. Start a simple Python script (see the archive in the attachment; a rough sketch of its logic is shown after these steps) which connects to the Redis Cluster and sets and gets random messages infinitely: python rediscluster_failover_test.py
  2. Wait a bit (15-30 seconds)
  3. Stop one of the masters in the Redis Cluster via docker-compose -f docker-compose-rediscluster.yaml stop redis-node-<master id>
  4. Make sure the Redis Cluster promotes one of the slaves to master and the cluster status is OK
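
Roughly, the script does the following (simplified sketch; the exact code is in the attached archive, and the startup node address here is just an example):

import random
import string
import uuid

from rediscluster import RedisCluster

# Adjust the startup node to your own cluster
startup_nodes = [{"host": "127.0.0.1", "port": "7000"}]
rc = RedisCluster(startup_nodes=startup_nodes, decode_responses=True)

while True:
    key = str(uuid.uuid4())
    msg = "".join(random.choices(string.ascii_lowercase, k=random.randint(40, 120)))
    rc.set(key, msg)
    print("Set key: {} with value {}".format(key, msg))
    print("Get key: {} :: Value = {}".format(key, rc.get(key)))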

Expected result

Script keeps going

Actual result

The script gets stuck and an exception is raised after a while: rediscluster.exceptions.ClusterError: TTL exhausted. Internally it catches ConnectionError 16 times and then raises ClusterError: TTL exhausted.

Additional Notes

The docker-compose Redis Cluster setup can be found in the attached archive as well.

Script sample output:

$ python rediscluster_failover_test.py                                 
Set key: 810b9260-7c38-42d9-845f-88f15f21750b with value rpzzetbihsgksyqycoznaqiqkkpurtoqrlxlfmofzwkpfyrzrltzhkkdvjxeoxtsc
Get key: 810b9260-7c38-42d9-845f-88f15f21750b :: Value = rpzzetbihsgksyqycoznaqiqkkpurtoqrlxlfmofzwkpfyrzrltzhkkdvjxeoxtsc
Set key: c07fb565-40c1-485c-8fdd-c8236bf21724 with value ztckhwermcsxkcycoyfzoawotimvitrjayolfuodeqmmd
Get key: c07fb565-40c1-485c-8fdd-c8236bf21724 :: Value = ztckhwermcsxkcycoyfzoawotimvitrjayolfuodeqmmd
Set key: 3ca32b14-cc0d-4b25-9ad6-00e7777f15c5 with value wshjjimrfsfuzdclnxdrboifpuixcicssxxmafsllkfignmjaxzbjzurtvdzsdrqeiqviaqiqiahpagrignxjglsndqlxkccphuzkblszyephyzp
Get key: 3ca32b14-cc0d-4b25-9ad6-00e7777f15c5 :: Value = wshjjimrfsfuzdclnxdrboifpuixcicssxxmafsllkfignmjaxzbjzurtvdzsdrqeiqviaqiqiahpagrignxjglsndqlxkccphuzkblszyephyzp
Traceback (most recent call last):
  File "rediscluster_failover_test.py", line 25, in <module>
    rc.set(key, msg)
  File "/home/venv/lib/python3.6/site-packages/redis/client.py", line 1801, in set
    return self.execute_command('SET', *pieces)
  File "/home/venv/lib/python3.6/site-packages/rediscluster/client.py", line 531, in execute_command
    return self._execute_command(*args, **kwargs)
  File "/home/venv/lib/python3.6/site-packages/rediscluster/client.py", line 652, in _execute_command
    raise ClusterError('TTL exhausted.')
rediscluster.exceptions.ClusterError: TTL exhausted.

Running the same scenario in the same environment with the same simple script logic, but with another library (ioredis with Node.js), works like a charm.

redis_failover_issue.zip

ofhellsfire commented 4 years ago

Update: I was able to reproduce it on a non-dockerized, bare-metal environment.

aayush-plivo commented 4 years ago

Performed the same test with v2.0.0 along with an AWS ElastiCache failover. While time to recovery from the failover on ElastiCache was well within a minute, it took redis-py-cluster around 3 minutes to resume writes, although it did recover completely afterwards.

@Grokzen Any thoughts on reducing this time to recovery?

alseambusher commented 4 years ago

We see the same behavior. Any idea why it takes longer to recover @Grokzen ? Is there a way to fix this?

Grokzen commented 4 years ago

So what I see in the code is that there was an issue reported in #362, where the conclusion was that TimeoutError and ConnectionError had to be split into two different code paths, both to solve the issue reported there and to fix a logical issue where the two different errors should have two different behaviors: c62012c7bfaf3ffe4f199de39085fe355fcc5737

After that commit, and after debugging a bit in the codebase and with your script @ofhellsfire, what I found is that when you take down a master node, ConnectionError exceptions are raised from below, and after the commit mentioned above a connection error does nothing more than retry with the same setup, only to run into the ConnectionError exception again on the next loop iteration. That is why you see the TTL exhausted error being raised in the end: the loop just continues for 16 attempts and then fails out with the error you posted above. I also noticed that in some instances a ClusterDown error is raised if your client happens to make a query at just the right time, when the overall quorum of the cluster is in the Down/Failed state. In that case the exception code path actually rebuilds the cluster client's internal state, and your next attempt will probably be successful if your Redis Cluster has reached a new quorum, elected a new master for that shard, and set itself back into the OK cluster state.

One possible solution would be to add a kind of reinitialize counter step on ConnectionErrors, while keeping the retry mechanism in place for a node that might be unreachable for whatever reason for a short period of time.

The following example fix will probably solve the problem: now there can be up to 5 connection errors in a row for a single command, over a span of 0.5 seconds, before the client attempts to rebuild the slots cache from the server again.

-            except ConnectionError:
+            except ConnectionError as e:
+                log.exception("ConnectionError")
+
                 connection.disconnect()
-            except TimeoutError:
+                connection_error_retry_counter += 1
+
+                # Give the node 0.1 seconds to get back up and retry again with same
+                # node and configuration. After 5 attempts then try to reinitialize
+                # the cluster and see if the nodes configuration has changed or not
+                if connection_error_retry_counter < 5:
+                    time.sleep(0.1)
+                else:
+                    self.refresh_table_asap = True
+                    self.connection_pool.nodes.increment_reinitialize_counter(
+                        count=self.connection_pool.nodes.reinitialize_steps,
+                    )
+
+            except TimeoutError as e:
+                log.exception("TimeoutError")
+

Note that regarding the AWS ElastiCache service: since node rebuilds and node management can take longer than 0.5 seconds to reach quorum, you can still end up with the TTL exhausted error if your cluster does not reach a new cluster setup in time. The fix above just ensures that the client will at least attempt to reinitialize the cluster with a new nodes/slots setup from the cluster before failing out. If you, for example, have a big cluster, or many nodes fail over, or similar events are slow, then you might still fail out from the client code with the same error, but the client can't be made to wait forever for your cluster, so it needs to fail out at some point.
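
If your cluster takes longer to stabilize than the client's internal retries allow, the application side has to handle the error itself, something roughly like this sketch (the helper name and the retry numbers here are just examples, nothing built into the client):

import time

from rediscluster.exceptions import ClusterError


def set_with_retry(rc, key, value, attempts=5, delay=2.0):
    # Retry a single SET at the application level while the cluster
    # re-elects a master and the client keeps raising "TTL exhausted"
    for attempt in range(attempts):
        try:
            return rc.set(key, value)
        except ClusterError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)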

Grokzen commented 4 years ago

As an example of output from a failed or very slow cluster rebuild, I got the following from a run where, randomly or for a reason I don't know, it took almost 20 seconds to recover when I shut down a master node. So when you speak of slow AWS ElastiCache rebuilds @aayush-plivo, I do think this is what you are seeing, just on a bigger scale in the cloud, where things might run slower or the re-election process might simply take much longer. But as you said yourself, once the cluster reached a new solid quorum and was stable, the clients started to process data again. In a way I would say that things work properly from the client's point of view and that this is a specific issue with AWS ElastiCache being slow. I bet if you check your logs and internal data you will probably find something similar to the output below, showing that the cluster takes a long time to reach quorum.

redis-cluster_1  | ==> /var/log/supervisor/redis-6.log <==
redis-cluster_1  | 262:M 06 Sep 2020 11:18:28.548 * Marking node 632f9c7f19d323f0c1aefb428df0039dcf96c00a as failing (quorum reached).
redis-cluster_1  | 262:M 06 Sep 2020 11:18:28.548 # Cluster state changed: fail
redis-cluster_1  |
redis-cluster_1  | ==> /var/log/supervisor/redis-1.log <==
redis-cluster_1  | 280:S 06 Sep 2020 11:18:30.363 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:30.363 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:30.363 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:31.368 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:31.369 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:31.369 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:32.373 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:32.373 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:32.374 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:33.379 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:33.380 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:33.380 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:34.385 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:34.385 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:34.385 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:35.389 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:35.389 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:35.389 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:36.392 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:36.392 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:36.392 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:37.298 # Currently unable to failover: Waiting for votes, but majority still not reached.
redis-cluster_1  | 280:S 06 Sep 2020 11:18:37.398 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:37.399 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:37.399 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:37.901 # Currently unable to failover: Failover attempt expired.
redis-cluster_1  | 280:S 06 Sep 2020 11:18:38.404 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:38.404 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:38.404 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:39.410 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:39.410 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:39.410 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:40.415 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:40.416 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:40.416 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:41.421 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:41.422 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:41.424 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:42.430 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:42.430 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:42.430 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:43.434 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:43.434 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:43.435 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:44.440 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:44.441 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:44.441 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:45.447 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:45.447 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:45.447 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:46.451 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:46.451 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:46.451 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:47.455 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:47.455 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:47.455 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:47.957 # Start of election delayed for 674 milliseconds (rank #0, offset 26648).
redis-cluster_1  | 280:S 06 Sep 2020 11:18:48.058 # Currently unable to failover: Waiting the delay before I can start a new failover.
redis-cluster_1  | 280:S 06 Sep 2020 11:18:48.460 * Connecting to MASTER 172.18.0.2:7004
redis-cluster_1  | 280:S 06 Sep 2020 11:18:48.460 * MASTER <-> REPLICA sync started
redis-cluster_1  | 280:S 06 Sep 2020 11:18:48.460 # Error condition on socket for SYNC: Operation now in progress
redis-cluster_1  | 280:S 06 Sep 2020 11:18:48.662 # Starting a failover election for epoch 19.
redis-cluster_1  | 280:S 06 Sep 2020 11:18:48.676 # Currently unable to failover: Waiting for votes, but majority still not reached.
redis-cluster_1  | 280:S 06 Sep 2020 11:18:48.679 # Failover election won: I'm the new master.
redis-cluster_1  | 280:S 06 Sep 2020 11:18:48.679 # configEpoch set to 19 after successful failover
redis-cluster_1  | 280:M 06 Sep 2020 11:18:48.679 # Setting secondary replication ID to 11f4a9ceac9074e97bc0c2223d070242da1880ec, valid up to offset: 26649. New replication ID is f93664cf20b76a25de0a1be14b0abd3ab18f291c
redis-cluster_1  | 280:M 06 Sep 2020 11:18:48.679 * Discarding previously cached master state.
redis-cluster_1  | 280:M 06 Sep 2020 11:18:48.679 # Cluster state changed: ok
redis-cluster_1  |
redis-cluster_1  | ==> /var/log/supervisor/redis-2.log <==
redis-cluster_1  | 271:S 06 Sep 2020 11:18:48.714 # Cluster state changed: ok
redis-cluster_1  |
redis-cluster_1  | ==> /var/log/supervisor/redis-3.log <==
redis-cluster_1  | 62:M 06 Sep 2020 11:18:48.672 # Failover auth granted to c587b5b1a13966c5e20853bc8abc5615d26f1004 for epoch 19
redis-cluster_1  | 62:M 06 Sep 2020 11:18:48.714 # Cluster state changed: ok
ofhellsfire commented 4 years ago

Did a quick test of your latest fix regarding that TTL exception @Grokzen. Looks good. At least I wasn't able to get the same error, and there were no lost messages. Good job!

Made a simple script to test redis cluster failover: https://github.com/ofhellsfire/redis-cluster-dumb-tests

The results of that script:

Key: 0; Message: 2020-09-15 14:54:31.762962
Key: 1; Message: 2020-09-15 14:54:33.766543
Key: 2; Message: 2020-09-15 14:54:35.771643
Key: 3; Message: 2020-09-15 14:54:37.776282
Key: 4; Message: 2020-09-15 14:54:39.780504
Key: 5; Message: 2020-09-15 14:54:41.784550
Key: 6; Message: 2020-09-15 14:54:43.788335
Key: 7; Message: 2020-09-15 14:54:45.792425   <<< It stuck here when one of the masters gets down, but was able to recover
Key: 8; Message: 2020-09-15 14:56:05.212801
Key: 9; Message: 2020-09-15 14:56:07.217364
...                                           <<< Here master is brought up back, all is good
Key: 56; Message: 2020-09-15 14:57:41.464096  <<< Slave gets down, all is good
Key: 57; Message: 2020-09-15 14:57:43.468401  <<< Slave is brought up back, all is good
Key: 58; Message: 2020-09-15 14:57:45.471797
...

Took a look at the fix itself. @Grokzen, does it make sense to give more control over reconnection for different environments (slow hardware, bad networks, etc.)? What I'm talking about are the following lines: https://github.com/Grokzen/redis-py-cluster/blob/master/rediscluster/client.py#L647-L648, which contain hardcoded values. Does it make sense to pull them into the class initializer, so any user can initialize the client in a way that suits them?

Grokzen commented 4 years ago

@ofhellsfire Yes, that is actually a good suggestion. There are several other internal configuration variables set elsewhere in the code, so I will probably add some kind of configuration option for them so the user can decide what values to use if they want to tweak them. I will do some bigger work on all of them and probably move all of the timers and timeout counters into a single configuration dict in the class. I probably want to move away from having kwargs options on the class for all of them and just put them in a dict, like how the RESULT_CALLBACKS dict works in the class. Will see what I figure out, but I will implement your suggestion.
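
As a rough sketch of what I mean (the attribute name and keys below are placeholders I made up for illustration, not anything that exists in the code yet):

# Hypothetical sketch only; names and values are illustrative
class ClusterRetryConfigSketch:

    # All retry timers/counters collected in one overridable class-level dict,
    # similar in spirit to how the RESULT_CALLBACKS dict works on the client
    RETRY_OPTIONS = {
        "connection_error_retry_sleep": 0.1,  # seconds to sleep before retrying the same node
        "connection_error_max_retries": 5,    # attempts before forcing a slots cache refresh
    }

# A user could then tweak the values for a slow environment, e.g.:
# ClusterRetryConfigSketch.RETRY_OPTIONS["connection_error_max_retries"] = 20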

ofhellsfire commented 4 years ago

As an example of output from a failed or very slow cluster rebuild, I got the following from a run where, randomly or for a reason I don't know, it took almost 20 seconds to recover when I shut down a master node.

Just in case: the Redis Cluster failover timeout depends on the cluster configuration, specifically cluster-node-timeout and cluster-slave-validity-factor. It is all described here: https://redis.io/topics/cluster-tutorial
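
For example, these are set per node in redis.conf (the values below are only illustrative):

# Time (in milliseconds) a node must be unreachable before it is considered failing
cluster-node-timeout 15000
# A slave will not attempt a failover if it has been disconnected from its master
# for longer than roughly this factor times cluster-node-timeout
cluster-slave-validity-factor 10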

Grokzen commented 4 years ago

Aha, that makes sense, yes. I can't really use those values in the client for anything, so in the end it would be up to the user to tweak the values on the client side to whatever suits their specific case. Also, a configuration lookup would fail in cases like AWS ElastiCache, where the config option is blocked off from the client side for some reason.

Grokzen commented 4 years ago

@ofhellsfire Can I close this issue as resolved? The configuration options thing I talked about above I can do in a separate issue instead of waiting for it to be done in here.

ofhellsfire commented 4 years ago

@ofhellsfire Can I close this issue as resolved? The configuration options thing I talked about above I can do in a separate issue instead of waiting for it to be done in here.

@Grokzen Sure! Thanks for the fix!