Grokzen / redis-py-cluster

Python cluster client for the official redis cluster. Redis 3.0+.
https://redis-py-cluster.readthedocs.io/
MIT License

Unreachable node gets selected every time causing TTL exhausted #484

Closed tahamr83 closed 2 years ago

tahamr83 commented 2 years ago

Summary

We have a Redis cluster that runs on Kubernetes. After one of our nodes crashed, the StatefulSet created a new pod, which left the old pod IP of the Redis node unroutable. The redis-py-cluster client was for some reason unable to obtain the new state of the cluster, and we see a lot of TTL exhausted errors.

Even if the dead, unroutable Redis node is selected, shouldn't the client recover the cluster state and remove the dead node from its list? Instead, what we see is that all 16 TTL attempts select the exact same node, and we finally get a TTL exhausted error.

[2021-10-01 13:35:36 DEBUG    rediscluster.client TaInZR2tRRDy_Bi3hLmSbw..] - TTL loop : 15
[2021-10-01 13:35:36 DEBUG    rediscluster.client TaInZR2tRRDy_Bi3hLmSbw..] - Determined node to execute : {'host': '10.244.7.123', 'port': 6379, 'name': '10.244.7.123:6379', 'server_type': 'master'}
[2021-10-01 13:35:39 ERROR    rediscluster.client TaInZR2tRRDy_Bi3hLmSbw..] - ConnectionError
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 559, in connect
    sock = self._connect()
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 615, in _connect
    raise err
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 603, in _connect
    sock.connect(socket_address)
OSError: [Errno 113] No route to host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/rediscluster/client.py", line 630, in _execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 726, in send_command
    check_health=kwargs.get('check_health', True))
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 698, in send_packed_command
    self.connect()
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 113 connecting to 10.244.7.123:6379. No route to host.
[2021-10-01 13:35:36 DEBUG    rediscluster.client TaInZR2tRRDy_Bi3hLmSbw..] - TTL loop : 3
[2021-10-01 13:35:36 DEBUG    rediscluster.client TaInZR2tRRDy_Bi3hLmSbw..] - Determined node to execute : {'host': '10.244.7.123', 'port': 6379, 'name': '10.244.7.123:6379', 'server_type': 'master'}
[2021-10-01 13:35:39 ERROR    rediscluster.client TaInZR2tRRDy_Bi3hLmSbw..] - ConnectionError
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 559, in connect
    sock = self._connect()
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 615, in _connect
    raise err
  File "/usr/local/lib/python3.7/site-packages/redis/connection.py", line 603, in _connect
    sock.connect(socket_address)
OSError: [Errno 113] No route to host

Why does the client select the dead 10.244.7.123:6379 node every time, instead of trying another random live node and then fetching the new cluster state?

Grokzen commented 2 years ago

Even if the dead, unroutable Redis node is selected, shouldn't the client recover the cluster state and remove the dead node from its list? Instead, what we see is that all 16 TTL attempts select the exact same node, and we finally get a TTL exhausted error.

I am guessing here, but the issue is probably that your Redis cluster is not booting out the node in question. This client is coded so that it only uses what the Redis server reports as the current cluster state; if the node is not booted out, it will remain in the client's node table even if it is unreachable. That is just the reference logic a client should implement.
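To make that concrete: for a keyed command the client hashes the key to a hash slot and looks up the slot owner in the slot map it last fetched from the cluster, so the same key resolves to the same (dead) master on every attempt until that map changes. A rough way to inspect that mapping yourself (a sketch only; it assumes the node manager exposes keyslot() and slots, which can vary between versions, and the addresses are placeholders):

```python
from rediscluster import RedisCluster

# Placeholder startup node; use any reachable node in your cluster.
rc = RedisCluster(startup_nodes=[{"host": "10.244.7.1", "port": "6379"}],
                  decode_responses=True)

key = "some-key-that-keeps-failing"

# The slot for a key is deterministic (CRC16(key) % 16384), so every attempt
# for this key is routed to whichever master the cached slot map assigns to
# that slot -- in your case the unreachable 10.244.7.123:6379.
slot = rc.connection_pool.nodes.keyslot(key)
print("slot:", slot)
print("owner:", rc.connection_pool.nodes.slots[slot][0])
```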

You can probably verify this by checking what CLUSTER INFO, CLUSTER SLOTS and CLUSTER NODES return when a node drops, and by monitoring the master nodes to see how long it takes for the cluster to reach a new consensus that eventually propagates to all clients.
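Something along these lines (a rough sketch using plain redis-py against a node that is still reachable; the address is a placeholder) can be left running while the pod is down to watch when the cluster itself changes its view:

```python
import time

import redis

# Point this at any master that is still reachable (placeholder address).
r = redis.Redis(host="10.244.7.1", port=6379, decode_responses=True)

# Poll the cluster's own view of itself. The client can only re-route once
# CLUSTER NODES stops reporting the dead address as a healthy master (it
# should first be flagged "fail", then a replica gets promoted).
for _ in range(30):
    print(r.execute_command("CLUSTER INFO"))    # watch cluster_state / known nodes
    print(r.execute_command("CLUSTER NODES"))   # watch for the fail flag on the dead node
    time.sleep(2)
```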

Also note that if you check the client code in the ConnectionError section here https://github.com/Grokzen/redis-py-cluster/blob/master/rediscluster/client.py#L647 you will see further down at https://github.com/Grokzen/redis-py-cluster/blob/master/rediscluster/client.py#L660 that the code should attempt a full node table refresh after 5 connection errors and do a full reinitialize of the cluster state. That circles back to the initial point: your cluster has not reached a new consensus, and that is the root cause of your issue.
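For anyone skimming this later, that branch boils down to roughly the following (a simplified paraphrase of the linked _execute_command loop, not the actual library code; the three callables are illustrative stand-ins):

```python
import time

from redis.exceptions import ConnectionError
from rediscluster.exceptions import ClusterError


def execute_with_ttl(pick_node_for_slot, send_to_node, refresh_slot_map, ttl=16):
    """Sketch of the retry loop: pick_node_for_slot() consults the cached slot
    map, send_to_node() issues the command, refresh_slot_map() re-fetches
    CLUSTER SLOTS and rebuilds the map."""
    connection_error_retry_counter = 0
    while ttl > 0:
        ttl -= 1
        node = pick_node_for_slot()           # same slot map -> same (dead) master
        try:
            return send_to_node(node)
        except ConnectionError:
            connection_error_retry_counter += 1
            if connection_error_retry_counter < 5:
                time.sleep(0.25)              # short pause, then retry the same node
            else:
                # After 5 connection errors, force a full slot-map rebuild.
                # If the cluster still reports the dead node as the slot
                # owner, the rebuilt map is identical and the loop keeps
                # picking the same address until the TTL runs out.
                refresh_slot_map()
    raise ClusterError("TTL exhausted.")
```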

tahamr83 commented 2 years ago

Thank you so much for your analysis, and apologies for opening an unnecessary issue. This seems to be a cluster problem rather than a client issue.