Grokzen / redis-py-cluster

Python cluster client for the official redis cluster. Redis 3.0+.
https://redis-py-cluster.readthedocs.io/
MIT License

Bunch of exceptions occurred when `scan_iter()` method is called during failover #389

Open ofhellsfire opened 4 years ago

ofhellsfire commented 4 years ago

Issue Description: A bunch of exceptions are raised (redis.exceptions.ConnectionError among them) when the scan_iter() method is called during a failover.

Scenario

Env

Steps to reproduce
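
A script along these lines reproduces it; this is a rough reconstruction from the traceback and output below, so the startup node address, the key-seeding loop, and the outer loop are assumptions. Run it against the cluster and trigger a master failover while it is iterating.

    # rediscluster_failover_scan_iter_test.py (approximate reconstruction)
    import datetime

    from rediscluster import RedisCluster

    # Assumption: any reachable cluster node works as a startup node.
    rc = RedisCluster(startup_nodes=[{'host': '172.19.0.5', 'port': 6379}],
                      decode_responses=True)

    # Assumption: the cluster is seeded with small integer keys,
    # matching the "Key: 8", "Key: 13", ... lines in the output.
    for i in range(50):
        rc.set(i, i)

    # Scan repeatedly; trigger a master failover while this loop runs.
    while True:
        for key in rc.scan_iter(match='*', count=10):
            print('Key: {}: {}'.format(key, datetime.datetime.now()))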

Expected result: Script proceeds without exceptions.

Actual result: Script gets stuck and a bunch of exceptions are raised.

Output

...
Key: 8: 2020-09-15 17:16:54.135722
Key: 13: 2020-09-15 17:16:54.135748
Key: 49: 2020-09-15 17:16:54.135776
Traceback (most recent call last):
  File "rediscluster_failover_scan_iter_test.py", line 25, in <module>
    for key in rc.scan_iter(match='*', count=10):
  File "/home/venv/lib/python3.6/site-packages/rediscluster/client.py", line 969, in scan_iter
    raw_resp = conn.read_response()
  File "/home/venv/lib/python3.6/site-packages/redis/connection.py", line 739, in read_response
    response = self._parser.read_response()
  File "/home/venv/lib/python3.6/site-packages/redis/connection.py", line 324, in read_response
    raw = self._buffer.readline()
  File "/home/venv/lib/python3.6/site-packages/redis/connection.py", line 256, in readline
    self._read_from_socket()
  File "/home/venv/lib/python3.6/site-packages/redis/connection.py", line 201, in _read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.
Key: 45: 2020-09-15 17:16:54.139949
...
Key: 49: 2020-09-15 17:16:54.140345
Traceback (most recent call last):
  File "/home/venv/lib/python3.6/site-packages/redis/connection.py", line 559, in connect
    sock = self._connect()
  File "/home/venv/lib/python3.6/site-packages/redis/connection.py", line 615, in _connect
    raise err
  File "/home/venv/lib/python3.6/site-packages/redis/connection.py", line 603, in _connect
    sock.connect(socket_address)
OSError: [Errno 113] No route to host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "rediscluster_failover_scan_iter_test.py", line 25, in <module>
    for key in rc.scan_iter(match='*', count=10):
  File "/home/venv/lib/python3.6/site-packages/rediscluster/client.py", line 967, in scan_iter
    conn.send_command(*pieces)
  File "/home/venv/lib/python3.6/site-packages/redis/connection.py", line 726, in send_command
    check_health=kwargs.get('check_health', True))
  File "/home/venv/lib/python3.6/site-packages/redis/connection.py", line 698, in send_packed_command
    self.connect()
  File "/home/venv/lib/python3.6/site-packages/redis/connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 113 connecting to 172.19.0.5:6379. No route to host.
Key: 45: 2020-09-15 17:18:01.548966
...
ofhellsfire commented 4 years ago

@Grokzen Just in case, I'm sharing how we temporarily fixed/patched this issue for ourselves:

    def scan_iter(self, match=None, count=None, _type=None):
        success_flg = False
        retry_count = -1
        while not success_flg and retry_count < self.cluster_down_retry_attempts:
            try:
                # Restarting the generator begins a fresh SCAN from cursor 0,
                # so keys yielded before the failure may be yielded again.
                yield from self._scan_iter(match, count, _type)
                success_flg = True
            except ConnectionError:
                # Drop all pooled connections and rebuild the node table so the
                # next attempt sees the post-failover cluster topology.
                self.connection_pool.disconnect()
                self.connection_pool.nodes.reset()
                retry_count += 1
                if retry_count < self.cluster_down_retry_attempts:
                    time.sleep(self.cluster_down_retry_timeout)
                else:
                    raise

...
    # original scan_iter()
    def _scan_iter(self, match=None, count=None, _type=None):
        ...

I know this is not optimal (you've written about it somewhere), but we needed a quick fix: on the one hand we cannot afford the Redis client failing outright, and on the other hand handling the failure in the application code would not be optimal in the long run.
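
For anyone who would rather not patch client.py, roughly the same retry logic can be sketched as an application-side wrapper. This is only a sketch: the function name and the attempts/delay parameters are ours, not part of redis-py-cluster, and because each retry restarts the SCAN from cursor 0, callers may see duplicate keys.

    import time

    from redis.exceptions import ConnectionError

    def retry_scan_iter(rc, match=None, count=None, attempts=3, delay=1.0):
        """Retry rc.scan_iter() on ConnectionError, resetting the pool between tries.

        Each retry restarts the scan from cursor 0, so duplicates are possible.
        """
        for attempt in range(attempts):
            try:
                yield from rc.scan_iter(match=match, count=count)
                return
            except ConnectionError:
                # Same recovery as the patch above: drop connections and
                # rebuild the node table after a failover.
                rc.connection_pool.disconnect()
                rc.connection_pool.nodes.reset()
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)

    # Deduplicate on the consumer side if repeated keys matter:
    seen = set()
    for key in retry_scan_iter(rc, match='*', count=10):
        if key not in seen:
            seen.add(key)
            print(key)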