Grokzen / redis-py-cluster

Python cluster client for the official redis cluster. Redis 3.0+.
https://redis-py-cluster.readthedocs.io/
MIT License

Bug(ConnectionResetError): handling causes MovedErrors and expensive NodeManager rebuilds #362

Closed · kylepad closed this issue 4 years ago

kylepad commented 4 years ago

Through a lot of debugging I came to the same conclusion as issue #333. I'm not trying to duplicate that issue, but I think I can do a better job of explaining what's going on here and how I arrived at the root cause.

TL;DR before the wall of text: a ConnectionResetError: [Errno 104] Connection reset by peer at the socket level gets handled as a ConnectionError in the Client object, and the handler sets try_random_node; what we actually need to do is retry the same node. Upon trying a random node we usually get a MovedError, which causes the NodeManager to reinitialize, and that can be very expensive.
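
For context, here is roughly what the relevant part of the client's retry loop does (paraphrased and simplified from rediscluster/client.py, not the exact source):

```python
# Paraphrased, simplified sketch of the client's command loop; it is not
# the verbatim rediscluster source, just the failure mode described above.
ttl = self.RedisClusterRequestTTL
try_random_node = False
while ttl > 0:
    ttl -= 1
    if try_random_node:
        # After a ConnectionError we land here: with six nodes there is
        # only a 1 in 6 chance of picking the slot's owner again.
        connection = self.connection_pool.get_random_connection()
        try_random_node = False
    else:
        node = self.connection_pool.get_node_by_slot(slot)
        connection = self.connection_pool.get_connection_by_node(node)
    try:
        connection.send_command(*args)
        return self.parse_response(connection, command, **kwargs)
    except ConnectionError:
        # A reset socket is treated the same as a dead node...
        connection.disconnect()
        try_random_node = True
    except MovedError:
        # ...and the wrong node then answers MOVED, scheduling the
        # expensive NodeManager rebuild before the next command.
        self.refresh_table_asap = True
        self.connection_pool.nodes.increment_reinitialize_counter()
```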

Taking a step back: picture me a few weeks ago, wrapping up a nice new feature and doing some performance testing. I noticed that about 2% of the time my feature's runtime was over 1s, sometimes up to 3s. Effectively we're just grabbing a bunch of data from Redis and performing some very simple transformations. On the quick side, runtimes were as low as 10ms. What could possibly explain these extremely high outlier values?

A couple of notes before we go any further, because otherwise some of the numbers below may be confusing. This feature is being tested in containers constrained to 100m of CPU and 100MiB of RAM. Our Redis cluster is located in a single data center, but these containers can be in any data center. Average time to ping Redis from within the same data center is 0.5ms; from other data centers it's about 35ms. Moving on.

The next step is to look at the performance time for each Redis command and the payload size, in case larger payloads mean longer times.

Here’s a sample outlier case:

realtime performance (rc.hgetall): 1.195277 
realtime performance (rc.lrange): 0.000592 
realtime performance (rc.lrange): 0.000815 
realtime performance (feature): 1.198617 

Here’s a sample normal case:

realtime performance (rc.hgetall): 0.067613 
realtime performance (rc.lrange): 0.069521
realtime performance (feature): 0.144102 
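
For reference, a minimal sketch of the kind of timing wrapper that produces log lines like the ones above (the helper itself is illustrative, not our exact logging code):

```python
import time
from contextlib import contextmanager

@contextmanager
def realtime(label):
    """Log wall-clock time for a block, in the format shown above."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print("realtime performance (%s): %f" % (label, elapsed))

# usage:
# with realtime("rc.hgetall"):
#     data = rc.hgetall(key)
```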

What we see is that when the entire feature takes over 1s, almost all of that latency usually falls on one command. I was also able to find examples where the exact same Redis command was executed on two different instances, returning the same payload: one took over 1s while the other was in the 1-115ms range.

Next we go even deeper: let's run cProfile on each Redis command and see what's happening differently in an outlier. Here's an outlier case (I will cut out most of the profiling output, it's quite dry):

         84261 function calls (84108 primitive calls) in 2.086 seconds
   Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.409    0.409    2.084    2.084 /usr/local/lib/python3.6/site-packages/rediscluster/nodemanager.py:142(initialize)
    16384    1.103    0.000    1.104    0.000 /usr/local/lib/python3.6/site-packages/rediscluster/nodemanager.py:197(<listcomp>)
    16394    0.263    0.000    0.456    0.000 /usr/local/lib/python3.6/site-packages/rediscluster/nodemanager.py:277(make_node_obj)
    16394    0.193    0.000    0.193    0.000 {method 'format' of 'str' objects}
    16409    0.095    0.000    0.095    0.000 {method 'append' of 'list' objects}
        2    0.000    0.000    0.016    0.008 /usr/local/lib/python3.6/site-packages/redis/client.py:748(execute_command)
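
For anyone wanting to reproduce this, a minimal sketch of how a per-command profile like the one above can be captured:

```python
import cProfile
import io
import pstats

def profile_call(fn, *args, **kwargs):
    """Run a single callable under cProfile and print cumulative stats."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn(*args, **kwargs)
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
    print(buf.getvalue())
    return result

# e.g. profile_call(rc.hgetall, some_key)
```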

As the profile shows, in the outlier case almost all the time is spent reinitializing the NodeManager. This proved true in every outlier case I looked at. So the next question: why is this object being rebuilt? Tracing through the code, we can see that on a MovedError:

self.refresh_table_asap = True
self.connection_pool.nodes.increment_reinitialize_counter()

So by default every 25 MovedErrors result in a reinitialization, but a single MovedError also results in the NodeManager being rebuilt before the next command runs. This by itself seems like a bug: why allow reinitialize_steps to be set in the constructor if we're just going to ignore it and reinitialize on the next command anyway?
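
For context, my reading of the counter logic in nodemanager.py, boiled down (a simplified sketch, not the verbatim source):

```python
# Simplified sketch of NodeManager's counter behaviour as I read it:
def increment_reinitialize_counter(self, count=1):
    self.reinitialize_counter += count
    if self.reinitialize_counter % self.reinitialize_steps == 0:
        self.initialize()  # full, expensive slot-map rebuild

# Meanwhile the client also sets refresh_table_asap = True on every
# MovedError, and the next command calls initialize() when it sees the
# flag -- which is what makes reinitialize_steps effectively dead code.
```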

Next question: why are we seeing any MovedErrors at all? We don't go around resharding our staging Redis cluster, so something else must be going on. My next course of action was to investigate what's happening inside the NodeManager. Long story short, I didn't find anything wrong; the NodeManager is solid.

OK, well, now I have a bunch of extra logging in the NodeManager, so let's log these MovedErrors and see what we get. I also logged the slot returned by the NodeManager and the repr() of the connection object returned by the ConnectionPool that we execute the command against:

rediscluster.exceptions.MovedError: 11566 0.0.0.12:6379
connection: "ClusterConnection<host=0.0.0.2,port=6379,db=0>" 
slot: 11566 

Well, obviously we are getting a real MovedError, since we are executing against the wrong node in our cluster. However, the NodeManager returned the correct slot, so what gives?
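
To see why the slot itself can't really be wrong: it is pure arithmetic over the key, CRC16 modulo 16384, with hash tags honoured. A sketch of what NodeManager.keyslot does (the crc16 import path is an assumption on my part):

```python
from rediscluster.crc import crc16  # import path assumed

def keyslot(key):
    """Standard Redis Cluster slot calculation: CRC16 of the key
    (or of its {hash tag}, if present) modulo 16384."""
    key = str(key)
    start = key.find("{")
    if start > -1:
        end = key.find("}", start + 1)
        if end > -1 and end != start + 1:
            key = key[start + 1:end]  # only the hash tag is hashed
    return crc16(key) % 16384

# The key behind the MOVED reply above hashes to 11566, exactly the
# slot the NodeManager returned -- so the routing data was correct.
```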

Might as well log all the other exceptions that could be happening here too. To my surprise, in almost 50% of cases we see at least one of the following errors:

"Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 190, in _read_from_socket
    data = recv(self._sock, socket_read_size)
  File "/usr/local/lib/python3.6/site-packages/redis/_compat.py", line 71, in recv
    return sock.recv(*args, **kwargs)
ConnectionResetError: [Errno 104] Connection reset by peer

This gets bubbled up as the following redis error:

redis.exceptions.ConnectionError: Error while reading from socket: (104, 'Connection reset by peer')

This in turn triggers the Client to retry on a random node. On a random node we have a 5 in 6 chance of hitting the wrong node (one correct node out of six), in which case we get a MovedError, which in turn triggers the costly rebuild of the NodeManager. I hope you can understand my confusion: why would we retry our command on a random node when we know exactly which node we need to retry on? Perhaps we are an odd case for seeing so many connection interruptions, but in our staging and production environments we probably see hundreds of thousands, maybe even millions, of these ConnectionResetErrors, and not once has it been because the node we were communicating with had gone down.

So on to the important part: how do we fix this? I think the solution in https://github.com/Grokzen/redis-py-cluster/pull/335 certainly works; it's almost identical to the solution I came up with. I have since tested this exact change in our staging environment and have seen zero instances of commands taking over a second.
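
The essential idea of the change I tested (the same as #335) fits in a few lines; pool and send here are stand-ins for the cluster connection pool and command sender, not the library's exact internals:

```python
from redis.exceptions import ConnectionError

def execute_on_slot_owner(pool, slot, send, max_retries=3):
    """Sketch of the fix: on a ConnectionError, drop the dead connection
    and retry the SAME slot owner, since the slot map is still valid."""
    for attempt in range(max_retries):
        connection = pool.get_connection_by_slot(slot)
        try:
            return send(connection)
        except ConnectionError:
            connection.disconnect()  # the peer reset the socket
            if attempt == max_retries - 1:
                raise  # node genuinely unreachable: give up (or go random)
        finally:
            pool.release(connection)
```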

An alternative option I would propose, if you feel strongly that trying a random node is sometimes still the correct way to handle this error: some sort of retry handler with backoff and jitter that tries the correct node, say, 5 times before falling back to a random node might also be an appropriate fix, as sketched below. I will also investigate whether a node going down produces a different error than this ConnectionResetError; another option may be to handle the two cases separately.
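
Sketched, with illustrative delays and attempt counts:

```python
import random
import time

def retry_with_backoff(try_correct_node, try_random_node,
                       attempts=5, base_delay=0.01, max_delay=0.5):
    """Illustrative retry policy: favour the node we know owns the slot,
    backing off with full jitter, before ever going to a random node."""
    for attempt in range(attempts):
        try:
            return try_correct_node()
        except ConnectionError:
            # exponential backoff capped at max_delay, with full jitter
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
    return try_random_node()
```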

Another alternative is to make trying a random node a configurable option in the Client's constructor, because I'm sure there are other users, like us, for whom this behaviour will never be needed and is just causing the same performance anomalies.
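
For example (the flag name here is hypothetical, not an existing rediscluster option):

```python
from rediscluster import RedisCluster

# Hypothetical flag, named for illustration only: opt out of the
# random-node fallback and always retry the known slot owner.
rc = RedisCluster(
    startup_nodes=[{"host": "127.0.0.1", "port": 7000}],
    retry_random_node_on_connection_error=False,
)
```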

Thank you for taking the time to read this. Please let me know if you have any questions.

Also, if it would be more useful for me to open a PR with my version, fix the tests, and stick around to get it polished up and merged, let me know; I don't mind spending the time to get this fix in.

Grokzen commented 4 years ago

Hi, I merged the related MR that should fix this problem. One comment I had, though, after reading through all of this, concerns refresh_table_asap=True being set inside the MovedError handler. I think that value was implemented the way it was in order to let the next loop iteration force-find the master node while still inside the same command you just executed. I am wondering if the nodes are being refreshed because of the node initialize that happens before execution enters the while loop, since the value is never reset after the inside of the while loop has used it. So what I am thinking is that resetting the flag when it is consumed inside the while loop might also help solve the problem, in addition to what the mentioned MR added. I think this should bring back the intended behaviour of the reinitialize counter so it actually works properly.
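
Something along these lines (a sketch of the idea, not a tested patch):

```python
# Sketch (untested): consume refresh_table_asap inside the command loop
# so a single MovedError forces at most one rebuild, and the
# reinitialize_steps counter keeps its intended meaning.
def _execute_command(self, *args, **kwargs):
    ttl = self.RedisClusterRequestTTL
    while ttl > 0:
        ttl -= 1
        if self.refresh_table_asap:
            self.connection_pool.nodes.initialize()
            self.refresh_table_asap = False  # reset the flag once used
        # ... pick node, send command, handle MovedError as before ...
```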