Hi, I merged the related MR that would fix this problem. One comment I had, though, after reading up on all of this, is about `refresh_table_asap=True` being set inside the `MovedError` exception handler. I think that value was implemented the way it was in order to allow the next loop iteration to force-find the master node while still inside the same command you just executed. I am wondering if the nodes are being refreshed by the node initialization that happens before execution enters the while loop, because the value is not reset after it has been used inside the while loop. So what I am thinking is that resetting the flag when it is consumed inside the while loop might also help solve the problem, in addition to what the mentioned MR has added. I think this should restore the intended behaviour of the reinitialize counter so it actually works properly.
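A minimal sketch of that idea (hypothetical names and structure, not the actual rediscluster client code): reset the flag once it has been consumed inside the command loop.

```python
# Minimal sketch of the idea above -- hypothetical names, not the actual
# rediscluster client code. The point is to reset refresh_table_asap once it
# has been consumed inside the command loop, so a single MovedError no longer
# forces a node refresh on every subsequent command.

class MovedError(Exception):
    """Stand-in for the library's MOVED redirection error."""

def execute_command_sketch(client, *args):
    ttl = 16  # retry budget for one logical command
    while ttl > 0:
        ttl -= 1
        if client.refresh_table_asap:
            client.initialize_nodes()          # force-find the master nodes
            client.refresh_table_asap = False  # <-- the proposed reset
        try:
            return client.send_to_target_node(*args)
        except MovedError:
            # The handler may set refresh_table_asap = True again, which the
            # next loop iteration will then consume exactly once.
            client.refresh_table_asap = True
    raise RuntimeError("command failed after all retries")
```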
Through a lot of debugging I came to the same conclusion as issue #333. I'm not trying to duplicate that issue, but I think I can do a better job of explaining what's going on here and how I came to a conclusion as to the root cause.
TL;DR before the wall of text: a `ConnectionResetError: [Errno 104] Connection reset by peer` at the socket level gets handled as a `ConnectionError` in the `Client` object, and in the handler we set `try_random_node`; however, what we actually need to do is retry the same node. Upon trying a random node we get a `MovedError`, which causes the `NodeManager` to reinitialize, which can be very expensive.
Taking a step back: be me a few weeks ago, wrapping up a nice new feature, doing some performance testing. I noticed that about 2% of the time my feature's runtime was over 1s, sometimes up to 3s. Effectively we're just grabbing a bunch of data from redis and performing some very simple transformations. On the quick side of things, runtimes were as low as 10ms. What could possibly explain these extremely high outlier values?
A couple of notes before we go any further, because otherwise some of the numbers below may be confusing: this feature is being tested in containers constrained to 100m of CPU and 100MiB of RAM. Our redis cluster is located in a single data center, but these containers can be in any data center. Average time to ping redis from within the same data center is 0.5ms; from other data centers it's about 35ms. Moving on.
The next step is to look at the time taken by each redis command along with the payload size, in case larger payloads mean longer times.
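For context, the per-command numbers were gathered with something along these lines (an illustrative timing wrapper, not the exact instrumentation we used):

```python
import time

def timed_call(label, fn, *args, **kwargs):
    """Time a single redis call and note the payload size (illustrative helper)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    size = len(result) if hasattr(result, "__len__") else 1
    print(f"{label}: {elapsed_ms:.1f}ms, payload={size}")
    return result

# e.g. timed_call("hgetall user:123", client.hgetall, "user:123")
```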
Here’s a sample outlier case:
Here’s a sample normal case:
What we see is that when the entire feature takes over 1s, almost all of that latency usually falls on one command. I was also able to find examples where the exact same redis command was being executed on two different instances, returning the same payload, where one takes over 1s while the other is in the 1-115ms range.
The next step is to go even deeper: let's run `cProfile` on each redis command and see what's happening differently in an outlier. Here's an outlier case (I will cut out most of the profiling output, it's quite dry):

As you can see, in the outlier case almost all the time is spent reinitializing the `NodeManager`. This proved to be true in all the outlier cases I looked at. So the next question: why is this object being rebuilt? Tracing through the code, we can see that on a `MovedError`:
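Paraphrasing rather than pasting the excerpt (a sketch of roughly equivalent behaviour, not the literal library source):

```python
# Sketch of roughly equivalent behaviour -- not the literal library source.
# Two separate mechanisms end up rebuilding the NodeManager:
#   1. a counter that reinitializes every `reinitialize_steps` MovedErrors, and
#   2. refresh_table_asap, which forces a rebuild before the very next command.

class MovedHandlingSketch:
    def __init__(self, node_manager, reinitialize_steps=25):
        self.nodes = node_manager
        self.refresh_table_asap = False
        self.reinitialize_counter = 0
        self.reinitialize_steps = reinitialize_steps

    def on_moved_error(self):
        self.refresh_table_asap = True   # (2) force a rebuild before the next command
        self.reinitialize_counter += 1   # (1) the "every 25 MovedErrors" path
        if self.reinitialize_counter % self.reinitialize_steps == 0:
            self.nodes.initialize()

    def before_next_command(self):
        if self.refresh_table_asap:
            # Happens even after a single MovedError, regardless of the counter.
            self.nodes.initialize()
            self.refresh_table_asap = False  # whether/where this reset happens is
                                             # part of what is discussed in this issue
```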
So by default every 25 `MovedError`s results in a reinitialization, but a single `MovedError` also results in the `NodeManager` being rebuilt before the next command is run. This by itself seems like a bug: why allow `reinitialize_steps` to be set in the constructor if we're just going to ignore it and reinitialize on the next command anyway?
Next question: why are we seeing any `MovedError`s, period? We don't go around resharding our staging redis cluster, so there must be something else going on. My next course of action was to investigate what's going on inside the `NodeManager`. Long story short, I didn't find anything wrong; the `NodeManager` is solid.

Ok, well, now I have a bunch of extra logging in the `NodeManager`, so let's log out these moved errors and see what we get. Additionally, I logged out the slot returned by the `NodeManager` and the `repr()` of the connection object returned from the `ConnectionPool` we are executing the command against:

Well, obviously we are getting a real `MovedError`, as we are executing against the wrong node in our cluster. However, the `NodeManager` returned the correct slot, so what gives?

Might as well log out all the different exceptions that could also be happening here. To my surprise, in almost 50% of cases we see at least one of the following errors:
Which gets bubbled up as the following redis error:
Which in turn triggers the `Client` to retry on a random node. Upon retrying on said node we have a 5-in-6 chance of getting the wrong node, in which case we get a `MovedError`, which in turn triggers the costly rebuilding of the `NodeManager`. I hope you can understand my confusion, but why would we retry our command on a random node when we know exactly which node we need to retry on? Perhaps we are an odd case for seeing so many connection interruptions, but in our staging and production environments we probably see hundreds of thousands, maybe even millions, of these `ConnectionResetError`s, and not once has it ever been because the node we were communicating with had gone down.

So, on to the important part: how do we fix this? I think the solution in https://github.com/Grokzen/redis-py-cluster/pull/335 certainly works; it's almost identical to the solution I came up with, and I have since tested this exact change in our staging environment and have seen 0 instances of commands taking over a second.
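Conceptually the change boils down to something like this (a simplified sketch of the approach with hypothetical method names, not the actual diff from the PR):

```python
# Simplified sketch of the approach -- not the actual diff from the PR.
# On a connection reset, drop the dead socket and retry against the *same*
# node that owns the slot, instead of flipping to a random node.

def execute_with_same_node_retry(client, command, key, attempts=3):
    slot = client.keyslot(key)
    for attempt in range(attempts):
        node = client.node_for_slot(slot)    # always the node that owns the slot
        try:
            return client.send(node, command, key)
        except ConnectionError:
            client.disconnect(node)          # throw away the broken connection
            if attempt == attempts - 1:
                raise                        # give up after a few tries
```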
An alternative option that I would propose, if you feel strongly that trying a random node is sometimes still the correct way to handle this error: some sort of retry handler with backoff and jitter that tries the correct node, say, 5 times before trying a random node might also be an appropriate way to fix this bug. I will also investigate whether or not there is a different error for a node going down than this `ConnectionResetError`; another option may be to handle them separately.
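For example, something along these lines (hypothetical method names; a sketch of the idea rather than a concrete patch):

```python
import random
import time

# Sketch of the alternative: retry the correct node a few times with backoff and
# jitter, and only then fall back to a random node. Client methods are hypothetical.

def execute_with_backoff(client, command, key, max_correct_node_tries=5, base_delay=0.01):
    slot = client.keyslot(key)
    for attempt in range(max_correct_node_tries):
        node = client.node_for_slot(slot)
        try:
            return client.send(node, command, key)
        except ConnectionError:
            client.disconnect(node)
            # exponential backoff with full jitter
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
    # Only after the correct node has repeatedly failed do we assume it may be
    # down and try a random node (which may answer with a MOVED redirect).
    return client.send(client.random_node(), command, key)
```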
Another alternative is to make trying a random node a configurable option in the `Client`'s constructor (a rough sketch of what I mean follows below), because I'm sure there are other users, like us, for whom this behaviour will never be needed and is just causing these same performance anomalies.

Thank you for taking the time to read this. Please let me know if you have any questions.
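To make that proposal concrete, the kind of constructor option I have in mind would look roughly like this (the flag name is hypothetical):

```python
# Hypothetical flag name -- just to make the proposal concrete.
class RedisClusterSketch:
    def __init__(self, startup_nodes=None, retry_random_node_on_connection_error=True):
        # When False, a ConnectionError is retried against the node that owns the
        # slot; when True, the current behaviour (try a random node) is kept.
        self.retry_random_node_on_connection_error = retry_random_node_on_connection_error
        self.startup_nodes = startup_nodes or []
```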
Also, if it would be more useful for me to open a PR with my version, then fix the tests and stick around to get it polished up and merged, let me know; I don't mind spending the time to get this fix in.