Open FranGM opened 3 years ago
@FranGM I couldn't say for sure. I wasn't the one who wrote the blocking connection pool implementation, so to say anything about it I'd have to dig into that code and understand it before I could really answer your question. I have never really been a fan of this shared style of connection pool, as it causes these kinds of issues and others. In 3.0 this kind of error should hopefully be gone (not sure about the blocking solution though), since each node will get its own individual connection pool again. Each node should then sort itself out in exactly the same way redis-py already does, so there will be even less reimplementation of code needed, which I like much more.
Hey @Grokzen, sorry for the radio silence, been busy lately.

Yeah, that makes sense. We've had our own fix running in production for a while, so I've gone ahead and made a PR for it (along with a couple of other things). We've been running our own internally built version of redis-py-cluster, but we'd love to go back to an unpatched version if possible.
Hey there!
I've been tracking down an issue that I believe could have been caused (or at least made worse) by a possible connection leak in `ClusterBlockingConnectionPool`. Let me try to explain my theory to see if it makes sense.

In `get_connection_by_node` we pull connections from the queue one by one until we find either a connection already established to this node or a `None` (a free slot for a connection we can make).

My concern is with the logic that follows if we run out of items to pull from the queue:
If we have pulled exactly `max_connections` connections from the pool and all of them are for different nodes, then we repurpose one of those connections; otherwise we raise an exception. Note that when raising this exception we don't put any of the connections we might have pulled back into the pool.

I think the check on `if len(connections_to_other_nodes) == self.max_connections:` is a bug, because it implies that all connections are available inside the queue at all times. But connections that are currently in use won't be in the pool, which means there are cases (most of the time, really) when you have fewer than `max_connections` elements in the queue, so we will potentially raise an exception even when we have connections we could repurpose.

I'm currently testing a fix that changes that line to `if len(connections_to_other_nodes) > 0:`, but I wanted to hear your opinion on this hypothesis before submitting a PR.
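To make the scenario concrete, here is a hypothetical, heavily simplified model of the flow described above. The class name mirrors the issue, but the dict "connections", the `queue.LifoQueue`, and the generic exception are all stand-ins of mine; this is not the actual redis-py-cluster implementation.

```python
import queue


class BlockingPoolSketch:
    """Hypothetical, heavily simplified model of the pool behaviour
    described in this issue -- NOT the real redis-py-cluster code.
    Connections are plain dicts and ``None`` marks a free slot."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.pool = queue.LifoQueue(max_connections)
        for _ in range(max_connections):
            self.pool.put_nowait(None)  # every slot starts out free

    def release(self, connection):
        # Return a connection to the queue when the caller is done with it.
        self.pool.put_nowait(connection)

    def get_connection_by_node(self, node):
        connections_to_other_nodes = []
        try:
            while True:
                item = self.pool.get_nowait()
                if item is None or item["node"] == node:
                    # Free slot or matching connection found; put back
                    # everything else we pulled along the way.
                    for other in connections_to_other_nodes:
                        self.pool.put_nowait(other)
                    return item if item is not None else {"node": node}
                connections_to_other_nodes.append(item)
        except queue.Empty:
            # The branch under discussion: repurposing only happens if we
            # pulled exactly max_connections items, which can only be true
            # when no connection was checked out at the time of the call.
            if len(connections_to_other_nodes) == self.max_connections:
                conn = connections_to_other_nodes.pop()
                for other in connections_to_other_nodes:
                    self.pool.put_nowait(other)
                conn["node"] = node  # repurpose for the requested node
                return conn
            # Pulled connections are NOT put back before raising -- the
            # suspected leak. (Exception type/message is a placeholder.)
            raise RuntimeError("no connection available")
```

With `max_connections=2`, one connection checked out, and one "other node" connection sitting in the queue, a request for a different node hits the exhaustion branch, raises, and silently drops the pulled connection, which matches the leak described above.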
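As a sketch of how the proposed `> 0` condition would behave, the helper below extracts just the exhaustion branch into a standalone function. This is a hypothetical illustration of mine, not the real method: connection objects are plain dicts, and the exception is a placeholder.

```python
def handle_exhausted_queue(connections_to_other_nodes, node):
    """Hypothetical sketch of the queue-exhaustion branch with the
    proposed fix applied: repurpose as soon as we hold ANY connection
    to another node, rather than requiring exactly max_connections
    of them."""
    if len(connections_to_other_nodes) > 0:  # was: == self.max_connections
        conn = connections_to_other_nodes.pop()
        conn["node"] = node  # repurpose this connection for the new node
        # The caller should put the remaining pulled connections back
        # into the queue (and should do the same on the error path).
        return conn, connections_to_other_nodes
    raise RuntimeError("no connection available")
```

Even with a single repurposable connection available (the common case when other connections are checked out), this version hands one back instead of raising.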