Grokzen / redis-py-cluster

Python cluster client for the official redis cluster. Redis 3.0+.
https://redis-py-cluster.readthedocs.io/
MIT License

Possible connection leak in ClusterBlockingConnectionPool #458

Open FranGM opened 3 years ago

FranGM commented 3 years ago

Hey there!

I've been tracking down an issue that I believe could have been caused (or at least made worse) by a possible connection leak in ClusterBlockingConnectionPool. Let me try to explain my theory to see if it makes sense.

In get_connection_by_node we pull connections from the queue one by one until we find a connection already established to this node or a None (free slot for a connection we can make).

My concern is with the logic that follows if we run out of items to pull from the queue:

            # queue is full of connections to other nodes
            if len(connections_to_other_nodes) == self.max_connections:
                # is the earliest released / longest un-used connection
                connection_to_clear = connections_to_other_nodes.pop()
                self._connections.remove(connection_to_clear)
                connection_to_clear.disconnect()
                connection = None  # get a new connection
            else:
                # Note that this is not caught by the redis cluster client and will be
                # raised unless handled by application code.
                # ``ConnectionError`` is raised when timeout is hit on the queue.
                raise ConnectionError("No connection available")

If we have pulled exactly max_connections connections from the pool and all of them are for different nodes, then we repurpose one of them; otherwise we raise an exception. Note that when raising this exception we don't put any of the connections we pulled back into the pool.
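To make the leak concrete, here is a minimal, hypothetical model of that code path (not the real ClusterBlockingConnectionPool API): a LifoQueue stands in for the pool, tuples stand in for connections, and one connection is checked out, so the queue holds fewer than max_connections items.

```python
import queue

# Hypothetical simplified model of the pool's queue, NOT the real
# ClusterBlockingConnectionPool: slots hold (node, connection_id)
# tuples standing in for established connections.
max_connections = 3
pool = queue.LifoQueue(max_connections)
for conn in [("node-b", 1), ("node-c", 2)]:  # a third connection is checked out
    pool.put(conn)

target_node = "node-a"
pulled = []
try:
    while True:
        item = pool.get_nowait()
        if item is None or item[0] == target_node:
            break
        pulled.append(item)  # connection to some other node
except queue.Empty:
    # Mirroring the reported behaviour: len(pulled) == 2, which is
    # less than max_connections, so the original code would raise
    # ConnectionError here WITHOUT putting `pulled` back -- those
    # two connections are now lost from the pool.
    pass

print(pool.qsize())  # 0 -- the pulled connections never went back
print(len(pulled))   # 2 -- usable connections we could have repurposed
```

The queue ends up empty while two perfectly repurposable connections sit in a local list that is about to go out of scope, which matches the leak described above.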

I think the condition if len(connections_to_other_nodes) == self.max_connections: is a bug: it assumes that all connections are sitting in the queue at all times, but connections that are currently in use won't be in the pool. That means there are cases where the queue holds fewer than max_connections elements (most of the time, really), so we can end up raising an exception even though we are holding connections we could repurpose.

I'm currently testing a fix that changes that line to if len(connections_to_other_nodes) > 0:, but I wanted to hear your opinion on this hypothesis before submitting a PR.
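The proposed change can be sketched as follows. This is an assumed reading of the fix, not the library's actual code: repurpose an other-node connection whenever we pulled at least one, and return any leftovers to the pool instead of dropping them.

```python
import queue

def acquire(pool, pulled, target_node, max_connections):
    """Sketch of the proposed fix (assumed semantics, not the real
    ClusterBlockingConnectionPool code). `pulled` holds other-node
    connections drained from an exhausted queue."""
    if len(pulled) > 0:  # was: len(pulled) == max_connections
        # Repurpose the longest-unused other-node connection; the real
        # pool would call .disconnect() on it before reconnecting.
        stale = pulled.pop()
        conn = (target_node, "new")
    else:
        raise ConnectionError("No connection available")
    # Crucially, return every remaining pulled connection to the pool
    # so they are not leaked.
    for item in pulled:
        pool.put(item)
    return conn

pool = queue.LifoQueue(3)
pulled = [("node-b", 1), ("node-c", 2)]  # drained from the queue earlier
conn = acquire(pool, pulled, "node-a", 3)
print(conn)          # ('node-a', 'new')
print(pool.qsize())  # 1 -- the un-repurposed connection went back
```

With the == max_connections condition, this same scenario (two pulled connections, pool size three) would have raised ConnectionError instead of reusing a slot.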

Grokzen commented 3 years ago

@FranGM I couldn't say for sure; I was not the one who wrote the blocking connection pool implementation, so to say anything about it I would have to dig into that code and understand it before I could really answer your question. I have never been a fan of this shared style of connection pool, as it causes these kinds of issues and others. In 3.0 this kind of error should hopefully be gone (not sure about the blocking solution, though), because each node will get its own individual connection pool again, so each node should sort itself out in the exact same way that redis-py already does. That means even less reimplemented code, which I like much more.

FranGM commented 2 years ago

Hey @Grokzen, sorry for the radio silence; I've been busy lately.

Yeah, that makes sense. We've had our own fix running in production for a while, so I've gone ahead and made a PR for it (along with a couple of other things). We've been running our own internally built version of redis-py-cluster, but we'd love to go back to an unpatched version if possible.