Tronic / redio

Redis client for Python Trio

The connection pool grows enormously #8

Open o-fedorov opened 2 years ago

o-fedorov commented 2 years ago

There are multiple issues with the connection pool:

  1. If a DB instance is not deleted, its connection is never returned to the pool, and there is no guarantee that the __del__ method will be called soon. Since Redis._borrow_connection does not check the number of borrowed connections, the code can end up creating a lot of connections, which causes an OSError: all attempts to connect to <redis host> failed
  2. There can be multiple calls to the __del__ method during garbage collection, adding the same connection to the pool more than once.

It seems to me that the pool management should be revisited; I would also add the ability to explicitly borrow and return a connection.
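To illustrate, here is a rough sketch of explicit borrow/return with a guard against double returns (the Pool class and its method names are made up for illustration and are not the current redio API):

import trio

class Pool:
    """Hypothetical pool with explicit borrow/release and a double-return guard."""

    def __init__(self, max_connections=10):
        self._idle = []                  # connections waiting to be reused
        self._borrowed = set()           # connections currently handed out
        self._limit = trio.Semaphore(max_connections)

    async def borrow(self):
        await self._limit.acquire()      # caps the total number of connections
        conn = self._idle.pop() if self._idle else await self._connect()
        self._borrowed.add(conn)
        return conn

    def release(self, conn):
        if conn not in self._borrowed:   # safety check: a second return is a no-op
            return
        self._borrowed.remove(conn)
        self._idle.append(conn)
        self._limit.release()

    async def _connect(self):
        ...                              # open a new Redis connection here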

Tronic commented 2 years ago

Basically, in CPython deletion should occur immediately when the last reference goes away, and __del__ would only be called again if it resurrected the object (which is not intended to occur). Granted, this is done for convenience at the expense of portability. When running in an environment like IPython or Jupyter Notebook, global variables never get freed, so this could indeed become an issue.
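For example, on plain CPython the following prints "freed" before "after del", because the object is collected by reference counting the moment the last reference is dropped:

class Conn:
    def __del__(self):
        print("freed")     # runs as soon as the refcount hits zero

c = Conn()
del c                      # on CPython, __del__ runs right here
print("after del")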

Are you seeing these problems with some practical code?

A better and perhaps more conventionally managed connection pool would be a welcome option, for which PRs are definitely welcome. Also, if you can fix the current one without breaking the way that it is used (for backwards compatibility), feel free to work on it.

Unfortunately at this time I won't be able to make much development myself, as I am not actively working on this project. But if you need help debugging the details, I can help you with that.

o-fedorov commented 2 years ago

Thank you, @Tronic, for the response. I really did see issues in a real application, namely on a hypercorn+trio+FastAPI service. It quickly used up all available connections to Redis.

To work around the issue I had to wrap connection acquisition with anyio.CapacityLimiter.
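Roughly like this (a simplified sketch of the workaround on the application side; the limit of 10 and the get_cached helper are arbitrary):

import anyio
import redio

redis = redio.Redis("redis://localhost/")
limiter = anyio.CapacityLimiter(10)      # at most 10 tasks may hold a connection

async def get_cached(cache_key):
    # Limiting concurrent borrows keeps the pool from growing without bound
    # even when connections are returned late.
    async with limiter:
        return await redis().get(cache_key)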

Maybe using a capacity limiter internally could help. Also, it would be nice to have an explicit way to put a connection back into the pool, with a safety check to ensure it is not returned twice.

I will think about how the current code could be updated.

o-fedorov commented 2 years ago

Another thing that may be related is that redio quite often raises ProtocolError("Pipelining error: server sent unexpected data"); under high load it raises this error almost constantly.

I suppose it is a common issue; I've seen code that handles a situation like this in aioredis:

        # connections that the pool provides should be ready to send
        # a command. if not, the connection was either returned to the
        # pool before all data has been read or the socket has been
        # closed. either way, reconnect and verify everything is good.

In our application we do a couple of retries in case of ProtocolError, though that requires the operations to be idempotent. I think redio could do better at a lower level.
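For reference, our retry looks roughly like this (a sketch; the import path of ProtocolError is an assumption, and this is only safe because the commands are idempotent):

from redio.exc import ProtocolError      # assumed location of the exception class

async def with_retries(operation, attempts=3):
    """Retry an idempotent Redis operation after a ProtocolError."""
    for attempt in range(attempts):
        try:
            return await operation()
        except ProtocolError:
            if attempt == attempts - 1:
                raise                    # out of attempts, give up

# usage: value = await with_retries(lambda: redis().get(cache_key))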

Tronic commented 2 years ago

As for the pipelining error, I gather that this should not happen: redio should only return known-state idle connections to the pool (others should be disconnected). Probably the handling of that is not quite right for some reason. I would rather first look at why it happens, and implement a transparent retry/reconnect only if there is no other solution.

Can you narrow it down to some specific command, like pubsub? I suspect that something is being returned to the pool that shouldn't be, as the library never received wide testing or review.

o-fedorov commented 2 years ago

There are only two operations in the entire application: redis.get(cache_key) and redis.set(cache_key, entry).expire(cache_key, ttl). In the logs I see failures only for redis.get, but the application also calls redis.get much more often.

o-fedorov commented 2 years ago

Hi @Tronic. I managed to reproduce the ProtocolError. It happens when a connection is broken, for example when it is reset after a period of inactivity.

Here is how you can reproduce it.

  1. Run redis server locally.
  2. Create a redio pool and execute at least one command so that a connection is added to the pool.
  3. Restart redis server.
  4. Execute an operation and observe that ProtocolError("Pipelining error: server sent unexpected data") is raised.

Here is the code you can use to make it happen:

import redio

async def main():
    redis = redio.Redis("redis://localhost/")
    await redis().set("x", 1)
    print("Please restart Redis server, then resume execution. To make the code run use `c` command in pdb")
    breakpoint()
    await redis().get("x")

if __name__ == "__main__":
    import trio
    trio.run(main)

Also, if you remove the check for raw_socket.is_readable() here, the error will be:

Traceback (most recent call last):
  File "<...>/site-packages/trio/_highlevel_socket.py", line 28, in _translate_socket_errors_to_stream_errors
    yield
  File "<...>/site-packages/trio/_highlevel_socket.py", line 136, in receive_some
    return await self.socket.recv(max_bytes)
  File "<...>/site-packages/trio/_socket.py", line 339, in wrapper
    return await self._nonblocking_helper(fn, args, kwargs, wait_fn)
  File "<...>/site-packages/trio/_socket.py", line 590, in _nonblocking_helper
    return fn(self._sock, *args, **kwargs)
ConnectionResetError: [Errno 54] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<...>/main.py", line 15, in <module>
    trio.run(main)
  File "<...>/site-packages/trio/_core/_run.py", line 1946, in run
    raise runner.main_task_outcome.error
  File "<...>/main.py", line 10, in main
    print(await db.get("x"))
  File "<...>/site-packages/redio/highlevel.py", line 79, in _run
    return await self._execute()
  File "<...>/site-packages/redio/highlevel.py", line 97, in _execute
    res = await self.protocol.run(commands)
  File "<...>/site-packages/redio/protocol.py", line 83, in run
    ret = [await self.receive() for _ in commands]
  File "<...>/site-packages/redio/protocol.py", line 83, in <listcomp>
    ret = [await self.receive() for _ in commands]
  File "<...>/site-packages/redio/protocol.py", line 99, in receive
    t, arg = await self._recvline()
  File "<...>/site-packages/redio/protocol.py", line 113, in _recvline
    buffer += await self.sock.receive_some()
  File "<...>/site-packages/trio/_highlevel_socket.py", line 136, in receive_some
    return await self.socket.recv(max_bytes)
  File "<...>/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "<...>/site-packages/trio/_highlevel_socket.py", line 33, in _translate_socket_errors_to_stream_errors
    raise trio.BrokenResourceError(
trio.BrokenResourceError: socket connection broken: [Errno 54] Connection reset by peer

Tronic commented 2 years ago

I guess the library could detect that the connection goes down. The error that you are seeing specifically appears because the Redis server says something as it shuts down the connection, so when the next command is being run, Redio sees that there is something ready to be read that is not a response to any of the commands sent.

This might be best handled by testing the state of the connection as it is handed out of the pool, and creating a new one if it has gone bad. That wouldn't avoid all errors (for example, if the server restarts while the application is using a connection), but at least it would avoid trying to use connections that were disconnected while idle.
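Something along these lines when handing a connection out (a sketch only; the attribute and method names are placeholders, not the actual redio internals):

async def checkout(pool):
    """Hand out an idle connection, discarding any that went bad while idle."""
    while pool.idle_connections:
        conn = pool.idle_connections.pop()
        # An idle connection should have nothing pending to read.  Readable
        # data (e.g. the server's shutdown notice) or a closed socket means
        # its state is no longer known, so drop it instead of reusing it.
        if conn.raw_socket.is_readable():
            await conn.aclose()
            continue
        return conn
    return await pool.new_connection()   # nothing reusable; open a fresh one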