wuub opened 3 years ago
Ok, so now I'm pretty sure I can recreate the issue:

1. `docker-compose up` the bot with redis running alongside it
2. `docker pause bot_redis_1`
3. `docker stop bot_redis_1`:

   kodiak_1 | result = yield from answer_f
   kodiak_1 | asyncio_redis.exceptions.ConnectionLostError: [Errno 104] Connection reset by peer

4. `docker start bot_redis_1`
5. the in-use connection count in the redis pool will stay above 1, often at the number of webhooks which were blocked by `docker pause` — which is not what should happen, as they are occupying the pool indefinitely now
6. rinse + repeat the above failure scenario and you have yourself a redis pool which is a) doing nothing, b) not able to perform any more work
Full TB for easier debugging:
kodiak_1 | ERROR uvicorn:httptools_impl.py:371 Exception in ASGI application
kodiak_1 | Traceback (most recent call last):
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/uvicorn/protocols/http/httptools_impl.py", line 368, in run_asgi
kodiak_1 | result = await app(self.scope, self.receive, self.send)
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/fastapi/applications.py", line 208, in __call__
kodiak_1 | await super().__call__(scope, receive, send)
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/starlette/applications.py", line 112, in __call__
kodiak_1 | await self.middleware_stack(scope, receive, send)
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/starlette/middleware/errors.py", line 181, in __call__
kodiak_1 | raise exc from None
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/starlette/middleware/errors.py", line 159, in __call__
kodiak_1 | await self.app(scope, receive, _send)
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/sentry_sdk/integrations/asgi.py", line 106, in _run_asgi3
kodiak_1 | return await self._run_app(scope, lambda: self.app(scope, receive, send))
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/sentry_sdk/integrations/asgi.py", line 152, in _run_app
kodiak_1 | raise exc from None
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/sentry_sdk/integrations/asgi.py", line 149, in _run_app
kodiak_1 | return await callback()
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/starlette/exceptions.py", line 82, in __call__
kodiak_1 | raise exc from None
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/starlette/exceptions.py", line 71, in __call__
kodiak_1 | await self.app(scope, receive, sender)
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/starlette/routing.py", line 580, in __call__
kodiak_1 | await route.handle(scope, receive, send)
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/starlette/routing.py", line 241, in handle
kodiak_1 | await self.app(scope, receive, send)
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/starlette/routing.py", line 52, in app
kodiak_1 | response = await func(request)
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/fastapi/routing.py", line 220, in app
kodiak_1 | dependant=dependant, values=values, is_coroutine=is_coroutine
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/fastapi/routing.py", line 152, in run_endpoint_function
kodiak_1 | return await dependant.call(**values)
kodiak_1 | File "/var/app/kodiak/main.py", line 101, in webhook_event
kodiak_1 | queue=redis_webhook_queue, event_name=github_event, payload=event
kodiak_1 | File "/var/app/kodiak/queue.py", line 291, in handle_webhook_event
kodiak_1 | await pr_review(queue, PullRequestReviewThreadEvent.parse_obj(payload))
kodiak_1 | File "/var/app/kodiak/queue.py", line 177, in pr_review
kodiak_1 | installation_id=str(review.installation.id),
kodiak_1 | File "/var/app/kodiak/queue.py", line 542, in enqueue
kodiak_1 | transaction = await self.connection.multi()
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/asyncio_redis/protocol.py", line 733, in wrapper
kodiak_1 | result = yield from method(protocol_self, _NoTransaction, *a, **kw)
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/asyncio_redis/protocol.py", line 2824, in multi
kodiak_1 | result = yield from self._query(tr, b"multi", _bypass=True)
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/asyncio_redis/protocol.py", line 1231, in _query
kodiak_1 | transaction, answer_f, _bypass=_bypass, call=call
kodiak_1 | File "/var/app/.venv/lib/python3.7/site-packages/asyncio_redis/protocol.py", line 1155, in _get_answer
kodiak_1 | result = yield from answer_f
kodiak_1 | asyncio_redis.exceptions.ConnectionLostError: [Errno 104] Connection reset by peer
For the hosted version of Kodiak we have a large connection pool and do nightly restarts, which I think is why this bug hasn't been an issue for us.
Is your Redis instance configured to timeout connections after some time? That caused this similar issue: https://github.com/chdsbd/kodiak/issues/694#issuecomment-893491026
I'm not exactly sure how this issue should be fixed, but an update needs to be made to the forked redis library: https://github.com/chdsbd/asyncio-redis
This issue seems related: https://github.com/jonathanslenders/asyncio-redis/issues/109
I think any time a connection is used in asyncio-redis, the connection should be wrapped with a context manager that will return the connection to the pool if there is an exception. I think that would prevent leaking connections.
I think the existing pool implementation needs to be improved, because I don't see an obvious place to add this new logic: https://github.com/chdsbd/asyncio-redis/blob/master/asyncio_redis/pool.py
I think we probably want something similar to the asyncpg pool API: https://magicstack.github.io/asyncpg/current/usage.html#connection-pools
pool = Redis.create_pool()
async with pool.acquire() as connection:
    user = await connection.get('user:123')
    print(user['id'])
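One way to make the "return the connection on exception" idea concrete is an `acquire()` async context manager. This is a minimal sketch, not asyncio-redis's actual API: `SimplePool` and its internals are hypothetical names for illustration.

```python
# Sketch: a pool whose connections are always handed back, even when the
# caller raises. SimplePool and acquire() are hypothetical names; the real
# asyncio-redis pool would need equivalent logic added.
import asyncio
from contextlib import asynccontextmanager


class SimplePool:
    def __init__(self, connections):
        self._free = asyncio.Queue()
        for conn in connections:
            self._free.put_nowait(conn)

    @asynccontextmanager
    async def acquire(self):
        conn = await self._free.get()
        try:
            yield conn
        finally:
            # Always return the connection, exception or not.
            self._free.put_nowait(conn)


async def demo():
    pool = SimplePool(["conn-a", "conn-b"])
    try:
        async with pool.acquire() as conn:
            raise RuntimeError("simulated query failure")
    except RuntimeError:
        pass
    # Despite the exception, both connections are back in the pool.
    return pool._free.qsize()
```

The key point is the `try/finally` around the `yield`: the connection is released on every exit path, so a failed query can no longer leak a pool slot.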
> Is your Redis instance configured to timeout connections after some time?
It's AWS ElastiCache 5.x Redis on default settings. IIUC there's no enforced timeout by default there. OTOH it's AWS; we don't need a configured timeout for the network to blip from time to time on its own :)
> we have a large connection pool
I also bumped the pool size to 200 from the default 20 during the previous outage; I then wrongly assumed it was due to traffic spikes. This also explains why Kodiak slows down its PR/webhook processing loop when nearing the outage and picks up again just after restart: it's waiting for a free redis connection.
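This "slow down, then stall" pattern falls out of how a pool waits for free connections. A minimal illustration (not Kodiak code; all names here are made up for the demo) of one leaked connection in a size-1 pool:

```python
# Minimal illustration of why a leaked pool looks like a slowdown before a
# stall: callers wait on a free-connection queue, and each leaked
# connection permanently removes one slot.
import asyncio


async def demo():
    free = asyncio.Queue()
    free.put_nowait("conn-1")  # pool of size 1

    # A "leaky" caller takes the connection and never returns it.
    await free.get()

    # The next caller now blocks indefinitely; we bound it with a timeout
    # here so the demo terminates.
    try:
        await asyncio.wait_for(free.get(), timeout=0.1)
        return "got connection"
    except asyncio.TimeoutError:
        return "stalled waiting for a free connection"


print(asyncio.run(demo()))
# → stalled waiting for a free connection
```

With a larger pool the same mechanism just takes longer: each leaked connection shrinks the effective pool, so waits grow gradually until no free connection remains.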
re: fix for pool
Yeah, I assumed so when I didn't find any proper exception handling in asyncio_redis. I'll do my own investigation as well; maybe there's a cheap way to put a bandage on it until a proper fix is ready.
I don't have any specific reproduction steps yet, but it seems there exists a path in which connections in the redis connection pool are not released properly (left in use).
This causes the pool to slowly fill up and ultimately all progress stalls here: https://github.com/jonathanslenders/asyncio-redis/blob/master/asyncio_redis/pool.py#L140
I'm adding some metrics right now to catch this red-handed and inspect logs around the time it happens. But maybe you already know :)
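A cheap way to surface the slow fill-up before the pool stalls is to periodically report how many connections are checked out. A sketch, assuming the monitored pool exposes a `connections_in_use` count (asyncio-redis's `Pool` has such a property, but verify against the fork in use):

```python
# Hypothetical watchdog: periodically classify and log the pool's in-use
# connection count so a leak shows up in logs before the pool stalls.
import asyncio
import logging

log = logging.getLogger("redis-pool-watch")


def classify(in_use: int, limit: int) -> str:
    """Bucket the in-use count into a coarse health state."""
    if in_use >= limit:
        return "exhausted"
    if in_use > limit // 2:
        return "filling"
    return "ok"


async def watch_pool(pool, interval: float = 30.0, limit: int = 20):
    # pool.connections_in_use is an assumption about the monitored object.
    while True:
        in_use = pool.connections_in_use
        state = classify(in_use, limit)
        if state != "ok":
            log.warning("redis pool %s: %d/%d in use", state, in_use, limit)
        await asyncio.sleep(interval)
```

For the leak described in this issue, the telltale signature would be an in-use count that ratchets up after each connection error and never comes back down.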