chdsbd / kodiak

🔮 A bot to automatically update and merge GitHub PRs
https://kodiakhq.com
GNU Affero General Public License v3.0

self-hosted: redis connection pool fills up after a while and bot enters an unhealthy state #712

Open wuub opened 3 years ago

wuub commented 3 years ago

I don't have any specific reproduction steps yet, but it seems there is a code path in which connections from the redis connection pool are not closed properly (left in use).

This causes the pool to slowly fill up and ultimately all progress stalls here: https://github.com/jonathanslenders/asyncio-redis/blob/master/asyncio_redis/pool.py#L140

I'm adding some metrics right now to catch this red-handed and inspect logs around the time it happens. But maybe you already know :)
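Something along these lines, assuming I can get a handle on the asyncio_redis Pool object kodiak creates (the helper name and the logging-only approach are just for illustration):

import asyncio
import logging

logger = logging.getLogger(__name__)

async def report_redis_pool_usage(pool, interval: float = 30.0) -> None:
    # asyncio_redis.Pool exposes poolsize / connections_connected / connections_in_use;
    # `connections_in_use` creeping up towards `poolsize` and never coming back down
    # is the leak showing itself.
    while True:
        logger.info(
            "redis pool: poolsize=%s connected=%s in_use=%s",
            pool.poolsize,
            pool.connections_connected,
            pool.connections_in_use,
        )
        await asyncio.sleep(interval)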

wuub commented 3 years ago

OK, so now I'm pretty sure I can recreate the issue:

  1. docker-compose up the bot with redis running alongside it
  2. docker pause bot_redis_1
  3. trigger some webhooks; it's fine if they are all for a single PR, just trigger a bunch of them, maybe 10
  4. they will hang, obviously
  5. let it sit like this for a while, I tried 1-2 minutes (not sure if necessary)
  6. docker stop bot_redis_1
  7. you'll get a bunch of tracebacks:
    kodiak_1  |     result = yield from answer_f
    kodiak_1  | asyncio_redis.exceptions.ConnectionLostError: [Errno 104] Connection reset by peer
  8. docker start bot_redis_1
  9. kodiak itself will recover, BUT the in-use connection count in the redis pool will stay above 1, often at the number of webhooks that were blocked by docker pause. That is not what should happen, as those connections now occupy the pool indefinitely.

Rinse and repeat the above failure scenario and you have yourself a redis pool that is a) doing nothing and b) unable to perform any more work.
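If it helps, the same failure can be sketched without the bot itself; the script below is a hypothetical harness (not kodiak code) that mimics the MULTI usage from kodiak/queue.py's enqueue() while you pause/stop/start Redis as above:

import asyncio
import asyncio_redis

async def fake_webhook(pool, i: int) -> None:
    # Roughly what enqueue() in kodiak/queue.py does: open a MULTI per event.
    try:
        transaction = await pool.multi()
        await transaction.zadd("webhook-queue", {"event-%d" % i: float(i)})
        await transaction.exec()
    except asyncio_redis.exceptions.ConnectionLostError:
        # This is what fires after `docker stop bot_redis_1`.
        pass

async def main() -> None:
    # Redis must be up when the pool is created; `docker pause bot_redis_1` right after.
    pool = await asyncio_redis.Pool.create(host="localhost", port=6379, poolsize=20)
    await asyncio.sleep(15)  # window to pause Redis
    await asyncio.gather(*(fake_webhook(pool, i) for i in range(10)), return_exceptions=True)
    # gather returns once `docker stop` resets the hung connections;
    # now `docker start` Redis and give auto-reconnect a moment.
    await asyncio.sleep(30)
    # The sockets reconnect, but connections caught mid-transaction stay marked in use:
    print("connected:", pool.connections_connected, "in use:", pool.connections_in_use)

asyncio.run(main())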

wuub commented 3 years ago

Full TB for easier debugging:

kodiak_1  | ERROR uvicorn:httptools_impl.py:371 Exception in ASGI application
kodiak_1  | Traceback (most recent call last):
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/uvicorn/protocols/http/httptools_impl.py", line 368, in run_asgi
kodiak_1  |     result = await app(self.scope, self.receive, self.send)
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/fastapi/applications.py", line 208, in __call__
kodiak_1  |     await super().__call__(scope, receive, send)
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/starlette/applications.py", line 112, in __call__
kodiak_1  |     await self.middleware_stack(scope, receive, send)
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/starlette/middleware/errors.py", line 181, in __call__
kodiak_1  |     raise exc from None
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/starlette/middleware/errors.py", line 159, in __call__
kodiak_1  |     await self.app(scope, receive, _send)
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/sentry_sdk/integrations/asgi.py", line 106, in _run_asgi3
kodiak_1  |     return await self._run_app(scope, lambda: self.app(scope, receive, send))
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/sentry_sdk/integrations/asgi.py", line 152, in _run_app
kodiak_1  |     raise exc from None
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/sentry_sdk/integrations/asgi.py", line 149, in _run_app
kodiak_1  |     return await callback()
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/starlette/exceptions.py", line 82, in __call__
kodiak_1  |     raise exc from None
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/starlette/exceptions.py", line 71, in __call__
kodiak_1  |     await self.app(scope, receive, sender)
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/starlette/routing.py", line 580, in __call__
kodiak_1  |     await route.handle(scope, receive, send)
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/starlette/routing.py", line 241, in handle
kodiak_1  |     await self.app(scope, receive, send)
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/starlette/routing.py", line 52, in app
kodiak_1  |     response = await func(request)
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/fastapi/routing.py", line 220, in app
kodiak_1  |     dependant=dependant, values=values, is_coroutine=is_coroutine
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/fastapi/routing.py", line 152, in run_endpoint_function
kodiak_1  |     return await dependant.call(**values)
kodiak_1  |   File "/var/app/kodiak/main.py", line 101, in webhook_event
kodiak_1  |     queue=redis_webhook_queue, event_name=github_event, payload=event
kodiak_1  |   File "/var/app/kodiak/queue.py", line 291, in handle_webhook_event
kodiak_1  |     await pr_review(queue, PullRequestReviewThreadEvent.parse_obj(payload))
kodiak_1  |   File "/var/app/kodiak/queue.py", line 177, in pr_review
kodiak_1  |     installation_id=str(review.installation.id),
kodiak_1  |   File "/var/app/kodiak/queue.py", line 542, in enqueue
kodiak_1  |     transaction = await self.connection.multi()
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/asyncio_redis/protocol.py", line 733, in wrapper
kodiak_1  |     result = yield from method(protocol_self, _NoTransaction, *a, **kw)
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/asyncio_redis/protocol.py", line 2824, in multi
kodiak_1  |     result = yield from self._query(tr, b"multi", _bypass=True)
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/asyncio_redis/protocol.py", line 1231, in _query
kodiak_1  |     transaction, answer_f, _bypass=_bypass, call=call
kodiak_1  |   File "/var/app/.venv/lib/python3.7/site-packages/asyncio_redis/protocol.py", line 1155, in _get_answer
kodiak_1  |     result = yield from answer_f
kodiak_1  | asyncio_redis.exceptions.ConnectionLostError: [Errno 104] Connection reset by peer

chdsbd commented 3 years ago

For the hosted version of Kodiak we have a large connection pool and do nightly restarts, which I think is why this bug hasn't been an issue for us.

Is your Redis instance configured to time out connections after some time? That caused a similar issue: https://github.com/chdsbd/kodiak/issues/694#issuecomment-893491026
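If you want to check quickly, something like this should show the server-side idle timeout (a sketch assuming asyncio_redis's config_get reply exposes .parameter/.value; managed services like ElastiCache typically block CONFIG, in which case the value lives in the parameter group instead):

import asyncio
import asyncio_redis

async def main() -> None:
    connection = await asyncio_redis.Connection.create(host="localhost", port=6379)
    # `timeout` is the number of seconds Redis waits before closing idle clients;
    # 0 means "never close", which is the Redis default.
    reply = await connection.config_get("timeout")
    print(reply.parameter, reply.value)
    connection.close()

asyncio.run(main())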

I'm not exactly sure how this issue should be fixed, but an update needs to be made to the forked redis library: https://github.com/chdsbd/asyncio-redis

This issue seems related: https://github.com/jonathanslenders/asyncio-redis/issues/109

chdsbd commented 3 years ago

I think any time a connection is used in asyncio-redis, the connection should be wrapped with a context manager that will return the connection to the pool if there is an exception. I think that would prevent leaking connections.

I think the existing pool implementation needs to be improved, because I don't see an obvious place to add this new logic: https://github.com/chdsbd/asyncio-redis/blob/master/asyncio_redis/pool.py

I think we probably want something similar to the asyncpg pool API: https://magicstack.github.io/asyncpg/current/usage.html#connection-pools


pool = await Redis.create_pool()

async with pool.acquire() as connection:
    user = await connection.get('user:123')
    print(user)
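
A rough sketch of what that could look like as a thin wrapper over plain asyncio_redis connections (everything here is hypothetical, not the current library API):

import asyncio
from contextlib import asynccontextmanager

import asyncio_redis

class AcquiringPool:
    # Hypothetical pool: hands out whole connections and always puts them back,
    # even when the block using the connection raises.

    def __init__(self, connections):
        self._free = asyncio.Queue()
        for conn in connections:
            self._free.put_nowait(conn)

    @classmethod
    async def create(cls, host="localhost", port=6379, poolsize=10):
        conns = [
            await asyncio_redis.Connection.create(host=host, port=port)
            for _ in range(poolsize)
        ]
        return cls(conns)

    @asynccontextmanager
    async def acquire(self):
        conn = await self._free.get()
        try:
            yield conn
        finally:
            # Returned unconditionally -- this is the guarantee the current
            # Pool implementation loses when a command dies mid-transaction.
            self._free.put_nowait(conn)

Handing back a connection whose socket died is its own problem, so a real version would probably also health-check or replace connections, but the unconditional finally is what keeps the pool from leaking.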

wuub commented 3 years ago

Is your Redis instance configured to time out connections after some time?

It's AWS ElastiCache Redis 5.x on default settings. IIUC there's no enforced timeout by default there, but OTOH it's AWS; we don't need a configured timeout for the network to blip from time to time on its own :)

we have a large connection pool

I also bumped the pool size to 200 from the default 20 during the previous outage; I then wrongly assumed the problem was due to traffic spikes. This also explains why kodiak slows down its PR/webhook processing loop when nearing the outage and picks up again right after a restart -> it's waiting for a free redis connection.

re: fix for pool

Yeah, I assumed so when I didn't find any proper exception handling in asyncio_redis. I'll do my own investigation as well; maybe there's a cheap way to put a bandage on it until a proper fix is ready.
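The cheapest bandage I can think of (purely a sketch, not anything kodiak has today) is a watchdog task that exits the process when the pool looks permanently exhausted, and lets the container restart policy bring the bot back with a fresh pool:

import asyncio
import logging
import os

logger = logging.getLogger(__name__)

async def redis_pool_watchdog(pool, threshold=0.9, grace_checks=5, interval=60.0) -> None:
    # Bail out if `connections_in_use` sits above `threshold * poolsize`
    # for `grace_checks` consecutive checks.
    consecutive = 0
    while True:
        await asyncio.sleep(interval)
        if pool.connections_in_use >= threshold * pool.poolsize:
            consecutive += 1
        else:
            consecutive = 0
        if consecutive >= grace_checks:
            logger.error(
                "redis pool looks exhausted (%s/%s in use), exiting so the container restarts",
                pool.connections_in_use,
                pool.poolsize,
            )
            os._exit(1)  # rely on docker/k8s restart policy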