chdsbd / kodiak

🔮 A bot to automatically update and merge GitHub PRs
https://kodiakhq.com
GNU Affero General Public License v3.0
1.03k stars 65 forks source link

Unable to connect to Redis due to EOF #694

Closed zetaron closed 2 years ago

zetaron commented 2 years ago

When configuring a self-hosted kodiak bot to connect to a Redis Instance managed by DigitalOcean it fails with the following error message on repeat:

INFO asyncio_redis:connection.py:116 Connecting to redis
INFO asyncio_redis:protocol.py:897 Redis connection made
INFO asyncio_redis:protocol.py:955 EOF received in RedisProtocol
INFO asyncio_redis:protocol.py:979 Redis connection lost
ERROR asyncio:main.py:306 Task exception was never retrieved
future: <Task finished coro= exception=ConnectionLostError(None)>
Traceback (most recent call last):
  File "/var/app/.venv/src/asyncio-redis/asyncio_redis/protocol.py", line 911, in initialize
    yield from self.auth(self.password)
  File "/var/app/.venv/src/asyncio-redis/asyncio_redis/protocol.py", line 733, in wrapper
    result = yield from method(protocol_self, _NoTransaction, *a, **kw)
  File "/var/app/.venv/src/asyncio-redis/asyncio_redis/protocol.py", line 1231, in _query
    transaction, answer_f, _bypass=_bypass, call=call
  File "/var/app/.venv/src/asyncio-redis/asyncio_redis/protocol.py", line 1155, in _get_answer
    result = yield from answer_f
asyncio_redis.exceptions.ConnectionLostError: None

From the DigitalOcean FAQ I take that the EOF error would usualy happen when the connecting client does not support TLS.

Is there a way to use TLS or would that require switching to the official redis library? If switching would be required, how hard would that be? Would you be open to PRs?


How valuable is the data thats stored inside the redis? Would it be bad to loose it?

We are hosting on Kubernetes and I could deploy redis in memory ... but that would be subject to potential rescheduling and movement accross cluster nodes.

chdsbd commented 2 years ago

Hi @zetaron,

Thanks for opening this issue.

Kodiak uses a slightly modified fork of asyncio-redis, which doesn't have TLS support built in.

I think it's possible to update the library to support TLS. It uses the asyncio loop.create_connection, which supports TLS.

https://github.com/chdsbd/asyncio-redis/blob/7d174e0ebe2dc52340001097eeee48a868454480/asyncio_redis/connection.py#L118-L120

There are other asyncio Redis clients around, but I don't know how hard they would be to integrate. I'm open to PRs that add TLS support.

One alternative is to use stunnel to create a connection using TLS between hosts without needed to update the Redis client.

Another option is using something like Digital Ocean's private networking feature to isolate traffic.

Kodiak uses Redis to store merge queue positions. So if your Redis instance were to disappear, Kodiak would lose track of any PR waiting to be merged.

zetaron commented 2 years ago

Thanks for your quick reply :)

I'll attempt to update your fork of the asyncio_redis library to support TLS, huge thanks for the pointers.

DigitalOceans private networking still (from what I see) requires TLS to connect.

zetaron commented 2 years ago

I've created two PRs to implement TLS support:

I did test the change locally using docker-compose against my DigitalOcean managed Redis and the above error did not happen again. Since I'm not too familiar with python and this project in general I was unable to add unit-tests. If that is something you want please feel free to modify my PRs :)

chdsbd commented 2 years ago

Thanks for the PRs!

You should be all set with Redis and TLS using the latest commit: f487e799a74ae074ee16c9106c9750925d329be4

zetaron commented 2 years ago

Great :D thanks for the quick merge!

Would you mind also triggering a release for the docker image?

sbdchd commented 2 years ago

CI automatically builds and releases a docker image with the latest commit SHA, so if you nav to https://hub.docker.com/r/cdignam/kodiak/tags, the most recently released version is cdignam/kodiak:f487e799a74ae074ee16c9106c9750925d329be4

zetaron commented 2 years ago

Ah that slipped me when I checked.

zetaron commented 2 years ago

Hi, again :)

I've deployed the new Image and after some time observed the EOF coming back but seemingly recovering. Maybe you know of the top of your head how many locations there are where a redis connection gets created/recreated? Or if the recreation of the connection is expected after it got used for handling an event? I'm currently using a Pool Size of 10.

logs-from-server-in-kodiak-hlztb-deployment-6477778f9b-vwvx7.txt

sbdchd commented 2 years ago

I believe the EOF method on the asyncio.Protocol is called when the server sends an EOF

https://github.com/chdsbd/asyncio-redis/blob/388a6ac390ae70ee16bdc45cf19b308d11212566/asyncio_redis/protocol.py#L954-L956

https://docs.python.org/3.9/library/asyncio-protocol.html#asyncio.Protocol.eof_received

So my first thought is Redis is killing the connections after a while and the python client is recreating the connection when it dies

Do you have timestamps to determine how long after the initial pool creation the connections die? Might shed some light on if it's timeout related

zetaron commented 2 years ago

I’ll see to get some, in the UI there are timestamps but the downloaded file seems to miss them.

Oddly enough from what I observed it was related to events being handled as it only happened after the first events started coming in no matter how quick or long that took. But I’ll try to solidify that with some more log data maybe a longer timeframe to be able to spot recurrence.

zetaron commented 2 years ago

After following your hint on timeouts and some research on redis timeouts in relation to DigitalOcean Managed Redis I found this Question on how to set it to 0 (disabled)

The default timeout for the managed Redis seems to be: 300

I've gathered some more logs, but apart from an overwhelming amount of the following lines occurring every 2-3 seconds apart, which now makes total sense since the default connection timeout is 300ms.


I'll try and have DigitalOcean that set to 0 on my kodiak instance by opening a Ticket with them.

Sorry for the comotion but maybe this helps someone else stumbling into similar issues :)

zetaron commented 2 years ago

After DigitalOcean set the timeout to 0 the issue got resolved :)