self hosted: EOF received in RedisProtocol after upgrading to release v0.49

chdsbd / kodiak

🔮 A bot to automatically update and merge GitHub PRs

https://kodiakhq.com

GNU Affero General Public License v3.0


self hosted: EOF received in RedisProtocol after upgrading to release v0.49 #788

Closed rdmulford closed 2 years ago

rdmulford commented 2 years ago

After updating our self-hosted Kodiak instance to release 0.49, we've been seeing unusual log messages like:

which seem to spike up every 15 minutes. The following graph shows log counts matching these errors:

We did not see these errors in version 0.48.

We haven't changed any settings or configs on our Redis instance, and this is what our timeout settings look like:

"timeout"
"0"
--
"repl-timeout"
"60"
--
"cluster-node-timeout"
"15000"

The app still seems to work (it's able to merge pull requests, etc.), but I'm cautious about deploying this to our prod instance until we understand what this error is. Any help understanding what is happening here would be greatly appreciated!

sbdchd commented 2 years ago

hmm, seems somewhat related to https://github.com/chdsbd/kodiak/issues/694

maybe an issue with the timeout set in the redis host?

rdmulford commented 2 years ago

Thanks @sbdchd. I did look at #694 in our initial investigations. The resolution there seemed to be to set the timeout to 0, which we confirmed is the timeout setting on our Redis instance.

Given that this started happening for us specifically between 0.48 and 0.49, we suspect it was introduced on the Kodiak side in one of these commits: https://github.com/chdsbd/kodiak/compare/v0.48.0...v0.49.0

chdsbd commented 2 years ago

@rdmulford Looking at the diff you linked to, I don't see any Redis-related changes.

EOF errors aren't something we've encountered on the hosted Kodiak GitHub App. I'm guessing the connections between your Kodiak container and Redis are getting dropped, either by Redis timeouts or by something in front of Redis.

rdmulford commented 2 years ago

Thanks @chdsbd. Digging in deeper, it looks like the issue is related to how the client connects to our load balancer for the Redis instance, as connecting directly to the Redis instance IP makes the issue go away.
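
For anyone debugging a similar setup, one way to compare the load balancer path with a direct connection is a small idle probe. The sketch below is not part of Kodiak; it uses only the Python standard library, and the `ping` and `probe_idle` helpers are hypothetical names made up for this illustration. It sends a Redis `PING`, leaves the connection idle, then sends a second `PING` on the same socket. It assumes the drops are caused by an idle timeout on something in front of Redis, which this thread does not confirm, and that the Redis instance needs no AUTH or TLS.

```python
import socket
import time
from typing import Tuple

# RESP (Redis wire protocol) encoding of the PING command.
PING = b"*1\r\n$4\r\nPING\r\n"


def ping(sock: socket.socket) -> str:
    """Send PING and classify the outcome as 'pong', 'eof', or 'error'."""
    try:
        sock.sendall(PING)
        reply = sock.recv(64)
    except OSError:
        # Covers resets and timeouts, e.g. an intermediary dropping the link.
        return "error"
    if reply == b"":
        # An empty read means the peer closed the connection (EOF).
        return "eof"
    return "pong" if reply.startswith(b"+PONG") else "error"


def probe_idle(host: str, port: int, idle_seconds: float) -> Tuple[str, str]:
    """PING, stay idle for idle_seconds, then PING again on the same socket."""
    with socket.create_connection((host, port), timeout=5) as sock:
        first = ping(sock)
        time.sleep(idle_seconds)
        second = ping(sock)
    return first, second
```

Running `probe_idle` once against the load balancer address and once against the Redis instance IP, with an idle period longer than the suspected timeout, should show whether only the load balancer path ends in `"eof"` or `"error"` on the second `PING`.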