getsentry / sentry

Developer-first error tracking and performance monitoring
https://sentry.io

Redis connection leak after upgrading from 8.10 to 8.14 #5078

Closed - wridgers closed this issue 7 years ago

wridgers commented 7 years ago

Having upgraded from 8.10.0 to 8.14.1, I'm seeing Sentry crash when lots of events come in (>100k) - this wasn't a problem before. Upon investigation, there are lots of idle connections to Redis (visible via CLIENT LIST in redis-cli); it seems to hit a file descriptor limit and grind to a halt.

I'm running Sentry behind nginx, not in Docker.

mattrobenolt commented 7 years ago

Closed as a duplicate of https://github.com/getsentry/sentry/issues/5056

This is 100% a configuration issue with Redis and your network failing to correctly handle disconnecting clients. The Sentry server will open connections as needed, and Redis is expected to accept them and clean up old/idle connections. But somehow they're just not being detected as disconnected, so Redis is hanging onto them forever.

Set the client timeout in Redis to something like 3600 or higher just to make sure it cleans them up.
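For reference, that's the `timeout` directive in redis.conf (the value is in seconds; 0, the default, disables the idle timeout entirely):

```
# redis.conf - close client connections that have been idle for more than an hour
timeout 3600
```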

wridgers commented 7 years ago

@mattrobenolt A colleague reminded me I should mention that Redis and Sentry are running on the same box, so the network shouldn't be an issue here.

Nonetheless I'll bump the timeout to 3600 and see if that improves the situation.

Redis version is 3.0.7.

mattrobenolt commented 7 years ago

That seems... really odd then. I honestly don't have an answer if that's the case. Because you're right, if it's all 127.0.0.1 traffic, most of the Linux networking stack is bypassed and there shouldn't be an issue here.

Otherwise, the only real explanation is that your volume of traffic mandates more connections. Does the connection count forever increase? Even after process restarts? Or is it just flatlined really high?

wridgers commented 7 years ago

As far as I can tell at the moment, the number of connections just keeps going up. Eventually it hit a file descriptor limit (4096) and Sentry stopped responding - 502s from nginx. Restarting the worker/cron/uwsgi processes cleared the problem.
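(For anyone hitting the same thing, this is roughly how I've been checking where the descriptors go - a sketch that assumes a single redis-server process on the box:)

```bash
# soft file descriptor limit for the current shell/user
ulimit -n

# descriptors currently held by redis-server (assumes exactly one redis-server PID;
# the same check works for the uwsgi/worker PIDs on the Sentry side)
ls /proc/$(pidof redis-server)/fd | wc -l

# Redis' own count of connected clients
redis-cli INFO clients
```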

Earlier today I did CLIENT LIST in Redis and the majority of connections had been idle in excess of 10,000 seconds.
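In case it's useful, a rough one-liner to count them, based on the idle= field that CLIENT LIST prints for each connection (the one-hour threshold is arbitrary):

```bash
# count Redis client connections idle for more than an hour
redis-cli CLIENT LIST | awk -F'idle=' 'NF > 1 { split($2, a, " "); if (a[1] + 0 > 3600) n++ } END { print n + 0 }'
```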

mattrobenolt commented 7 years ago

If a restart of the processes fixed it, then I'm not sure what's going on. :)

But if you're seeing connections idle for 10k seconds, I'd just set the client timeout to something reasonable, like 3600 or 7200, just to keep things chill.
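If you'd rather not restart Redis, you should be able to flip it at runtime and confirm it took (it won't survive a restart unless it's also in redis.conf or you run CONFIG REWRITE):

```bash
# set the idle client timeout to an hour at runtime, then verify
redis-cli CONFIG SET timeout 3600
redis-cli CONFIG GET timeout
```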

To be clear, we obviously run this for sentry.io and we don't have any issues with connection limits or connections climbing, so I can only assume there's an issue on your end.

wridgers commented 7 years ago

Okay. I've set a timeout and will keep an eye on the situation. I'll update this thread if I find anything.

Thanks for your help.