Closed jwkoelewijn closed 2 years ago
We also had some dramatic issues in production when a redis connection is lost (we don't don't why but the socket was not in lsof
any more) lead to a faye process stuck with all requests timeouts, no error returned, no error logged, no raise etc.. The error handling is simply absent, fortunately we have good monitoring and managed to detect the issue and restart the processes.
We dug into it and tried your fix but had the same problem with it, and when we try locally with your fix it SEGFAULT... Anyway the event loop is started in a thread which swallows exceptions so the raise here doesn't stop the process and the event loop is restarted automatically. Why not hide every errors and try to recover but that only works if you do it well, here when the event loop is restarted the redis client stays in dead state and don't try to reconnect which block the process indefinitely without a single warning.
In the end we managed to improve the situation (no problem so good) by also handling the failed event on the main Redis connection @redis
(this seems to fix the segfault) and setting @redis
to nil
so when the raise stops the EM and it is restarted, the Redis client is recreated and reconnected properly. Our local test shows no more segfault and redis reconnects successfully.
Here is the code:
@subscriber.on(:failed) do
@server.error "Faye::Redis: redis connection failed"
@redis = nil
raise "Could not connect to redis"
end
@redis.on(:failed) do
@server.error "Faye::Redis: redis connection failed"
@redis = nil
raise "Could not connect to redis"
end
I'm not doing any PR because james don't merge them any more.
Nice one, seems more complete to me, plus, especially because it doesn't segfault anymore. Thanks!
At the moment, a failure to (re)connect to redis is not correctly exposed by this engine. Failure occurs silently, without any messages. By subscribing to the :failed event the PubSub-client, the Faye-Redis engine is notified when a connection failure is happening, and will raise an error, which is useful for users.
This commit came to life after a hunt for a non-functioning deployment in which ultimately a non-started Redis database was to blame. It would have been nice if an error would have been raised.