Closed. Aohzan closed this issue 1 month ago.
Hello @Aohzan
What is your Centrifugo configuration for Redis Sentinel? Do you use redis_sentinel_address? Which version of Centrifugo do you have? Which Redis version? Anything specific in your setup? Asking also to understand why Centrifugo could miss the master change.
- centrifugo 4.1.2-0
- redis-sentinel 6:7.2.4-1rl1~bullseye1 (upgraded from 5:6.0.16-1+deb11u2 before the sentinel failover)
Redis configuration part in Centrifugo:

```json
"engine": "redis",
"redis_sentinel_address": "localhost:26379",
"redis_sentinel_master_name": "mymaster",
```
There are 7 sentinels; the failover of the master was handled correctly on all of them, but just 1 of the 4 Centrifugo nodes didn't change its Redis server.
Thanks for the details. In your case, I'd concentrate on trying to fix the root cause – I suppose the first step here is upgrading to the latest Centrifugo version and trying to reproduce with it, since there have been many improvements in the underlying Redis library since v4.1.2.
I think adding a Redis connection check to the health endpoint may not fully solve the problem. It may just hide the problem until there are no Centrifugo nodes left. I suppose you have just HAProxy, without Kubernetes? In that case HAProxy will remove the failed node from balancing, but the issue on the node will persist until someone restarts the Centrifugo node. Is that right? Or does it work differently? Probably there is some automation to restart a failing node?
In Kubernetes, a liveness probe failure results in an app restart – in that case a Redis connection check could make more sense.
Yes, we will look into updating to the latest version.
Yes, no k8s. What we want is for HAProxy to exclude the problematic Centrifugo node from the pool as soon as it can't handle requests properly due to a Redis issue. That gives me time to debug and restart Centrifugo.
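For reference, here is a minimal sketch of how such an HAProxy health check could be wired up. The backend name, server addresses and check timings are made up for illustration; the /health path comes from this thread, and 8000 is Centrifugo's default HTTP port:

```
backend centrifugo
    # Take a server out of rotation when the HTTP health check fails.
    option httpchk GET /health
    http-check expect status 200
    server centrifugo1 10.0.0.1:8000 check inter 2s fall 3 rise 2
    server centrifugo2 10.0.0.2:8000 check inter 2s fall 3 rise 2
```

As noted above, this only removes the node from balancing: as long as /health does not reflect the Redis connection state, the check keeps passing even while Redis commands fail.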
Hello. We planned the upgrade, as we had the same issue on all nodes after a Redis failover. For the original request, do you have an opinion?
> We planned the upgrade, as we had the same issue on all nodes after a redis failover.
Probably you can experiment with the new version first regarding Sentinel failover, to make sure it solves the problem – I mean without upgrading the rest of the app. While the upgrade is recommended anyway, this may help to iterate on the issue faster.
> For the original request, do you have an opinion?
I think an optional Redis check could make sense in general, though I'd prefer it to be disabled by default – when dealing with many connections, removing a node from balancing may not be better than waiting for connection issues to go away. Looks like Centrifugo would need to issue some write request to Redis to make sure the connection is working and writable.
UPD: Though I'm not sure what to do in the Redis Cluster case, where many Redis shards must be checked. From this perspective I'd invest in proper failover, as I said before.
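To make the idea above concrete, here is a minimal sketch in Go (not Centrifugo's actual implementation) of a /health handler that issues a short write to Redis and reports failure when the command errors, for example with READONLY on a replica. The key name, timeout, port and go-redis usage are assumptions chosen for illustration:

```go
package main

import (
	"context"
	"net/http"
	"time"

	"github.com/redis/go-redis/v9"
)

// healthHandler returns a /health handler that performs a short write to
// Redis. A replica answering with a READONLY error, or a broken connection,
// makes the handler respond 503 so a load balancer can take the node out of
// rotation. This is only a sketch of the idea discussed in this thread.
func healthHandler(rdb *redis.Client) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), time.Second)
		defer cancel()

		// SET with a short TTL: fails with a READONLY error on a replica.
		if err := rdb.Set(ctx, "centrifugo:health:probe", "1", 5*time.Second).Err(); err != nil {
			http.Error(w, "redis check failed: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	// Address is a placeholder; in the Sentinel setup above the client would
	// be created via redis.NewFailoverClient instead.
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	http.HandleFunc("/health", healthHandler(rdb))
	http.ListenAndServe(":8000", nil)
}
```

For the Redis Cluster case mentioned in the update, a check like this would have to be fanned out to every shard, which is part of why investing in proper failover handling looks preferable.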
Hello, no issues since the Centrifugo upgrade :slightly_smiling_face:
Hello,
We had an issue on our platform: we perform an HAProxy check on /health on Centrifugo servers, and it was still OK while Centrifugo raised errors because of Redis connection issues:

```
READONLY You can't write against a read only replica
```

Centrifugo didn't change its Redis server when Sentinel announced the new primary; we had the issue on 1 of 4 servers.

**Describe the solution you'd like**

It would be great if the health check returned an error when there is any Redis connection error.
Thank you