Closed. Aohzan closed this issue 1 month ago.
Hello @Aohzan
What is your Centrifugo configuration for Redis Sentinel? Do you use redis_sentinel_address? Which version of Centrifugo do you have? Which Redis version? Anything specific in your setup? Asking also to understand why Centrifugo could miss the master change.
- centrifugo 4.1.2-0
- redis-sentinel 6:7.2.4-1rl1~bullseye1 (upgraded from 5:6.0.16-1+deb11u2 before the sentinel failover)
Redis configuration part in Centrifugo:

```json
"engine": "redis",
"redis_sentinel_address": "localhost:26379",
"redis_sentinel_master_name": "mymaster",
```
There are 7 sentinels; the failover of the master was handled correctly on all of them, but just 1 of the 4 Centrifugo nodes didn't change its Redis server.
Thanks for the details. In your case, I'd concentrate on trying to fix the root cause – I suppose the first step here is upgrading to the latest Centrifugo version and trying to reproduce with it, since there have been many improvements in the underlying Redis library since v4.1.2.
I think adding a Redis connection check to the health endpoint may not fully solve the problem. It may just hide the problem until there are no Centrifugo nodes left. I suppose you have just HAProxy, without Kubernetes? In that case HAProxy will remove the failed node from balancing, but the issue on the node will persist until someone restarts the Centrifugo node. Is that right? Or does it work differently? Probably there is some automation to restart a failing node?
In Kubernetes, a liveness probe failure results in an app restart – in that case a Redis connection check could make more sense.
Yes, we will look into updating to the latest version.
Yes, no k8s. What we want is for HAProxy to exclude the problematic Centrifugo node from the pool as soon as it can't handle requests properly due to a Redis issue. That gives me time to debug and restart Centrifugo.
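For reference, here is a minimal sketch of how such an HAProxy health check could be wired up. The backend name, server addresses and check timings are made up for illustration; the /health path comes from this thread, and 8000 is Centrifugo's default HTTP port:

```
backend centrifugo
    # Take a server out of rotation when the HTTP health check fails.
    option httpchk GET /health
    http-check expect status 200
    server centrifugo1 10.0.0.1:8000 check inter 2s fall 3 rise 2
    server centrifugo2 10.0.0.2:8000 check inter 2s fall 3 rise 2
```

As noted above, this only removes the node from balancing: as long as /health does not reflect the Redis connection state, the check keeps passing even while Redis commands fail.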
Hello. We planned the upgrade, as we had the same issue on all nodes after a Redis failover. For the original request, do you have an opinion?
> We planned the upgrade, as we had the same issue on all nodes after a redis failover.
Probably you can experiment with the new version first regarding Sentinel failover, to make sure it solves the problem – I mean without upgrading the rest of the app. While the upgrade is recommended anyway, this may help to iterate on the issue faster.
> For the original request, do you have an opinion?
I think an optional Redis check could make sense in general, though I'd prefer it to be disabled by default – when dealing with many connections, removing a node from balancing may not be better than waiting for connection issues to go away. Looks like Centrifugo would need to issue some write request to Redis to make sure the connection is working and writable.
UPD: Though I'm not sure what to do in the Redis Cluster case, where many Redis shards must be checked. From this perspective I'd invest in proper failover, as I said before.
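To make the idea above concrete, here is a minimal sketch in Go (not Centrifugo's actual implementation) of a /health handler that issues a short write to Redis and reports failure when the command errors, for example with READONLY on a replica. The key name, timeout, port and go-redis usage are assumptions chosen for illustration:

```go
package main

import (
	"context"
	"net/http"
	"time"

	"github.com/redis/go-redis/v9"
)

// healthHandler returns a /health handler that performs a short write to
// Redis. A replica answering with a READONLY error, or a broken connection,
// makes the handler respond 503 so a load balancer can take the node out of
// rotation. This is only a sketch of the idea discussed in this thread.
func healthHandler(rdb *redis.Client) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), time.Second)
		defer cancel()

		// SET with a short TTL: fails with a READONLY error on a replica.
		if err := rdb.Set(ctx, "centrifugo:health:probe", "1", 5*time.Second).Err(); err != nil {
			http.Error(w, "redis check failed: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	// Address is a placeholder; in the Sentinel setup above the client would
	// be created via redis.NewFailoverClient instead.
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	http.HandleFunc("/health", healthHandler(rdb))
	http.ListenAndServe(":8000", nil)
}
```

For the Redis Cluster case mentioned in the update, a check like this would have to be fanned out to every shard, which is part of why investing in proper failover handling looks preferable.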
Hello, no issues since the Centrifugo upgrade :slightly_smiling_face:
Hello,
We had an issue on our platform: we perform an HAProxy check on /health on Centrifugo servers, and it was still OK while Centrifugo raised errors because of Redis connection issues:

```
READONLY You can't write against a read only replica
```

Centrifugo didn't change its Redis server when Sentinel announced the new primary; we had the issue on 1 of 4 servers.

**Describe the solution you'd like**

It would be great if the health check returned an error when there is any Redis connection error.
Thank you