RedisLabs / redis-cluster-proxy

A proxy for Redis clusters.
GNU Affero General Public License v3.0
993 stars 132 forks source link

Entry node is single point of failure #34

Open JanBerktold opened 4 years ago

JanBerktold commented 4 years ago

When the Redis entry node dies, the proxy breaks on the next updateCluster invocation: it can no longer talk to its entry node, so cluster->broken gets set and every new request is answered with an error. This disturbs consumer traffic when it isn't actually needed, since the cluster can still be fully healthy if the entry node happened to be a slave, or a master from which all slots had been migrated away.

I'd like to propose that, on a failed configuration fetch from its entry node, the proxy try all (or a subset?) of the other nodes it was aware of. This can take non-trivial time, but it seems preferable to fully bricking the proxy.

A possible implementation might be to save the ip+ports before resetting the cluster and to try them one by one until a valid configuration is fetched, here: https://github.com/artix75/redis-cluster-proxy/blob/unstable/src/cluster.c#L827

Happy to send a PR but would like to get feedback on the approach first.

Note: Might be similar to #8, maybe there's a solution which solves both cases?

artix75 commented 4 years ago

The best solution is that, after the first connection to the cluster, every node in the cluster should be considered an entry point, and in case of reconfiguration, all entry points should be tried until the first one succeeds.

Also, the ability to specify multiple entry points on launch could be an interesting feature, but, independently of that, every time the whole cluster's configuration is fetched, every node should be considered a potential entry point.

I will implement this feature in the coming weeks.

artix75 commented 4 years ago

@JanBerktold Try the latest unstable branch. It's now possible to specify multiple entry points, either as command-line arguments or inside the config file (in that case use the 'cluster' option or its alias 'entry-point'). Furthermore, after the proxy fetches the cluster nodes configuration, it will automatically use those nodes as entry points in the future (i.e. in case of an update after an ASK|MOVED reply).

Examples:

redis-cluster-proxy localhost:7000 localhost:7001 localhost:7002

Or using a config file:

redis-cluster-proxy -c /path/to/proxy.conf

# proxy.conf

entry-point localhost:7000
entry-point localhost:7001
entry-point localhost:7002

JanBerktold commented 4 years ago

Thanks @artix75, this is big! It certainly improves reliability by quite a bit, based on my short testing session. I do, however, still notice one issue: the logic appears to be "find one Redis node that I can open a socket to, then use that", which breaks when a Redis node is in an up-but-not-yet-ready state, such as a replica still in its initial startup phase:

Failed to retrieve cluster configuration.
Cluster node 10.128.23.169:28687 replied with error:
LOADING Redis is loading the dataset in memory

I'd suggest moving the entire fetching flow under the retry logic to protect against this case and any others like it. What do you think?

artix75 commented 4 years ago

@JanBerktold Yeah, I'm working on this