Grokzen / redis-py-cluster

Python cluster client for the official Redis Cluster. Supports Redis 3.0+.
https://redis-py-cluster.readthedocs.io/
MIT License

TTL timeout during planned failover / ElastiCache maintenance #418

Closed mschfh closed 2 years ago

mschfh commented 3 years ago

The fix in #385 was incomplete: the cluster might not be reinitialized if reinitialize_counter has already been incremented past the modulus check: https://github.com/Grokzen/redis-py-cluster/blob/f0aaaa4e539bc62e38ce6e23839a12ff192cb7ea/rediscluster/client.py#L661-L664 https://github.com/Grokzen/redis-py-cluster/blob/f0aaaa4e539bc62e38ce6e23839a12ff192cb7ea/rediscluster/nodemanager.py#L348-L352
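For context, the pattern at the linked lines is roughly the following (a paraphrased sketch, not the verbatim source; the threshold value is illustrative). If concurrent MOVED handling pushes the counter past an exact multiple of the step size, the modulus check never evaluates to 0 and the reinitialization is silently skipped:

```python
# Paraphrased sketch of the linked counter logic, not the verbatim source.
counter = 0
steps = 25  # illustrative threshold; the real value comes from reinitialize_steps

def initialize():
    """Stands in for NodeManager.initialize(), which resets the counter to 0."""
    global counter
    counter = 0

def on_moved_error():
    global counter
    counter += 1                # read-modify-write: not atomic across threads
    if counter % steps == 0:    # skipped entirely if concurrent increments
        initialize()            # pushed the counter past a multiple of `steps`
```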

During a planned ElastiCache failover, the old primary redirects requests to the new one by sending MOVED responses (each of which increments reinitialize_counter) before it is terminated: https://github.com/Grokzen/redis-py-cluster/blob/f0aaaa4e539bc62e38ce6e23839a12ff192cb7ea/rediscluster/client.py#L689

An alternative approach to the counter would be to reinitialize on all MOVED responses (as discussed before), as those always affect a larger number of slots (at least with ElastiCache). This approach is also suggested in the Redis documentation:

"When a redirection is encountered, it is likely multiple slots were reconfigured rather than just one, so updating the client configuration as soon as possible is often the best strategy" (https://redis.io/topics/cluster-spec)

Grokzen commented 3 years ago

So if I understand the problem right, the error comes from the counter running past the modulus operation and skipping the step where it hits 0? If I don't misremember, we used to reinitialize the cluster on every MOVED error we got back, and that is why we added the counter: to give it some slack when we might get more than one MOVED error, or when it was just some random fluke or accident. Also, wouldn't it be the same if you just set self.reinitialize_steps=1? It would then reinitialize on every MOVED error anyway.
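For reference, that workaround is just a constructor argument (a usage sketch; the endpoint is a placeholder):

```python
from rediscluster import RedisCluster

# Placeholder endpoint. With reinitialize_steps=1, the slot cache is rebuilt
# after every MOVED response instead of only every N of them.
rc = RedisCluster(
    startup_nodes=[{"host": "127.0.0.1", "port": "7000"}],
    decode_responses=True,
    reinitialize_steps=1,
)
```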

mschfh commented 3 years ago

Exactly, it would be reset to 0 during (re)initialize, which is not triggered because the counter runs past the modulus. Any reason not to check for >= and reset it in increment_reinitialize_counter?
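The suggested change would look roughly like this (a sketch of the proposal against a stand-in class, not a tested patch):

```python
class NodeManagerSketch:
    """Sketch of the proposed change, not a patch against the real NodeManager."""

    def __init__(self, reinitialize_steps=25):
        self.reinitialize_steps = reinitialize_steps
        self.reinitialize_counter = 0

    def initialize(self):
        """Placeholder for the real slot-cache rebuild."""

    def increment_reinitialize_counter(self, count=1):
        self.reinitialize_counter += count
        # >= instead of a modulus check: an overshoot (e.g. 24 -> 26) can no
        # longer skip the trigger, and the explicit reset no longer relies on
        # initialize() zeroing the counter.
        if self.reinitialize_counter >= self.reinitialize_steps:
            self.reinitialize_counter = 0
            self.initialize()
```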

Setting reinitialize_steps would work, although it would be better to consider changing the defaults, as this seems to be a common issue, especially with ElastiCache.

Grokzen commented 3 years ago

Hmm, when implementing it, the idea was that if you got into a case that triggered a lot of reinitializations, you would either hit the threshold quickly, or you would overshoot and hit it again very soon after, so it would recover eventually.

I am sure that solution can be explored: drop the modulus and reset the counter elsewhere once it goes above the limit. The main concern, I guess, would be thread safety, if the reinitialize were triggered from multiple threads at once and they all tried to update the counter and overwrote each other. But I can check that out as well.

Since I am not super familiar with ElastiCache and how it operates, what do I need to know if I want to experiment with this and try to simulate the same conditions you found the error in? Is a planned failover a case where you shut down a node yourself, or is it a feature AWS offers users to perform some kind of failover and upgrade on demand?

mschfh commented 3 years ago

The current implementation does not seem thread-safe either, if the counter is incremented fast enough.

What about setting a flag while a reinitialize is in progress and skipping initialize() in that case (there might be some edge cases around exceptions raised within initialize())?
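A minimal sketch of that flag idea (a hypothetical helper, not part of the library): a non-blocking lock acts as the in-progress flag, and the try/finally covers the exceptions-inside-initialize() edge case mentioned above.

```python
import threading

class ReinitGuard:
    """Hypothetical wrapper: at most one thread runs initialize() at a time."""

    def __init__(self, initialize):
        self._initialize = initialize
        self._in_progress = threading.Lock()

    def try_reinitialize(self):
        # acquire(blocking=False) acts as the "reinitialize in progress" flag:
        # callers that lose the race simply skip instead of piling up.
        if not self._in_progress.acquire(blocking=False):
            return False
        try:
            self._initialize()
        finally:
            # releasing in finally covers exceptions raised inside initialize()
            self._in_progress.release()
        return True
```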

ElastiCache will occasionally replace nodes during a configurable maintenance window ("These replacements are needed to apply mandatory software updates to your underlying host"). AWS declined to provide details on the maintenance failover process, but it looks similar to CLUSTER FAILOVER.

The best way to reproduce this is via Vagrant or Docker; the node termination should be simulated with iptables (just shutting the container down could result in refused connections instead of timeouts) or a tool like pumba (set 100% packet loss).

guytubul86 commented 3 years ago

Hi, any update on this issue? I am also exposed to the same issue during the AWS maintenance window.

jerroydmoore commented 3 years ago

With the latest version of redis-py-cluster (2.1.3), I would recommend setting reinitialize_steps=1 and adding retry logic to recreate the rediscluster client upon ClusterDownError or BusyLoadingError (a sketch follows below).
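A rough sketch of that retry logic (a hypothetical helper; the endpoint is a placeholder, and the exception import paths are my assumption about where these classes live):

```python
import time

from rediscluster import RedisCluster
from rediscluster.exceptions import ClusterDownError
from redis.exceptions import BusyLoadingError

STARTUP_NODES = [{"host": "127.0.0.1", "port": "7000"}]  # placeholder endpoint

def make_client():
    return RedisCluster(startup_nodes=STARTUP_NODES, reinitialize_steps=1)

def run_with_retry(command, *args, retries=5, delay=1.0):
    """Hypothetical wrapper: recreate the client when the cluster is mid-failover."""
    client = make_client()
    for attempt in range(retries):
        try:
            return client.execute_command(command, *args)
        except (ClusterDownError, BusyLoadingError):
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            client = make_client()  # rebuild so the slot cache is re-examined
```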

In my experiments with ElastiCache failover, the Redis cluster is unavailable for about 1 minute (throwing MovedError, ConnectionError, and ClusterDownError). With the default reinitialize_steps and TTL values, I also get BusyLoadingError; I suspect the new node may be assigned the same IP address while the client has not yet met the conditions to re-examine the cluster layout.

mschfh commented 2 years ago

This was fixed in redis-py: after 5 retry attempts (with a 250ms delay each), reinitialization is forced: https://github.com/redis/redis-py/blob/bea72995fd39b01e2f0a1682b16b6c7690933f36/redis/cluster.py#L1119-L1138
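In outline, the linked redis-py behaviour is roughly the following (a paraphrased sketch of the retry-then-reinitialize pattern described above, not the verbatim source):

```python
import time

# Paraphrased outline: retry the command a fixed number of times, sleeping
# 250ms between attempts, and force a full reinitialization once the
# attempts are exhausted.
def execute_with_forced_reinit(execute, reinitialize, attempts=5, delay=0.25):
    for attempt in range(attempts):
        try:
            return execute()
        except ConnectionError:
            if attempt == attempts - 1:
                reinitialize()  # stands in for NodesManager.initialize()
                raise
            time.sleep(delay)
```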