Closed: mschfh closed this issue 2 years ago
So if I understand the problem right, the error comes from the fact that we might run past the modulus operation and skip the step where the counter goes to 0? If I remember correctly, we used to have the cluster reinitialize on every MOVED error we got back, and that is why we added the counter: to give it some slack when we might get more than one MOVED error, or when it was just some random fluke or accident. Also, would it not be the same if you just set self.reinitialize_steps=1, so it would reinitialize on every MOVED error anyway?
Exactly, it would be reset to 0 during (re)initialize, which is not triggered because the counter runs past the modulus.
Any reason not to check for >= and reset it in increment_reinitialize_counter?
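The difference between the two checks can be shown with a minimal sketch (hypothetical, simplified names, not the actual redis-py-cluster code). With a `% == 0` check, two racing increments can jump the counter past the multiple so the trigger never fires; a `>=` check with an immediate reset cannot be skipped:

```python
# Hypothetical, simplified model of the reinitialize counter discussed
# above (not the actual library code).

class Counter:
    def __init__(self, steps=5):
        self.reinitialize_steps = steps
        self.reinitialize_counter = 0

    def increment_modulo(self):
        # current behavior: trigger only when the counter is an exact multiple
        self.reinitialize_counter += 1
        return self.reinitialize_counter % self.reinitialize_steps == 0

    def increment_ge(self):
        # suggested fix: trigger on >= and reset here, not in initialize()
        self.reinitialize_counter += 1
        if self.reinitialize_counter >= self.reinitialize_steps:
            self.reinitialize_counter = 0
            return True
        return False

c = Counter(steps=5)
c.reinitialize_counter = 4
c.reinitialize_counter += 1          # a second thread's increment sneaks in
assert c.increment_modulo() is False # counter is 6; 6 % 5 != 0, trigger missed

c2 = Counter(steps=5)
c2.reinitialize_counter = 4
c2.reinitialize_counter += 1         # same race
assert c2.increment_ge() is True     # 6 >= 5, trigger fires and resets
assert c2.reinitialize_counter == 0
```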
Setting reinitialize_steps would work, although it would be better to consider changing the defaults, as this seems to be a common issue with ElastiCache especially.
Hmm, when implementing it, the idea was that if you got into a case that triggered a lot of reinitializations, you would either hit the threshold quickly, or overshoot and hit it again very soon after, so it would recover eventually.
I am sure that solution can be explored: drop the modulus and reset the counter elsewhere once it goes above the limit. The main concern, I guess, would be thread safety, if the reinitialize were triggered from multiple threads at once and they all tried to update the counter and overwrote each other. But I can check that out as well.
Since I am not super familiar with ElastiCache and how it operates, what do I need to know if I want to experiment with that feature and try to simulate the same conditions you found the error in? Is a planned failover a case where you shut down a node yourself, or is it something AWS supports us users with, to perform some kind of failover or upgrade on demand?
The current implementation does not seem threadsafe either if the counter is increasing fast enough. What about setting a flag that a reinitialize is in progress and skipping the initialize() in this case (there might be some edge cases regarding exceptions within initialize())?
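A minimal sketch of that flag idea (names are illustrative, not the actual library code): a non-blocking lock lets concurrent MOVED handlers skip initialize() while another thread is already rebuilding the slot cache, and try/finally clears the flag even if initialize() raises, covering the exception edge case mentioned above:

```python
import threading

class NodeManager:
    def __init__(self):
        # acts as the "reinitialize in progress" flag
        self._reinit_in_progress = threading.Lock()
        self.initialize_calls = 0

    def initialize(self):
        self.initialize_calls += 1   # placeholder for the slot-cache rebuild

    def reinitialize(self):
        # acquire(blocking=False) returns False immediately if another
        # thread holds the lock, so only one initialize() runs at a time
        if not self._reinit_in_progress.acquire(blocking=False):
            return False
        try:
            self.initialize()
            return True
        finally:
            # releasing in finally clears the flag even on exceptions
            self._reinit_in_progress.release()

nm = NodeManager()
assert nm.reinitialize() is True
assert nm.initialize_calls == 1
```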
ElastiCache will occasionally replace nodes during a configurable maintenance window ("These replacements are needed to apply mandatory software updates to your underlying host"). They refused to provide information on the maintenance failover process, but it looks similar to CLUSTER FAILOVER: the old primary returns MOVED responses until it is shut down (resulting in connection closed -> connection refused -> TCP timeouts, starting after ~1 minute).

The best way to reproduce this is via Vagrant or Docker. The node termination should be simulated with iptables (just shutting the container down could result in refused connections instead of timeouts) or a tool like pumba (set 100% packet loss).
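For example, the simulation could look something like this (illustrative commands; the container name "redis-node-1" and port 6379 are assumptions to adjust for your setup):

```shell
# Drop traffic instead of stopping the container, so clients observe
# timeouts rather than connection-refused:
docker exec redis-node-1 iptables -A INPUT -p tcp --dport 6379 -j DROP

# Or inject 100% packet loss for 2 minutes with pumba:
pumba netem --duration 2m loss --percent 100 redis-node-1
```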
Hi, any update on this issue? I was also exposed to the same issue during an AWS maintenance window.
With the latest version of redis-py-cluster (2.1.3), I would recommend setting reinitialize_steps=1, plus retry logic to recreate the rediscluster client upon ClusterDownError or BusyLoadingError.
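The recreate-on-error retry can be sketched generically. In practice `make_client` would construct `rediscluster.RedisCluster(..., reinitialize_steps=1)` and `retryable` would be `(ClusterDownError, BusyLoadingError)`; stand-in names are used here so the sketch stays self-contained:

```python
import time

def call_with_client_recreate(make_client, op, retryable, attempts=3, delay=0.25):
    """Run op(client); on a retryable error, throw the client away and
    build a fresh one (forcing a new slot-cache lookup). Names and
    defaults are illustrative, not taken from the library."""
    client = make_client()
    for attempt in range(attempts):
        try:
            return op(client)
        except retryable:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
            client = make_client()

# Stand-in for rediscluster's ClusterDownError:
class FakeClusterDown(Exception):
    pass

calls = {"n": 0}

def flaky_get(client):
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeClusterDown("cluster is down")
    return "value"

assert call_with_client_recreate(lambda: object(), flaky_get,
                                 (FakeClusterDown,), delay=0) == "value"
assert calls["n"] == 3
```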
In my experiments with ElastiCache failover, the redis cluster will be unavailable for about 1 minute (throwing MovedError, ConnectionError, and ClusterDownError). With the default reinitialize_steps and TTL values, I will also get BusyLoadingError, as I suspect the new node may be assigned the same IP address while the client has not yet reached the conditions to re-examine the cluster layout.
This was fixed in redis-py: after 5 retry attempts (with a 250 ms delay each), reinitialization is forced: https://github.com/redis/redis-py/blob/bea72995fd39b01e2f0a1682b16b6c7690933f36/redis/cluster.py#L1119-L1138
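The linked behavior can be modeled roughly as follows (a simplified sketch, not the actual redis-py code; `MovedError` here is a stand-in class):

```python
import time

class MovedError(Exception):
    """Stand-in for redis-py's MOVED redirection error."""

def execute_with_forced_reinit(execute, reinitialize, attempts=5, delay=0.25):
    # Simplified model: retry a command that keeps hitting MOVED with a
    # fixed delay; once the attempts are exhausted, force a full cluster
    # reinitialization instead of waiting for a counter to hit a multiple.
    for _ in range(attempts):
        try:
            return execute()
        except MovedError:
            time.sleep(delay)
    reinitialize()
    raise MovedError("forced reinitialization after %d attempts" % attempts)

state = {"reinits": 0}

def always_moved():
    raise MovedError()

try:
    execute_with_forced_reinit(
        always_moved,
        lambda: state.__setitem__("reinits", state["reinits"] + 1),
        delay=0,
    )
except MovedError:
    pass

assert state["reinits"] == 1   # reinitialization was forced exactly once
```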
The fix in #385 was incomplete; the cluster might not be reinitialized if reinitialize_counter has already been increased:
https://github.com/Grokzen/redis-py-cluster/blob/f0aaaa4e539bc62e38ce6e23839a12ff192cb7ea/rediscluster/client.py#L661-L664
https://github.com/Grokzen/redis-py-cluster/blob/f0aaaa4e539bc62e38ce6e23839a12ff192cb7ea/rediscluster/nodemanager.py#L348-L352

During a planned ElastiCache failover, the old primary redirects requests to the new one by sending MOVED responses (thereby increasing reinitialize_counter) before the old primary is terminated:
https://github.com/Grokzen/redis-py-cluster/blob/f0aaaa4e539bc62e38ce6e23839a12ff192cb7ea/rediscluster/client.py#L689

An alternative approach to the counter would be to reinitialize on all MOVED responses (as discussed before), as those would always affect a larger number of slots (at least with ElastiCache). This approach is also suggested in the redis documentation: