Perform each operation once/manually setting RedisClusterRequestTTL value

raymondyou97 commented 3 years ago

This is more of a question than an issue. I am running some ElastiCache Redis Cluster instances, and they regularly hit TTL exhausted errors, which seem to be a generic error, along with a surge in connections and a surge in CPU utilization. I believe the main culprit might be the surge in connections, hence the rest of the text below.

I believe it might be because whenever a command fails for some reason, the _execute_command() code retries it up to 16 times before actually failing, and this spirals out of control.

Is it fine to change the RedisClusterRequestTTL value from 16 to 1 so it only tries each operation once? I'm not sure if there are any adverse implications in doing this, if it is even possible, or if 16 is some special number.

I would love to know the exact reason why it ends up with the TTL exhausted error, but for now I want to ensure that any operation, including client initialization, happens once, and if it fails, it just fails. I have retry_on_timeout = False, relatively low values for socket_timeout and socket_connect_timeout, and cluster_down_retry_attempts = False. The last thing is RedisClusterRequestTTL.

Grokzen commented 3 years ago

@raymondyou97 There is no really special reason why 16 is the TTL value, other than that the value was chosen by antirez back in the reference client implementation he did in Ruby.

If you want to set the value to 1, that is your choice. The only consequence is that if your cluster has a node failing, a slot was moved, or any other error occurs, the execute_command method call will exit directly back to your code, and you will miss the nice feature of the client repairing itself and continuing to move along. So I would say that 1 is probably a poor TTL. In practice, 4 is probably the lowest I would go. In that case you still get the benefits of the lib, but you speed up the failure scenario where the client can't recover and raises back to your own code for handling the error.
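To make the trade-off between a TTL of 1 and a TTL of 4 concrete, here is a minimal, self-contained sketch of a TTL-bounded retry loop. It is not the library's code: `MovedError`, `TTLExhaustedError`, `execute_with_ttl` and `make_flaky_send` are hypothetical stand-ins written for illustration, and the real client does more work between attempts (for example refreshing its view of the cluster).

```python
class MovedError(Exception):
    """Stand-in for a known, cluster-related error (hypothetical class)."""


class TTLExhaustedError(Exception):
    """Stand-in for the error raised once every attempt has been used."""


def execute_with_ttl(send, request_ttl=16):
    """Call send() at most request_ttl times, retrying only on MovedError."""
    ttl = int(request_ttl)
    while ttl > 0:
        ttl -= 1
        try:
            return send()
        except MovedError:
            # A real client would repair its slot mapping here before retrying.
            continue
    raise TTLExhaustedError("TTL exhausted.")


def make_flaky_send(failures):
    """Build a send() that raises MovedError `failures` times, then returns 'OK'."""
    state = {"left": failures, "calls": 0}

    def send():
        state["calls"] += 1
        if state["left"] > 0:
            state["left"] -= 1
            raise MovedError("slot moved")
        return "OK"

    send.state = state
    return send


if __name__ == "__main__":
    # With a TTL of 4 the call survives two transient failures.
    print(execute_with_ttl(make_flaky_send(2), request_ttl=4))

    # With a TTL of 1 the same situation fails on the first error.
    try:
        execute_with_ttl(make_flaky_send(2), request_ttl=1)
    except TTLExhaustedError as exc:
        print("failed fast:", exc)
```

In this sketch a TTL of 1 means one attempt and no recovery, while a small value such as 4 still absorbs a couple of transient errors before giving up.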

If you run redis-py-cluster before 2.1.0 then you have to manually inject logging into the _execute_command() method to determine what exception is really the cause for the TTL problem in the end.

If you run the 2.1.0 or later release, there is a logging message for each individual exception that happens.
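To actually see those messages, the application has to configure Python's standard logging. The sketch below uses only the standard library. The logger name `"rediscluster"` is an assumption based on the package name (child loggers such as `rediscluster.client` would inherit from it), and the level at which the library emits these messages is not stated in this thread, so DEBUG is used to catch everything. `enable_cluster_debug_logging` is a hypothetical helper name.

```python
import logging


def enable_cluster_debug_logging(logger_name="rediscluster"):
    """Attach a stream handler at DEBUG level to the given logger.

    The default logger name is an assumption; check which logger names the
    installed version of the library actually uses.
    """
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:
        # Add the handler only once so repeated calls don't duplicate output.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    enable_cluster_debug_logging()
    # Messages from child loggers propagate to the handler configured above.
    logging.getLogger("rediscluster.client").debug("example message")
```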

Also note that the execute method will only attempt to catch any error that we know and that is cluster related, so if you happen to get some other deeper error that is unknown or unable to handle, it would be raised back to you.

yanchidezhang commented 1 year ago

@raymondyou97 Hi Raymond, I was looking for a way to change RedisClusterRequestTTL without modifying the .py files. Do you know of any approach for this? (For example, can I change this value when declaring the client? 16 is a little too much for me.)

raymondyou97 commented 1 year ago

We ended up forking it and adding request_ttl as an optional client input. Setting this value to 1 resolved our needs: now, when the Redis cluster is having issues and is unable to return a successful response, the clients no longer retry in a thundering herd against the server, which had been preventing it from recovering gracefully.
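For anyone who wants to avoid a fork, one possible alternative is to override the value from your own code. This only works if the library reads RedisClusterRequestTTL through the instance at call time, which is an assumption you should verify against the version you have installed. The sketch below uses a hypothetical stand-in class (`ClusterClientStandIn`) instead of the real client so that it runs without a cluster. With the real library the subclass would derive from its client class instead.

```python
class ClusterClientStandIn:
    """Hypothetical stand-in for the client class; models only the TTL attribute."""

    RedisClusterRequestTTL = 16

    def attempts_allowed(self):
        # Looked up through self, so subclass and instance overrides take effect.
        return int(self.RedisClusterRequestTTL)


class SingleAttemptClient(ClusterClientStandIn):
    """Subclass that allows a single attempt per operation."""

    RedisClusterRequestTTL = 1


if __name__ == "__main__":
    print(ClusterClientStandIn().attempts_allowed())  # default value
    print(SingleAttemptClient().attempts_allowed())   # subclass override

    # Per-instance override, leaving the class default untouched.
    client = ClusterClientStandIn()
    client.RedisClusterRequestTTL = 4
    print(client.attempts_allowed())
```

The subclass keeps the change local to your code base, whereas assigning to the attribute on the original class would affect every client in the process.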

Grokzen / redis-py-cluster

Perform each operation once/manually setting RedisClusterRequestTTL value #413