Open trivedishalini opened 2 years ago
Hi @trivedishalini when the cluster restarts, are other write operations working as expected?
Can you share your redis lock configuration conductor.redis-lock.serverAddress
maybe mask the hostname if needed.
@v1r3n
actually this issue is during performance testing. So when cluster restarted, 50% of workflows not working due to this RedisLock issue . May be out of 10 workflow 3-4 working fine but rest giving this issue.
For now temporary resolution is , every time we need to restart all redis pod whenever cluster got restarted .
As we are using Conductor : 2.30.3 version below are the details
workflow.redis.locking.server.type=SENTINEL
workflow.redis.locking.server.master.name=master
workflow.redis.locking.server.address=redis://
workflow.decider.locking.enabled=true workflow.decider.locking.namespace=conductor workflow.decider.locking.server=REDIS workflow.locking.lease.time.ms=60000 workflow.locking.time.to.try.ms=500
Hi @trivedishalini , Looks like when the redis node is restarted, the replica is marked as read-only and conductor is trying to write to it. Quick google search shows these two options, solution 1:
Open the configuration file corresponding to the redis service and modify the value of the attribute slave-read-only to no, so that it can be written.
solution 2:
Or a faster way is through run this command
redis-cli -h 127.0.0.1 -p 6379 slaveof no one
@trivedishalini side note: v2.30.3 is no longer supported. It's recommended to migrate to the latest 3.x version.
@trivedishalini side note: v2.30.3 is no longer supported. It's recommended to migrate to the latest 3.x version.
@aravindanr , Yes , we already moved to 3.x but still having one customer on 2.x .
Hi @trivedishalini , Looks like when the redis node is restarted, the replica is marked as read-only and conductor is trying to write to it. Quick google search shows these two options, solution 1:
Open the configuration file corresponding to the redis service and modify the value of the attribute slave-read-only to no, so that it can be written.
solution 2:
Or a faster way is through run this command
redis-cli -h 127.0.0.1 -p 6379 slaveof no one
Thanks @manan164 for suggestion . I will update here once I tried these.
Hi @trivedishalini , Looks like when the redis node is restarted, the replica is marked as read-only and conductor is trying to write to it. Quick google search shows these two options, solution 1:
Open the configuration file corresponding to the redis service and modify the value of the attribute slave-read-only to no, so that it can be written.
solution 2:
Or a faster way is through run this command
redis-cli -h 127.0.0.1 -p 6379 slaveof no one
Note that as per redis documentation, making a replica writable is strong not recommended:
You may wonder why it is possible to revert the read-only setting and have replica instances that can be targeted by write operations. The answer is that writable replicas exist only for historical reasons. Using writable replicas can result in inconsistency between the master and the replica, so it is not recommended to use writable replicas.
Therefore, I would put this option out of the game.
What I believe is happening in your case is this:
AFAIK, Redis has basically two operation modes for HA: Cluster and sentinel. The thing about sentinel is that the client (conductor) must query the sentinel in order to get back the IP of the current master. Then after receiving back the IP it can connect to the master. More on it here. Like this: app (conductor) --> sentinel --> master info --> app (conductor) --> write data to master.
In cluster mode, this is not required because if you connect to a wrong node, that wrong node tells you right away the IP of the correct master, i.e., you would need only to follow redirection in the client. More about this here.
Now, with this in mind and looking at conductor code, we see this: https://github.com/Netflix/conductor/blob/main/redis-lock/src/main/java/com/netflix/conductor/redislock/config/RedisLockConfiguration.java#L79
This method of redisson library accepts a comma separated list for multiple address: https://github.com/redisson/redisson/blob/master/redisson/src/main/java/org/redisson/config/SentinelServersConfig.java#L128
Then maybe something to try is: Pass in conductor config redis lock properties the individual PODs ip and port instead of the service name. Note that this is not ideal as those IPs can change, but at least you can rule out this hypothesis. Another thing to consider is maybe adding a proxy (e.g. HAProxy) before the sentinel config which can adjust things directly for you in case of failover. More on it here.
would adding .split(",") here like this help. If so I can make a pr to have it added.
Describe the bug Occurs only when due to some reason cluster restarted. Below error noticed whenever cluster restarted
ERROR com.netflix.conductor.locking.redis.RedisLock - Failed to acquireLock for lockId: f19a832a org.redisson.client.RedisException: ERR Error running script (call to f_dad): @user_script:1: @user_script: 1: -READONLY You can't write against a read only replica.. channel: [id: 0xece8af7f, L:/111.111.11.10:11111 - R:111.111.111.11/1111.111.111.12:5555] command: (EVAL), promise: RedissonPromise [promise=ImmediateEventExecutor$ImmediatePromise@46f3044a(incomplete)], params: [if (redis.call('exists', KEYS[1]) == 0) then redis.call('hincrby', KEYS[1], ARGV[2], 1); redis.call(..., 1, conductor.f19a832a, 60000, b46947ad:333]
Details Conductor version: 2.30.3 Persistence implementation: Postgress Queue implementation: Postgres Lock: Redis Workflow definition: N/A Task definition: N/A Event handler definition: N/A
To Reproduce Steps to reproduce the behavior:
Expected behavior Redis pods should come up without any issue in case cluster restarted.
Screenshots N/A.
Additional context Both conductor and redis are running in a Kubernetes cluster.
Note