Conductor not able to acquire lock on redis for distributed locking

trivedishalini commented 2 years ago

Describe the bug Occurs only when due to some reason cluster restarted. Below error noticed whenever cluster restarted

ERROR com.netflix.conductor.locking.redis.RedisLock - Failed to acquireLock for lockId: f19a832a org.redisson.client.RedisException: ERR Error running script (call to f_dad): @user_script:1: @user_script: 1: -READONLY You can't write against a read only replica.. channel: [id: 0xece8af7f, L:/111.111.11.10:11111 - R:111.111.111.11/1111.111.111.12:5555] command: (EVAL), promise: RedissonPromise [promise=ImmediateEventExecutor$ImmediatePromise@46f3044a(incomplete)], params: [if (redis.call('exists', KEYS[1]) == 0) then redis.call('hincrby', KEYS[1], ARGV[2], 1); redis.call(..., 1, conductor.f19a832a, 60000, b46947ad:333]

Details Conductor version: 2.30.3 Persistence implementation: Postgress Queue implementation: Postgres Lock: Redis Workflow definition: N/A Task definition: N/A Event handler definition: N/A

To Reproduce Steps to reproduce the behavior:

Enable Redis locking using SENTINEL . Redis should work fine on first deployment.
restart the cluster
See the below error coming in conductor logs com.netflix.conductor.locking.redis.RedisLock - Failed to acquireLock

Expected behavior Redis pods should come up without any issue in case cluster restarted.

Screenshots N/A.

Additional context Both conductor and redis are running in a Kubernetes cluster.

Note

This working fine during first deployment of distributed locking until cluster not restarted
I tried redission:3.17.3 version with conductor:2.30.1 version but issue not resolved

v1r3n commented 2 years ago

Hi @trivedishalini when the cluster restarts, are other write operations working as expected? Can you share your redis lock configuration conductor.redis-lock.serverAddress maybe mask the hostname if needed.

trivedishalini commented 2 years ago

@v1r3n

actually this issue is during performance testing. So when cluster restarted, 50% of workflows not working due to this RedisLock issue . May be out of 10 workflow 3-4 working fine but rest giving this issue.
For now temporary resolution is , every time we need to restart all redis pod whenever cluster got restarted .
As we are using Conductor : 2.30.3 version below are the details

workflow.redis.locking.server.type=SENTINEL workflow.redis.locking.server.master.name=master workflow.redis.locking.server.address=redis://:26379 workflow.redis.locking.server.password=

workflow.decider.locking.enabled=true workflow.decider.locking.namespace=conductor workflow.decider.locking.server=REDIS workflow.locking.lease.time.ms=60000 workflow.locking.time.to.try.ms=500

manan164 commented 2 years ago

Hi @trivedishalini , Looks like when the redis node is restarted, the replica is marked as read-only and conductor is trying to write to it. Quick google search shows these two options, solution 1:

Open the configuration file corresponding to the redis service and modify the value of the attribute slave-read-only to no, so that it can be written.

solution 2:

Or a faster way is through run this command

redis-cli -h 127.0.0.1 -p 6379 slaveof no one

aravindanr commented 2 years ago

@trivedishalini side note: v2.30.3 is no longer supported. It's recommended to migrate to the latest 3.x version.

trivedishalini commented 2 years ago

@trivedishalini side note: v2.30.3 is no longer supported. It's recommended to migrate to the latest 3.x version.

@aravindanr , Yes , we already moved to 3.x but still having one customer on 2.x .

trivedishalini commented 2 years ago

Hi @trivedishalini , Looks like when the redis node is restarted, the replica is marked as read-only and conductor is trying to write to it. Quick google search shows these two options, solution 1:

Open the configuration file corresponding to the redis service and modify the value of the attribute slave-read-only to no, so that it can be written.

solution 2:

Or a faster way is through run this command

redis-cli -h 127.0.0.1 -p 6379 slaveof no one

Thanks @manan164 for suggestion . I will update here once I tried these.

flavioschuindt commented 2 years ago

Hi @trivedishalini , Looks like when the redis node is restarted, the replica is marked as read-only and conductor is trying to write to it. Quick google search shows these two options, solution 1:

Open the configuration file corresponding to the redis service and modify the value of the attribute slave-read-only to no, so that it can be written.

solution 2:

Or a faster way is through run this command

redis-cli -h 127.0.0.1 -p 6379 slaveof no one

Note that as per redis documentation, making a replica writable is strong not recommended:

You may wonder why it is possible to revert the read-only setting and have replica instances that can be targeted by write operations. The answer is that writable replicas exist only for historical reasons. Using writable replicas can result in inconsistency between the master and the replica, so it is not recommended to use writable replicas.

Therefore, I would put this option out of the game.

What I believe is happening in your case is this:

You start conductor just fine with redis and everything is working;
- Note that conductor here just call the k8s service which can forward it to ANY of the nodes (including the slaves).
You restart the cluster;
When the cluster comes back again, the k8s service might be redirecting your request to a slave, not a master and hence you see the failure.

AFAIK, Redis has basically two operation modes for HA: Cluster and sentinel. The thing about sentinel is that the client (conductor) must query the sentinel in order to get back the IP of the current master. Then after receiving back the IP it can connect to the master. More on it here. Like this: app (conductor) --> sentinel --> master info --> app (conductor) --> write data to master.

In cluster mode, this is not required because if you connect to a wrong node, that wrong node tells you right away the IP of the correct master, i.e., you would need only to follow redirection in the client. More about this here.

Now, with this in mind and looking at conductor code, we see this: https://github.com/Netflix/conductor/blob/main/redis-lock/src/main/java/com/netflix/conductor/redislock/config/RedisLockConfiguration.java#L79

This method of redisson library accepts a comma separated list for multiple address: https://github.com/redisson/redisson/blob/master/redisson/src/main/java/org/redisson/config/SentinelServersConfig.java#L128

Then maybe something to try is: Pass in conductor config redis lock properties the individual PODs ip and port instead of the service name. Note that this is not ideal as those IPs can change, but at least you can rule out this hypothesis. Another thing to consider is maybe adding a proxy (e.g. HAProxy) before the sentinel config which can adjust things directly for you in case of failover. More on it here.

ZergRushJoe commented 2 years ago

would adding .split(",") here like this help. If so I can make a pr to have it added.

Netflix / conductor

Conductor not able to acquire lock on redis for distributed locking #3037