LuaLock with block occasionally failing to exit in clustered mode

coandco commented 7 months ago

Expected Behaviour

Whenever you enter a LuaLock block, like so:

import asyncio
from coredis import RedisCluster
from coredis.recipes.locks import LuaLock
from coredis.exceptions import LockError

async def main():
    rclient = RedisCluster(startup_nodes=[{"host": "example", "port": 6372}])
    try:
        async with LuaLock(rclient, "examplename", blocking_timeout = 0.1, timeout=10):
            print("entered lock block")
            asyncio.sleep(1)
        print("exited lock block")
    except LockError:
        print("Couldn't acquire lock")
        return
    print("after lock")

if __name__ == "__main__":
    asyncio.run(main())

You should always see "exited lock block" and "after lock" if you saw "entered lock block".

Current Behaviour

I've got a decently-large distributed system with where multiple instances of the program across multiple boxes are all trying for the same lock once per minute, and whichever one wins executes what's in the lock block. What I'm seeing is that sometimes (mostly with lock-block contents that return quickly) it will enter the block, execute what's inside, and then just stop executing the function once it tries to exit the block.

Steps to Reproduce

It's a bug that's showing up sporadically in production, and takes around half an hour of four boxes contending for the lock once a minute to show up. I'm not sure how to reproduce it more reliably -- it's maybe a race condition of some kind?

Your Environment

coredis version: 4.16.0
Redis version: 6.0.16
Operating system: Debian 11
Python version: 3.9.2

coandco commented 7 months ago

Ah! I think I figured out what's going on here. Re-examining my logs, it looks like the failed runs are raising LockError("Cannot release a lock that's no longer owned"). So presumably what's happening is something like the following:

Acquire lock
Finish task quickly and thus cancel my lock-extender task
Go to release the lock and get preempted by other asyncio tasks
By the time the lock aexit code gets executed, the lock has expired

I had been mistakenly assuming that getting LockError meant that it couldn't acquire the lock, but it happens with lock extension and release as well. As it is, I think I need to distinguish between LockError("Could not acquire lock") and LockError("Cannot release a lock that's no longer owned"). Is there any chance they could be made into separate exception types that all inherit from the base LockError class? For now I think I'm going to have to just check str(e).

alisaifee commented 7 months ago

Thanks for digging into this @coandco!

Trying to reproduce this myself might take a while - but I'm happy to make a quick change to have explicit exceptions.

coandco commented 6 months ago

Thanks! I think being able to differentiate between lock acquisition errors and lock release errors without having to do string comparison would do what I need.

alisaifee commented 6 months ago

@coandco could you take a look at https://github.com/alisaifee/coredis/pull/227 to see if that satisfies your use case?

coandco commented 6 months ago

Hmm. It'd make sense to have lock acquisition separated out as well, I think. Otherwise it looks good.

alisaifee commented 6 months ago

Hmm. It'd make sense to have lock acquisition separated out as well, I think. Otherwise it looks good.

Good point - fixed in https://github.com/alisaifee/coredis/pull/227/commits/9bb60094ff2aebfac364c901df86fee6537b2c71

alisaifee commented 6 months ago

Specialized exceptions available in 4.17.0

alisaifee / coredis