Closed coandco closed 6 months ago
Ah! I think I figured out what's going on here. Re-examining my logs, it looks like the failed runs are raising `LockError("Cannot release a lock that's no longer owned")`. So presumably what's happening is something like the following:
I had been mistakenly assuming that getting `LockError` meant that it couldn't acquire the lock, but it happens with lock extension and release as well. As it is, I think I need to distinguish between `LockError("Could not acquire lock")` and `LockError("Cannot release a lock that's no longer owned")`. Is there any chance they could be made into separate exception types that all inherit from the base `LockError` class? For now I think I'm going to have to just check `str(e)`.
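To make the request concrete, here is a minimal, self-contained sketch of the two approaches. The subclass names `LockAcquisitionError` and `LockReleaseError` are hypothetical, chosen only for illustration, and the `LockError` class below is a local stand-in, not the library's own class. The two message strings are the ones quoted above.

```python
class LockError(Exception):
    """Local stand-in for the library's base LockError."""


class LockAcquisitionError(LockError):
    """Hypothetical subclass: the lock could not be acquired."""


class LockReleaseError(LockError):
    """Hypothetical subclass: the lock was no longer owned at release time."""


def classify_by_type(exc: LockError) -> str:
    """Preferred: distinguish failures by exception type."""
    if isinstance(exc, LockAcquisitionError):
        return "acquire"
    if isinstance(exc, LockReleaseError):
        return "release"
    return "other"


def classify_by_message(exc: LockError) -> str:
    """Interim workaround: inspect str(e). Breaks if the wording ever changes."""
    message = str(exc)
    if "Could not acquire lock" in message:
        return "acquire"
    if "no longer owned" in message:
        return "release"
    return "other"
```

Because both hypothetical subclasses inherit from `LockError`, existing `except LockError:` handlers would keep working, while new code could catch the narrower type it cares about.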
Thanks for digging into this @coandco!
Trying to reproduce this myself might take a while - but I'm happy to make a quick change to have explicit exceptions.
Thanks! I think being able to differentiate between lock acquisition errors and lock release errors without having to do string comparison would do what I need.
@coandco could you take a look at https://github.com/alisaifee/coredis/pull/227 to see if that satisfies your use case?
Hmm. It'd make sense to have lock acquisition separated out as well, I think. Otherwise it looks good.
Good point - fixed in https://github.com/alisaifee/coredis/pull/227/commits/9bb60094ff2aebfac364c901df86fee6537b2c71
Expected Behaviour
Whenever you enter a LuaLock block, like so:
You should always see "exited lock block" and "after lock" if you saw "entered lock block".
Current Behaviour
I've got a decently large distributed system where multiple instances of the program across multiple boxes all try for the same lock once per minute, and whichever one wins executes what's in the lock block. What I'm seeing is that sometimes (mostly with lock-block contents that return quickly) it will enter the block, execute what's inside, and then just stop executing the function once it tries to exit the block.
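One way this symptom can arise, sketched under assumptions: if releasing the lock raises on exit from the block (as the `LockError("Cannot release a lock that's no longer owned")` in the logs suggests), the exception propagates out of the `async with` statement and the rest of the function never runs. The `ExpiredToyLock` class below is a hypothetical toy that always fails on release. It is not the library's `LuaLock` and does not talk to Redis.

```python
import asyncio


class LockError(Exception):
    """Local stand-in for the library's LockError."""


class ExpiredToyLock:
    """Hypothetical toy lock that has always lost ownership by release time."""

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Simulates a release attempted after the lock is no longer owned.
        raise LockError("Cannot release a lock that's no longer owned")


async def job(events: list) -> None:
    async with ExpiredToyLock():
        events.append("entered lock block")
    # Skipped: the exception raised in __aexit__ propagates from here.
    events.append("after lock")


async def main() -> list:
    events = []
    try:
        await job(events)
    except LockError as e:
        events.append(f"LockError: {e}")
    return events


events = asyncio.run(main())
print(events)
```

In this toy, "after lock" is never recorded. If the caller swallows or only logs the exception, the function simply appears to stop at the end of the block.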
Steps to Reproduce
It's a bug that's showing up sporadically in production, and takes around half an hour of four boxes contending for the lock once a minute to show up. I'm not sure how to reproduce it more reliably -- it's maybe a race condition of some kind?
Your Environment