awslabs / amazon-dynamodb-lock-client

The AmazonDynamoDBLockClient is a general purpose distributed locking library built on top of DynamoDB. It supports both coarse-grained and fine-grained locking.
472 stars 85 forks

Detect stale locks by heartbeat period rather than lease duration #34

Open mmindenhall opened 5 years ago

mmindenhall commented 5 years ago

We are looking for a way to solve the problem of Lambdas being invoked multiple times for the same event. We have some long-running Lambdas (processing uploaded files up to 10 GB in size), and sometimes the Lambda is triggered a 2nd (or 3rd) time while the first invocation is still running normally.

I thought we could use this library to acquire a lock in the lambda, but if the lease duration approaches the runtime limit of a lambda (currently 15 minutes), there will never be enough time after detecting a stale lock for the next client to do its work. For example:

  1. We have some files that take 12-15 minutes to process, so we want to set our maximum lease time to 15 minutes.
  2. If a Lambda invocation crashed without releasing the lock, another Lambda would have to wait the full lease duration (15 minutes) to determine that the lock was stale. Thus, the second Lambda would use its entire invocation time limit just waiting to acquire the lock.

Would there be any drawback to using the heartbeat period (plus some configurable buffer) to detect stale locks instead of the lease duration? This would require the heartbeatPeriod to be added to the stored item. But having done that, the How we handle clock skew section could become:

> ...a call to acquireLock reads in the current lock, checks the RecordVersionNumber of the lock (which is a GUID) and starts a timer. If the lock still has the same GUID after the ~~lease duration time~~ heartbeat period (plus a buffer) has passed, the client will determine that the lock is stale and expire it.

With this approach, I could set my lease duration to 15min, and my heartbeat period to something like 5s (plus 5s buffer). Then I could detect a stale lock within ~10s rather than 15 minutes.
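The proposed detection rule is simple enough to sketch as pure logic. The helper below is hypothetical (the library has no such class); it only illustrates the rule that a lock is stale once its RecordVersionNumber has not changed for one heartbeat period plus a buffer, independent of the much longer lease duration:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the proposed heartbeat-based staleness check.
public final class StaleLockCheck {
    static boolean isStale(Instant lastObservedChange, Instant now,
                           Duration heartbeatPeriod, Duration buffer) {
        // The watcher records when the lock's RecordVersionNumber last changed.
        // If it has not changed within heartbeatPeriod + buffer, the owner has
        // stopped heartbeating and the lock can be treated as expired.
        return Duration.between(lastObservedChange, now)
                       .compareTo(heartbeatPeriod.plus(buffer)) > 0;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2020-01-01T00:00:00Z");
        Duration hb = Duration.ofSeconds(5);
        Duration buf = Duration.ofSeconds(5);
        // 8s since last change: still inside the 10s window.
        System.out.println(isStale(t0, t0.plusSeconds(8), hb, buf));  // prints false
        // 11s since last change: past the 10s window, lock is stale.
        System.out.println(isStale(t0, t0.plusSeconds(11), hb, buf)); // prints true
    }
}
```

With a 5s heartbeat and 5s buffer, a crashed owner is detected in ~10s even though the lease duration is still 15 minutes.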

DALDEI commented 5 years ago

This is the precise reason I abandoned this library, yet I have not found a better one. The 'happy path' generally takes too long to detect an abandoned lock, which leads to the design pattern of not calling the locking API at any granularity smaller than the application lifetime. That in turn leads to more frequent abandoned locks, because apps crash rather than release their locks. This wouldn't happen as often if the API did not block for so long in the first place: if it acquired free locks quickly, the API would be used at a finer granularity, giving a higher chance of successfully releasing the locks -- but also more frequent calls to the API, which in turn increases the chances of some caller not releasing the lock in time.
This is not as easy as it sounds ("just release your locks!"), sorry. The most significant use case for my code is to enforce 'only one' running instance of a Docker or ECS application. When applying updates, the 'down period' needs to be minimal. That is done either by killing off the old app and starting a new one as fast as possible, or, in the much friendlier modern world, by a rolling update. (ECS promotes the latter to the extent that it's very difficult to implement the former.) This consistently results in either A) hard termination of a running app (it won't release its locks), or B) a new app coming up before the old app is 'retired', which blocks to the point that the health checks cannot simultaneously be healthy on both hosts and also be correct.
So ... to satisfy B I can 'kill off' the draining version of the app to speed up the replacement, but then that just makes the lock acquire take even longer, putting the service out of commission for needless minutes.

ironcream commented 4 years ago

@mmindenhall @DALDEI you probably don't need a lock for 15 minutes (or any other long time). What you need to use is a heartbeat feature. You want a lock with the lease time of, say, 30 seconds and a heartbeat of something like 5 seconds (or less).

The documentation recommends making the lease duration at least three times larger than the heartbeat period to ensure successful heartbeating.

The way to understand it is: each heartbeat prolongs the lock for LeaseDuration. This way, when your Lambda fails without properly releasing the lock, its background thread dies as well, it stops heartbeating its lock, the lock is abandoned, and another Lambda is able to pick it up within LeaseDuration.

On the side that tries to acquire the lock you also have some flexibility. You can choose to:

This is also described in the readme:

> ...when you call the constructor AmazonDynamoDBLockClient, you can specify createHeartbeatBackgroundThread=true and it will spawn a background thread that continually updates the record version number on your locks to prevent them from expiring (it does this by calling the sendHeartbeat() method in the lock client). This will ensure that as long as your JVM is running, your locks will not expire until you call releaseLock() or lockItem.close().
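Putting those numbers into code, a sketch of a client configured with a 30-second lease, a 5-second heartbeat, and the background heartbeat thread, following the builder API shown in the readme (the table name and lock key are placeholders, and `dynamoDB` is assumed to be an already-working DynamoDB client):

```java
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.AcquireLockOptions;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBLockClient;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBLockClientOptions;
import com.amazonaws.services.dynamodbv2.LockItem;

public class ShortLeaseExample {
    public static void main(String[] args) throws InterruptedException {
        AmazonDynamoDB dynamoDB = AmazonDynamoDBClientBuilder.defaultClient();

        // Short lease + frequent heartbeats: a crashed owner is detected
        // within ~30s, while a healthy owner can hold the lock for a full
        // 15-minute Lambda run because the background thread keeps renewing it.
        AmazonDynamoDBLockClient client = new AmazonDynamoDBLockClient(
            AmazonDynamoDBLockClientOptions.builder(dynamoDB, "lockTable") // placeholder table
                .withTimeUnit(TimeUnit.SECONDS)
                .withLeaseDuration(30L)
                .withHeartbeatPeriod(5L)
                .withCreateHeartbeatBackgroundThread(true)
                .build());

        // Non-blocking attempt: a duplicate invocation simply gives up.
        Optional<LockItem> lock = client.tryAcquireLock(
            AcquireLockOptions.builder("upload-12345").build()); // placeholder key
        if (lock.isPresent()) {
            try {
                // ... process the file; heartbeats happen in the background ...
            } finally {
                client.releaseLock(lock.get());
            }
        }
    }
}
```

The key point is that the lease duration no longer needs to cover the worst-case runtime of the work, only the worst-case time to detect a dead owner.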
DALDEI commented 4 years ago

Thank you for the explanation -- even after reading all the code, code comments, readme, etc. a few dozen times, it's still not obvious :) Unfortunately I'm off to a different problem: a serverless locking service which does not have to 'stay alive' to issue 'once and only once' tokens. Currently I'm experimenting with Step Functions activities, one of the few near-guaranteed 'once only' consistency services on AWS. The other one I contemplated is using FIFO SQS queues as a kind of 'semaphore', where each SQS message is the semaphore object itself and never gets acknowledged, only read, with its visibility timeout extended for the 'lock'.
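The SQS-as-semaphore idea above can be sketched roughly as follows. This is a hypothetical illustration, not a vetted design: the queue URL is a placeholder, `stillWorking()` stands in for the real workload, and error handling and SQS's visibility-timeout ceiling are ignored. The single message sitting on the FIFO queue is the token; whoever receives it holds the "lock" and keeps extending its visibility instead of ever deleting it:

```java
import java.util.List;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class SqsSemaphoreSketch {
    public static void main(String[] args) throws InterruptedException {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = // placeholder queue URL
            "https://sqs.us-east-1.amazonaws.com/123456789012/semaphore.fifo";

        // Receiving the message hides it from all other consumers for 30s;
        // that invisibility window is the "lock".
        List<Message> msgs = sqs.receiveMessage(new ReceiveMessageRequest(queueUrl)
                .withMaxNumberOfMessages(1)
                .withVisibilityTimeout(30)).getMessages();

        if (!msgs.isEmpty()) {
            Message token = msgs.get(0);
            while (stillWorking()) {
                Thread.sleep(10_000);
                // Never delete the message; just keep pushing its visibility
                // out. If this process dies, the message reappears after 30s
                // and the next consumer acquires the "lock" automatically.
                sqs.changeMessageVisibility(queueUrl, token.getReceiptHandle(), 30);
            }
            // Make the token visible again immediately to release the lock.
            sqs.changeMessageVisibility(queueUrl, token.getReceiptHandle(), 0);
        }
    }

    private static boolean stillWorking() { return false; } // placeholder for real work
}
```

Note the parallel with the heartbeat proposal in this issue: the "lease" (visibility timeout) is short, and liveness is signaled by repeated renewals rather than by one long lease.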

ctron-te commented 4 years ago

I have the same problem. I thought this library would help me with locking lambdas but it doesn't seem to.

It's missing the ability for a lock to expire if it wasn't updated for x ms, rather than requiring a new consumer to wait x ms to expire the lock.

I think the table is missing a timestamp of when the lock was created so we can know when a lock is expired.

aquinney0 commented 1 year ago

I think the table is missing a timestamp of when the lock was created so we can know when a lock is expired.

I don't see how this was not a baseline feature from day 1. How is it preferable that ANY lock in the table must live out its full lease duration unless another client releases it, even if that lock has been dangling for years?

aquinney0 commented 1 year ago

I noticed this PR:

https://github.com/awslabs/amazon-dynamodb-lock-client/pull/79

It adds a 'lookup time' attribute to the stored lock; when the lock is read back, that value is used instead of the current time, which would basically solve this issue.