awslabs / amazon-dynamodb-lock-client

The AmazonDynamoDBLockClient is a general purpose distributed locking library built on top of DynamoDB. It supports both coarse-grained and fine-grained locking.
Other
472 stars 85 forks source link

withShouldSkipBlockingWait will always fail to acquire lock if already exists and expired #93

Open cristianocca opened 1 year ago

cristianocca commented 1 year ago

Acquiring a lock with withShouldSkipBlockingWait(true) (so we don't block waiting) does not attempt to retrieve a lock that is expired (e.g., owner died / past lease time) as opposed to not using this option. Basically, if a lock is stale in DDB, using withShouldSkipBlockingWait will never be able to pick up the lock and it will consider it as already taken.

rahulvishwanatham commented 10 months ago

I am facing the same issue, any solution to this please? A lock is stale in my DDB and no other instance is able to acquire the lock indefinitely. Is there a way to save the lastUpdatedTime to DDB, so that the isExpired() method of LockItem works well.

a-lcardos commented 6 months ago

Facing the same issue.

Can be easily reproduced

 val lockClient = AmazonDynamoDBLockClient(
            AmazonDynamoDBLockClientOptions.builder(ddb, "test_lock")
                .withTimeUnit(TimeUnit.SECONDS)
                .withLeaseDuration(5L)
                .withCreateHeartbeatBackgroundThread(false)
                .build())

        val lock1 = lockClient.acquireLock(
            AcquireLockOptions
                .builder("myTest")
                .withAdditionalTimeToWaitForLock(0)
                .withTimeUnit(TimeUnit.SECONDS)
                .build()
        )
        log.info { "First lock $lock1" }

        Thread.sleep(10000)
        log.info { "Slept a 10 sec. Lease time is 5 sec. Attempting to acquire lock again" }

        val lock2 = lockClient.acquireLock(
            AcquireLockOptions
                .builder("myTest")
                .withShouldSkipBlockingWait(true)
                .withAdditionalTimeToWaitForLock(1)
                .withTimeUnit(TimeUnit.SECONDS)
                .build()
        )
        log.info { "Lock $lock2" }

Fails in attempt to acquire the lock2

    01 Mar 2024 11:38:31,279 [INFO]  (Test worker): First lock LockItem{Partition Key=myTest, Sort Key=Optional.empty, Owner Name=---, Lookup Time=349680572, Lease Duration=5000, Record Version Number=a7c31f70-67ca-4105-af95-0bd75193df52, Delete On Close=true, Data=, Is Released=false}
    01 Mar 2024 11:38:41,289 [INFO]  (Test worker): Slept a 10 sec. Lease time is 5 sec. Attempting to acquire lock again

    com.amazonaws.services.dynamodbv2.model.LockCurrentlyUnavailableException: The lock being requested is being held by another client.
        at app//com.amazonaws.services.dynamodbv2.AmazonDynamoDBLockClient.acquireLock(AmazonDynamoDBLockClient.java:483)
        at app//com.integ.DynamoDbIntegrationTest.sample example(DynamoDbIntegrationTest.kt:136)
a-lcardos commented 6 months ago

README says this should work in a different way

If the lock does not exist or if the lock has been acquired by the other machine and is stale (has passed the lease duration), this would successfully acquire the lock.

Issue is happening because the isExpired checks if its stale actually using the lock lookupTime

return LockClientUtils.INSTANCE.millisecondTime() - this.lookupTime.get() > this.leaseDuration.get();

The lookupTime is the time the item was read from the DynamoDb table

https://github.com/awslabs/amazon-dynamodb-lock-client/blame/master/src/main/java/com/amazonaws/services/dynamodbv2/AmazonDynamoDBLockClient.java#L987 https://github.com/awslabs/amazon-dynamodb-lock-client/blame/master/src/main/java/com/amazonaws/services/dynamodbv2/AmazonDynamoDBLockClient.java#L1020

By reading the documentation I was interpreting "stale" and "lease time" meaning: if an item is added at X time and the lease time is L, at time Y where Y>X+L the lock would be stale (or expired).

The implementation does not translate to that exactly.

When using withShouldSkipBlockingWait(false) if the lock is stale (my definition of stale) it takes the whole additional lease time plus the additional time you passed for it to acquire the lock.

Can be easily reproduced by changing the lock2 in the example above to use withShouldSkipBlockingWait(false)

    01 Mar 2024 11:55:37,123 [INFO]  (Test worker): Slept a 10 sec. Lease time is 5 sec. Attempting to acquire lock again
    01 Mar 2024 11:55:42,736 [INFO]  (Test worker): Lock LockItem{Partition Key=myTest, Sort Key=Optional.empty, Owner Name=----, Lookup Time=350712085, Lease Duration=5000, Record Version Number=0c6e2944-0819-456f-aec1-6f5a53428883, Delete On Close=true, Data=, Is Released=false}

(it took 5 seconds to acquire the lock after the 10 second sleep. The lease time of this lock was 5 seconds. The attempt to get the second lock should have immediately acquired the lock if the definition of stale was the one I added here)

evanhat commented 6 months ago

+1 @a-lcardos

I had withShouldSkipBlockingWait set to true, and I just spent 2 hours trying to figure out why the lease expiration wasn't working lol.

If my understanding is correct, it looks like the only way for a lock to "expire" (without calling release) is to call acquireLock() and let it block for the entire lease duration (?)

It seems like there's no way you can use this to have a non-blocking lock acquire w/ lease expiration. Maybe you could work around it by setting a ttl on you table items, but that's just super not ideal.

cristianocca commented 6 months ago

This was fixed in the internally-maintained version of this package. Would be good to have the fix shipped to the public version as well.

evanhat commented 6 months ago

@cristianocca It looks like you're right! I guess this was resolved on branch v1.3.x 👀.

evanhat commented 6 months ago

@cristianocca It looks like you're also right about that version not being shipped to the public 🥲.