awslabs / amazon-kinesis-client

Client library for Amazon Kinesis
Apache License 2.0

Faster acquisition of leases from workers that have been gracefully shut down? #845

Open rtyley opened 3 years ago

rtyley commented 3 years ago

New instances seem to take a relatively long time to acquire leases after old instances have stopped renewing them (seen in KCL 2.3.6, probably other versions too). Apart from incremental lease stealing, the new instances appear to have to wait the full failoverTimeMillis (referred to as leaseDurationMillis within DynamoDBLeaseCoordinator and leaseDurationNanos within DynamoDBLeaseTaker) before a lease is considered expired and they can take it. With failoverTimeMillis set to just 30s, I've seen new instances take ~3 minutes to take all the leases from old instances during a deployment (the old instances were fully terminated by 16:57:40 in the graph below, but the new record processors weren't all initialised until 17:00:35).

[graph: lease handover during the deployment; old instances terminated by 16:57:40, new record processors not all initialised until 17:00:35]
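For reference, this is roughly how we set failoverTimeMillis; a minimal sketch assuming the usual KCL 2.x ConfigsBuilder wiring, with the builder itself constructed elsewhere:

```java
import software.amazon.kinesis.common.ConfigsBuilder;
import software.amazon.kinesis.leases.LeaseManagementConfig;

public class LeaseConfigExample {
    // configsBuilder is assumed to have been created with the usual stream name,
    // application name, SDK clients, worker id and record-processor factory.
    static LeaseManagementConfig leaseConfig(ConfigsBuilder configsBuilder) {
        // failoverTimeMillis is how long a lease must go un-renewed before
        // another worker may treat it as expired and take it (30s here,
        // matching the deployment in the graph above).
        return configsBuilder.leaseManagementConfig()
                .failoverTimeMillis(30_000L);
    }
}
```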

Although the failover timeout is obviously good for handling ShardRecordProcessors that become unresponsive, if an instance is shutting down smoothly (e.g. scheduler.startGracefulShutdown() has been called), couldn't the old scheduler clearly indicate that it is no longer responsible for its leases by invoking evictLease() (setting the lease owner to null) on the leases it still holds during graceful shutdown? This would happen after shutdownRequested() is called on the ShardRecordProcessor and before leaseLost(). It could possibly live in s.a.k.lifecycle.ShutdownTask.call()?
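To make the suggestion concrete, here's a rough sketch of what I have in mind. This is not KCL's actual shutdown path; I'm assuming the getAssignments() and evictLease() methods on the existing lease interfaces, and the wiring into ShutdownTask is exactly the part I'm unsure about:

```java
import java.util.Collection;
import software.amazon.kinesis.leases.Lease;
import software.amazon.kinesis.leases.LeaseCoordinator;
import software.amazon.kinesis.leases.LeaseRefresher;

/**
 * Sketch of the proposal, not KCL's actual shutdown code: during a graceful
 * shutdown, after shutdownRequested() has been delivered to each
 * ShardRecordProcessor, relinquish ownership of the leases we still hold so
 * that other workers don't have to wait out failoverTimeMillis.
 */
public class GracefulLeaseHandover {
    static void relinquishHeldLeases(LeaseCoordinator leaseCoordinator,
                                     LeaseRefresher leaseRefresher) {
        Collection<Lease> heldLeases = leaseCoordinator.getAssignments();
        for (Lease lease : heldLeases) {
            try {
                // evictLease() conditionally sets leaseOwner to null in the
                // lease table, signalling to other workers that nobody owns it.
                leaseRefresher.evictLease(lease);
            } catch (Exception e) {
                // Best effort: if eviction fails, the lease simply expires
                // after failoverTimeMillis as it does today.
            }
        }
    }
}
```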

Having done this, DynamoDBLeaseTaker.takeLeases() could be more acquisitive: as well as taking expired leases, it could take unassigned ones too (see https://github.com/awslabs/amazon-kinesis-client/pull/848), so in the case of a graceful shutdown of old processors the handover could be much quicker than waiting the full failoverTimeMillis.
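On the taking side, the change would be something like the sketch below; this is just the shape of the idea rather than the actual diff in the PR, and the isUnassigned() helper is hypothetical:

```java
import java.util.List;
import java.util.stream.Collectors;
import software.amazon.kinesis.leases.Lease;

public class AcquisitiveLeaseSelection {
    // Sketch of the idea: treat leases with no owner (evicted during a
    // graceful shutdown) as immediately takeable, in addition to leases
    // whose renewal has timed out.
    static List<Lease> takeableLeases(List<Lease> allLeases,
                                      long leaseDurationNanos,
                                      long nowNanos) {
        return allLeases.stream()
                .filter(lease -> isUnassigned(lease)
                        || lease.isExpired(leaseDurationNanos, nowNanos))
                .collect(Collectors.toList());
    }

    // Hypothetical helper: a lease whose owner has been evicted has a null owner.
    static boolean isUnassigned(Lease lease) {
        return lease.leaseOwner() == null;
    }
}
```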

Does this make sense?! Or could it be plagued by all sorts of race conditions or complexity that you're probably very careful to avoid?! Just from an educational point of view, I'd be interested to learn what problems the approach has.

bjrara commented 2 years ago

Since failoverTimeMillis is set to 30s, all the leases should be treated as expired after that duration if the previous owner fails to renew them. In the case described here, the old instances stopped at 16:57:40, meaning their leases were available to be picked up at 16:58:10. However, according to the graph, the new instances actually started taking more leases at 16:58:30, and for one instance it happened as late as 16:59:30, a difference of 50 to 110 seconds.

So I think there are other reasons causing the slow acquisition.

One thing I noticed is how often the lease taker is scheduled to run. IIUC, the lease-taking interval is by default 2x failoverTimeMillis; in your case, that was 1 minute. https://github.com/awslabs/amazon-kinesis-client/blob/a3e51d55959c0ce364cc6816386fb1b7cbc8a852/amazon-kinesis-client/src/main/java/software/amazon/kinesis/leases/dynamodb/DynamoDBLeaseCoordinator.java#L161

If all the new instances happened to execute the LeaseTaker just before the leases expired (16:58:10), we would have to wait until 16:59:10 to observe the first lease acquisition event.
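To spell out the arithmetic (assuming the interval really is roughly 2x the lease duration, as that line suggests, and ignoring the small epsilon term):

```java
public class LeaseTakerTimeline {
    public static void main(String[] args) {
        long failoverTimeMillis = 30_000;
        long takerIntervalMillis = 2 * failoverTimeMillis; // ~60s between takeLeases() runs

        // Old workers stop renewing at 16:57:40, so their leases expire ~16:58:10.
        // Worst case: a new worker's LeaseTaker ran just before 16:58:10, so its
        // next run is ~16:59:10, i.e. up to failover time plus one taker interval
        // can elapse before it even attempts to take an expired lease.
        long worstCaseWaitMillis = failoverTimeMillis + takerIntervalMillis;
        System.out.println("Worst-case wait after worker stop: " + worstCaseWaitMillis + " ms"); // 90000 ms
    }
}
```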

That may help explain why 2 of the new instances started to take leases around 16:58:30.

However, it seems to me that the new instances didn't consider all the leases expired at 16:58:30, and the leases were not balanced among the 3 workers. That is not what I would expect.

It would be very helpful if you could post more LeaseTaker logs to help us understand how the cluster made its decisions at that time.

rtyley commented 2 years ago

Thanks for your response @bjrara ! If you have time, I'd be very grateful if you could review PR https://github.com/awslabs/amazon-kinesis-client/pull/848, as I believe it could be a good solution to the failover problem for the common case where current lease-holders shut down gracefully, while retaining the current failover-timeout behaviour for cases where lease-holders do not shut down gracefully.

> It would be very helpful if you could post more LeaseTaker logs to help us understand how the cluster made its decisions at that time.

I don't have the logs from the deploy mentioned in the original description of this issue (where failoverTimeMillis was 30s), but I have grabbed current production logs from a fresh deploy today (where failoverTimeMillis has been reduced to the default value of 10s to ameliorate this issue).

[graph: lease handover during a fresh deploy with failoverTimeMillis = 10s; last old instance terminated at 10:27:04, all shards processed by new instances at 10:27:31]

The issue here is less exaggerated because failoverTimeMillis is lower, but it is still present: 27s elapses between the last old EC2 instance being terminated (at 10:27:04) and all shards being processed by the new EC2 instances (10:27:31). That delay is roughly 3x what a failoverTimeMillis of 10s might suggest, though given the expression you pointed out for calculating takerIntervalMillis, it is expected behaviour. That expression has been around for 9 years, since KCL v1.0.0, but I haven't seen any explanation of why it's a good idea?
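Running the same back-of-the-envelope numbers for this deploy (again assuming the roughly 2x relationship between the taker interval and failoverTimeMillis):

```java
public class LeaseTakerTimelineDefaultFailover {
    public static void main(String[] args) {
        long failoverTimeMillis = 10_000;                    // the 10s default
        long takerIntervalMillis = 2 * failoverTimeMillis;   // ~20s between takeLeases() runs
        long worstCaseWaitMillis = failoverTimeMillis + takerIntervalMillis;

        // The observed 27s handover (10:27:04 to 10:27:31) sits just inside this
        // ~30s worst case, so the extra delay is explained by the taker interval
        // rather than by failoverTimeMillis alone.
        System.out.println("Worst-case wait: " + worstCaseWaitMillis + " ms"); // 30000 ms
    }
}
```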

The thing is, I don't believe there's really any reason why we have to wait for failoverTimeMillis to elapse when an old worker has shut down gracefully. Why can't the old worker just mark the lease as no-longer-owned (i.e. unset leaseOwner in the DynamoDB lease table) during its graceful shutdown, with newer workers treating leases without owners as available to take, as in https://github.com/awslabs/amazon-kinesis-client/pull/848?

bjrara commented 2 years ago

@rtyley I agree that lease eviction can definitely help expedite the process of lease transfer. Just to let you know, I'm not a maintainer of this repository, but I'd be happy to add my vote to your PR if that helps when the KCL team evaluates the changes.

mathieufortin01 commented 1 year ago

Hey there. Any updates on your side regarding this situation? We suffer from this in production, and I'm leaning toward integrating your PR, as well as completing the work described in your original comment, and seeing where that leads.