awslabs / amazon-kinesis-client

Client library for Amazon Kinesis
Apache License 2.0

Race condition in KCL v2 graceful shutdown #542

Open vtlkvl opened 5 years ago

vtlkvl commented 5 years ago

When a graceful shutdown is requested via Scheduler.startGracefulShutdown, all active leases are often removed from Scheduler.shardInfoShardConsumerMap before the record processors have finished shutting down and before GracefulShutdownContext.shutdownCompleteLatch has counted down to 0. This causes a problem in GracefulShutdownCallable.waitForRecordProcessors:

while (!context.shutdownCompleteLatch().await(1, TimeUnit.SECONDS)) {
    if (Thread.interrupted()) {
        throw new InterruptedException();
    }
    log.info(awaitingFinalShutdownMessage(context));
    if (workerShutdownWithRemaining(context.shutdownCompleteLatch().getCount(), context)) {
        return false;
    }
}

Under normal conditions, the shutdown-complete latch should eventually count down to 0, and the future returned by Scheduler.startGracefulShutdown should yield true. Because of the race condition, the latch still holds a non-zero value when GracefulShutdownCallable.workerShutdownWithRemaining is called, and that method returns true because Scheduler.shardInfoShardConsumerMap is already empty, even though the Scheduler has not finished its shutdown process. As a result, the future returned by Scheduler.startGracefulShutdown yields false. As a workaround, to be notified when shutdown completes, check Scheduler.shutdownComplete in a loop until it returns true.
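For anyone who needs the workaround, here is a minimal sketch of the polling loop. ShutdownPoller and awaitShutdownComplete are hypothetical names, not part of KCL. The helper takes a plain BooleanSupplier so it compiles without the library; it assumes the flag mentioned above can be passed as scheduler::shutdownComplete. The timeout is an addition so the caller cannot block forever.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of KCL: polls a completion flag until it
// turns true or a timeout elapses.
public final class ShutdownPoller {
    private ShutdownPoller() {}

    /**
     * Returns true once isComplete reports true, or false if the timeout
     * elapses first. Throws InterruptedException if the waiting thread is
     * interrupted while sleeping between polls.
     */
    public static boolean awaitShutdownComplete(BooleanSupplier isComplete,
                                                Duration pollInterval,
                                                Duration timeout)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!isComplete.getAsBoolean()) {
            // Overflow-safe comparison of nanoTime values.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(pollInterval.toMillis());
        }
        return true;
    }

    // Assumed usage (requires KCL on the classpath):
    //   scheduler.startGracefulShutdown();
    //   boolean done = ShutdownPoller.awaitShutdownComplete(
    //           scheduler::shutdownComplete,
    //           Duration.ofMillis(500), Duration.ofSeconds(30));
}
```

With this approach the result of the future returned by Scheduler.startGracefulShutdown is ignored and only the completion flag is trusted. The poll interval and timeout values shown in the usage comment are arbitrary.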

BobbyJohansen commented 5 years ago

Is it possible that this causes a shardRecordProcessor not to finish shutting down and to continue executing ProcessTasks instead?

vtlkvl commented 5 years ago

Not sure, at least we haven't observed it. Based on my personal analysis of the code, it should not happen.

aggarwal commented 5 years ago

I think some shard consumer is stuck in a failure loop and holding up the graceful shutdown. See #616 for ideas to debug this further.

gabrielfmagalhaes commented 5 months ago

I believe this is related: https://github.com/awslabs/amazon-kinesis-client/pull/1302