awslabs / amazon-kinesis-client

Client library for Amazon Kinesis
Apache License 2.0

Fix a race condition between ShardConsumer shutdown and initialization #1319

Closed: akidambisrinivasan closed this 2 months ago

akidambisrinivasan commented 2 months ago

When Kinesis shards have no data, a race condition can occur in which shard-end record processing on the RecordProcessorThread interleaves with the Scheduler performing initialization. This causes the ShardConsumer to make an incorrect state transition during initialization (PROCESSING -> SHUTTING_DOWN), and then, during shutdown handling, to move from SHUTTING_DOWN -> SHUTDOWN_COMPLETE without running the ShutdownTask.
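To make the interleaving concrete, here is a deliberately simplified Java model. Everything in it (`ToyShardConsumer`, `ToyState`, and the assumed rule that a late `initializationComplete()` call advances PROCESSING to SHUTTING_DOWN without submitting a ShutdownTask) is hypothetical and for illustration only; it is not KCL's actual ShardConsumer code. Calling the methods in a fixed order replays one racy schedule deterministically instead of relying on thread timing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified states; the real KCL state machine has more.
enum ToyState { INITIALIZING, PROCESSING, SHUTTING_DOWN, SHUTDOWN_COMPLETE }

// Toy stand-in for a shard consumer (NOT the real KCL ShardConsumer).
// Methods are called in a fixed order to replay one interleaving deterministically.
final class ToyShardConsumer {
    ToyState state = ToyState.INITIALIZING;
    boolean shardEndReached = false;  // set by the "record processor thread"
    boolean shutdownTaskRan = false;  // did the toy ShutdownTask execute?
    final List<String> transitions = new ArrayList<>();

    private void moveTo(ToyState next) {
        transitions.add(state + " -> " + next);
        state = next;
    }

    // "Scheduler": async initialization step. Assumption: if it runs late and
    // sees that shard end was reached, it advances the state on its own.
    void initializationComplete() {
        if (state == ToyState.INITIALIZING) {
            moveTo(ToyState.PROCESSING);
        } else if (state == ToyState.PROCESSING && shardEndReached) {
            moveTo(ToyState.SHUTTING_DOWN); // no ShutdownTask is submitted here
        }
    }

    // "Record processor thread": the empty shard's shard-end record is delivered.
    void onShardEnd() {
        shardEndReached = true;
    }

    // Shutdown handling. Assumption: the ShutdownTask only runs when leaving
    // PROCESSING; a consumer already in SHUTTING_DOWN is treated as finished.
    void handleShutdown() {
        if (state == ToyState.PROCESSING) {
            moveTo(ToyState.SHUTTING_DOWN);
            shutdownTaskRan = true;
            moveTo(ToyState.SHUTDOWN_COMPLETE);
        } else if (state == ToyState.SHUTTING_DOWN) {
            moveTo(ToyState.SHUTDOWN_COMPLETE); // ShutdownTask skipped
        }
    }
}
```

Calling `initializationComplete()`, `onShardEnd()`, then `handleShutdown()` runs the toy ShutdownTask. Inserting one extra `initializationComplete()` after `onShardEnd()` instead ends in SHUTDOWN_COMPLETE with `shutdownTaskRan` still false, which mirrors the incorrect transitions described above.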

This can prevent the ShardConsumer from performing the shutdown processing that is required to unblock processing of the child shard. The child shard could then stay blocked forever, unless the lease for the parent shard moves to a new worker and that worker does not run into the race condition.
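The parent/child dependency can be sketched as a gate that the child checks before it starts. This is a hypothetical model: `ToyLeaseTable`, `ToyParentGate`, the shard IDs, and the assumption that a parent records a `SHARD_END` checkpoint only after its shutdown processing completes are illustrative stand-ins, not KCL's lease classes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical lease-table stand-in: shardId -> checkpoint value.
// Assumption: a parent that finished its shutdown processing records "SHARD_END".
final class ToyLeaseTable {
    static final String SHARD_END = "SHARD_END";
    private final Map<String, String> checkpoints = new HashMap<>();

    void checkpoint(String shardId, String value) {
        checkpoints.put(shardId, value);
    }

    String checkpointOf(String shardId) {
        return checkpoints.get(shardId); // null if the shard has no entry
    }
}

// Toy gate a child shard checks before it starts processing.
// This toy ignores other cases, such as a parent lease that was already deleted.
final class ToyParentGate {
    private final ToyLeaseTable leases;

    ToyParentGate(ToyLeaseTable leases) {
        this.leases = leases;
    }

    // The child may start only if every parent has reached SHARD_END.
    // A parent whose shutdown processing never ran keeps blocking the child.
    boolean canStartChild(List<String> parentShardIds) {
        for (String parent : parentShardIds) {
            if (!ToyLeaseTable.SHARD_END.equals(leases.checkpointOf(parent))) {
                return false;
            }
        }
        return true;
    }
}
```

If the parent's shutdown processing is skipped, its checkpoint in this toy never reaches `SHARD_END`, so `canStartChild` keeps returning false until something else completes that processing for the parent.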

This patch fixes the race condition as follows:

The initializationComplete invocation is not needed after needsInitialization has been set to false. initializationComplete is meant to perform initialization asynchronously; once initialization is done, the async task is a no-op on the happy path, but during a race it can perform an incorrect state transition.
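A minimal sketch of the guard described above, assuming a scheduler loop that calls a `tick()` method repeatedly. `GuardedInitializer`, `tick()`, and `initializationCalls` are hypothetical names; only `needsInitialization` and `initializationComplete` come from the PR text, and this is not the patch itself.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the guard described in the PR; NOT the KCL source.
final class GuardedInitializer {
    private volatile boolean needsInitialization = true;
    final AtomicInteger initializationCalls = new AtomicInteger();

    // Stand-in for the async initialization step.
    private CompletableFuture<Void> initializationComplete() {
        return CompletableFuture.runAsync(() -> {
            initializationCalls.incrementAndGet();
            needsInitialization = false;
        });
    }

    // Called repeatedly by a scheduler loop. Once initialization has finished,
    // the async step is no longer submitted, so it cannot run late and race
    // with shard-end or shutdown handling.
    // Assumption: tick() is not called again until its previous future completes.
    CompletableFuture<Void> tick() {
        if (needsInitialization) {
            return initializationComplete();
        }
        return CompletableFuture.completedFuture(null);
    }
}
```

Calling `tick().join()` several times submits the initialization step exactly once. A real scheduler would additionally need to track an in-flight future so that two ticks cannot both submit the step before the first one finishes.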

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Issue: #837

akidambisrinivasan commented 2 months ago

Is there a unit test we can add for this, so that we can make sure this change is tested? Otherwise, looks good.

I'm looking into it. It's hard to reproduce the race in a unit test because the ShardConsumer constructs its own subscriber. I wanted to get the fix in first; I am continuing to investigate whether a unit test is possible and, if so, will add it in a separate patch.