awslabs / dynamodb-streams-kinesis-adapter

The Amazon DynamoDB Streams Adapter implements the Amazon Kinesis interface so that your application can use KCL to consume and process data from a DynamoDB stream.
Apache License 2.0

Cannot find the shard given the shardId - Non-stop log spam #21

Closed: rwightman closed this issue 3 years ago

rwightman commented 5 years ago

I was having issues as described in #20 for the longest time. Finally, with the latest updates to this library and KCL 1.9.2, I no longer seem to be getting stuck streams.

However, I am constantly seeing these logs spamming at a warning level:

2018-10-23 17:41:56 WARN c.a.s.d.s.DynamoDBStreamsProxy - Cannot find the shard given the shardId shardId-xxxxx
2018-10-23 17:41:56 WARN c.a.s.k.c.lib.worker.ProcessTask - Cannot get the shard for this ProcessTask, so duplicate KPL user records in the event of resharding will not be dropped during deaggregation of Amazon Kinesis records.

I've looked at some issues in the KCL library but found nothing providing a solid answer to my situation. It seems somehow related to the way the DynamoDB Streams proxy works, as I wasn't seeing this log spam on the KCL side until I updated to the latest version of this code and started using the new construction method.
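For reference, my worker construction follows the README sample, roughly like this (a sketch from memory, so the exact constructor overloads, config values, and application/worker names here are approximate, not my literal code):

    import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
    import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
    import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
    import com.amazonaws.services.dynamodbv2.streamsadapter.AmazonDynamoDBStreamsAdapterClient;
    import com.amazonaws.services.dynamodbv2.streamsadapter.StreamsWorkerFactory;
    import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessorFactory;
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;

    public class StreamsWorkerSetup {
        public static Worker buildWorker(String streamArn, IRecordProcessorFactory recordProcessorFactory) {
            DefaultAWSCredentialsProviderChain credentials = new DefaultAWSCredentialsProviderChain();

            // Adapter client that exposes the DynamoDB stream through the Kinesis interface.
            AmazonDynamoDBStreamsAdapterClient adapterClient =
                    new AmazonDynamoDBStreamsAdapterClient(credentials);

            AmazonDynamoDB dynamoDb = AmazonDynamoDBClientBuilder.defaultClient();
            AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();

            // The stream ARN takes the place of a Kinesis stream name in the KCL configuration.
            KinesisClientLibConfiguration workerConfig = new KinesisClientLibConfiguration(
                    "my-streams-app", streamArn, credentials, "worker-1")
                    .withInitialPositionInStream(InitialPositionInStream.TRIM_HORIZON);

            // The "new construction method": let the factory build the Worker so it wires in
            // the DynamoDB Streams-specific proxy instead of the plain Kinesis one.
            return StreamsWorkerFactory.createDynamoDbStreamsWorker(
                    recordProcessorFactory, workerConfig, adapterClient, dynamoDb, cloudWatch);
        }
    }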

I've checked the leases in DynamoDB, and I've often seen only one lease entry for the shard being complained about. It's a very simple setup right now, based on the sample code: two worker threads in a single process, processing the stream, and usually only one shard.
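In case it matters, this is roughly how I'm dumping the lease table to check (a sketch; the table name is just my KCL application name, and leaseKey/leaseOwner are the lease attributes I'm looking at):

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
    import com.amazonaws.services.dynamodbv2.model.ScanRequest;
    import com.amazonaws.services.dynamodbv2.model.ScanResult;

    public class DumpLeases {
        public static void main(String[] args) {
            AmazonDynamoDB dynamoDb = AmazonDynamoDBClientBuilder.defaultClient();
            // KCL creates its lease table using the application name from the worker config.
            ScanResult result = dynamoDb.scan(new ScanRequest().withTableName("my-streams-app"));
            // Each item is one lease; "leaseKey" holds the shardId, "leaseOwner" the worker holding it.
            result.getItems().forEach(item ->
                    System.out.println(item.get("leaseKey") + " -> " + item.get("leaseOwner")));
        }
    }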

I've seen:

klesniewski commented 5 years ago

Looks like a duplicate of: https://github.com/awslabs/amazon-kinesis-client/issues/55

rwightman commented 5 years ago

@klesniewski yes, I noticed that issue. In my situation specifically, though, I wasn't hitting the warning with any sort of frequency until after the problems you noted with the streams getting stuck were fixed. That suggests that perhaps the sequence of events that was causing the stuck shards now fairly reliably puts things in a state where the shardId warnings are triggered...

klesniewski commented 5 years ago

From what I understood from awslabs/amazon-kinesis-client#55, the issue was introduced with KinesisProxy, the same one that was used to resolve #20. The problem seems to be caused by KinesisProxy keeping a cached list of shards and not refreshing it when a lease is stolen.
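To illustrate what I mean, the behaviour boils down to something like this (a simplified sketch of my understanding, not the actual KinesisProxy source; the class and method names here are made up):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Supplier;

    // Simplified model of the shard cache behaviour described above.
    class ShardCache {
        private final Supplier<List<String>> listShards; // stands in for the DescribeStream call
        private Map<String, String> shardsById;          // shardId -> shard description

        ShardCache(Supplier<List<String>> listShards) {
            this.listShards = listShards;
        }

        synchronized String getShard(String shardId) {
            // Current behaviour (as I understand it): the cache is filled once, so a shard whose
            // lease was just stolen from another worker may be missing, the lookup returns null,
            // and the caller logs "Cannot find the shard given the shardId".
            if (shardsById == null) {
                shardsById = load();
            }
            String shard = shardsById.get(shardId);
            if (shard == null) {
                // What would avoid the warning: refresh the cache whenever a lookup misses.
                shardsById = load();
                shard = shardsById.get(shardId);
            }
            return shard;
        }

        private Map<String, String> load() {
            Map<String, String> fresh = new HashMap<>();
            for (String id : listShards.get()) {
                fresh.put(id, "shard " + id);
            }
            return fresh;
        }
    }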

mrhota commented 4 years ago

I think I have the same issue, although we also see non-stop ERROR level spam like:

ERROR [2020-02-27 13:02:45,382] [RecordProcessor-2873] c.a.s.k.c.lib.worker.InitializeTask: Caught exception: 
com.amazonaws.services.kinesis.clientlibrary.exceptions.internal.KinesisClientLibIOException: Unable to fetch checkpoint for shardId shardId-00000001582460850801-53f6f94b
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.getCheckpointObject(KinesisClientLibLeaseCoordinator.java:286)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitializeTask.call(InitializeTask.java:82)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)