awslabs / amazon-kinesis-client

Client library for Amazon Kinesis
Apache License 2.0
641 stars 463 forks source link

KCL doesn't handle lease record delete in DynamoDB #461

Open logarithm opened 5 years ago

logarithm commented 5 years ago

If we delete lease record in DDB while consumers are running, then neither consumer which was holding lease nor any other currently running consumers consume from that shard. Only restart helps.

2018-11-03 02:56:01,138 INFO  [main] software.amazon.kinesis.leases.dynamodb.DynamoDBLeaseCoordinator - With failover time 1000 ms and epsilon 25 ms, LeaseCoordinator will renew leases every 308 ms, takeleases every 2050 ms, process maximum of 2147483647 leases and steal 1 lease(s) at a time.
2018-11-03 02:56:01,143 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Initialization attempt 1
2018-11-03 02:56:01,144 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Initializing LeaseCoordinator
2018-11-03 02:56:02,291 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Syncing Kinesis shard info
2018-11-03 02:56:02,891 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Starting LeaseCoordinator
2018-11-03 02:56:05,136 INFO  [LeaseCoordinator-0000] software.amazon.kinesis.leases.dynamodb.DynamoDBLeaseTaker - Worker 32316866-5972-4fd6-89c0-b9e0442c23eb saw 1 total leases, 1 available leases, 1 workers. Target is 1 leases, I have 0 leases, I will take 1 leases
2018-11-03 02:56:05,214 INFO  [LeaseCoordinator-0000] software.amazon.kinesis.leases.dynamodb.DynamoDBLeaseTaker - Worker 32316866-5972-4fd6-89c0-b9e0442c23eb successfully took 1 leases: shardId-000000000000
2018-11-03 02:56:12,962 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Initialization complete. Starting worker loop.
2018-11-03 02:56:12,974 INFO  [Thread-1] com.max.plain.KinesisConsumerMain - Creating new shard record processor #1
2018-11-03 02:56:12,996 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Created new shardConsumer for : ShardInfo(shardId=shardId-000000000000, concurrencyToken=a7998b87-1c8d-4be9-8886-9dcee690cd1d, parentShardIds=[], checkpoint={SequenceNumber: 49589774881029481079468922693338579768311814635187601410,SubsequenceNumber: 0})
2018-11-03 02:56:12,999 INFO  [ShardRecordProcessor-0000] software.amazon.kinesis.lifecycle.BlockOnParentShardTask - No need to block on parents [] of shard shardId-000000000000
2018-11-03 02:56:14,073 INFO  [ShardRecordProcessor-0000] software.amazon.kinesis.retrieval.polling.KinesisDataFetcher - Initializing shard shardId-000000000000 with 49589774881029481079468922693338579768311814635187601410
2018-11-03 02:56:14,143 INFO  [ShardRecordProcessor-0000] software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher - Starting prefetching thread.
2018-11-03 02:56:14,144 INFO  [ShardRecordProcessor-0000] com.max.plain.KinesisConsumerMain - Initializing for shardId shardId-000000000000, sequence: {SequenceNumber: 49589774881029481079468922693338579768311814635187601410,SubsequenceNumber: 0}
// Processing records
// Delete record in DDB
2018-11-03 02:57:05,685 INFO  [LeaseRenewer-0000] software.amazon.kinesis.leases.dynamodb.DynamoDBLeaseRenewer - Worker 32316866-5972-4fd6-89c0-b9e0442c23eb lost lease with key shardId-000000000000
2018-11-03 02:57:05,798 INFO  [ForkJoinPool.commonPool-worker-1] com.max.plain.KinesisConsumerMain - Lost lease, so terminating
2018-11-03 02:57:05,798 INFO  [prefetch-cache-shardId-000000000000-0000] software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher - Thread was interrupted, indicating shutdown was called on the cache.

Second consumer:

2018-11-03 02:56:27,486 INFO  [main] software.amazon.kinesis.leases.dynamodb.DynamoDBLeaseCoordinator - With failover time 1000 ms and epsilon 25 ms, LeaseCoordinator will renew leases every 308 ms, takeleases every 2050 ms, process maximum of 2147483647 leases and steal 1 lease(s) at a time.
2018-11-03 02:56:27,491 INFO  [main] com.max.plain.KinesisConsumerMain - Press enter to shutdown
2018-11-03 02:56:27,492 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Initialization attempt 1
2018-11-03 02:56:27,492 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Initializing LeaseCoordinator
2018-11-03 02:56:28,712 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Syncing Kinesis shard info
2018-11-03 02:56:29,355 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Starting LeaseCoordinator
2018-11-03 02:56:39,424 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Initialization complete. Starting worker loop.
2018-11-03 02:57:27,569 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - No activities assigned
2018-11-03 02:57:27,569 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Sleeping ...
2018-11-03 02:58:28,771 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - No activities assigned
2018-11-03 02:58:28,771 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Sleeping ...
2018-11-03 02:59:29,947 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - No activities assigned
2018-11-03 02:59:29,947 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Sleeping ...
2018-11-03 03:00:31,133 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - No activities assigned
2018-11-03 03:00:31,133 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Sleeping ...
2018-11-03 03:01:32,317 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - No activities assigned
2018-11-03 03:01:32,317 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Sleeping ...
2018-11-03 03:02:33,475 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - No activities assigned
2018-11-03 03:02:33,475 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Sleeping ...
2018-11-03 03:03:34,664 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - No activities assigned
2018-11-03 03:03:34,665 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Sleeping ...
2018-11-03 03:04:35,856 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - No activities assigned
2018-11-03 03:04:35,857 INFO  [Thread-1] software.amazon.kinesis.coordinator.Scheduler - Sleeping ...
pfifer commented 5 years ago

What is your goal in deleting the lease from DynamoDB?

logarithm commented 5 years ago

It's not our goal to delete lease from DynamoDB, just highlighted the case when it's silently dies, but I think it might be better if it fails with error, or even handle that situation, because data lost can happen, despite that it's a rare case.

dbarvitsky commented 3 years ago

The use case for cleaning the DynamoDB is to start a hot replay of the consumer without restarting it. Less of an issue for us, because our consumers are running in Kubernetes; so we just scale the appropriate stateful set to 0, wipe out the table, then restart them with appropriate point to replay from. For someone running them in actual EC2s restarting nodes / re-deploying to replay may be annoying.