awslabs / kinesis-kafka-connector

kinesis-kafka-connector is a connector based on Kafka Connect that publishes messages to Amazon Kinesis streams or Amazon Kinesis Firehose.
Apache License 2.0

Guaranteed Delivery / Avoid Duplicate Records #9

Open pradeepcheers opened 7 years ago

pradeepcheers commented 7 years ago

Guaranteed Delivery - Since the connector uses callbacks, we can't guarantee delivery of the messages. How can we make sure the records are delivered? Currently the connector follows fire-and-forget semantics.

Avoid duplication - Since we fetch records from Kafka in batches, if the KPL fails due to some connectivity issue we end up having duplicate records in Kinesis (because we cannot control the Kafka offset).

benalta commented 7 years ago

Can you describe your use case? Why don't you use Kinesis Firehose as a destination?

pradeepcheers commented 7 years ago

@benalta

Thanks for your reply.

I have some events in Kafka topics and I have to publish them to Kinesis streams, and Kinesis Firehose is not an option for me! So I'm planning to use this connector. However, in general this connector doesn't guarantee exactly-once delivery. We either end up with duplicates or lose messages in case of an exception while processing a batch of records.

benalta commented 7 years ago

Thanks. You will need to implement de-dup functionality. The KPL has a retry mechanism, but the scenario you describe may still end with duplication. Bear in mind that we use the Kafka Connect library, so you may need to implement it there as well.

jessecollier commented 7 years ago

I too have this concern. Here's what I'm doing that may help. Warning: this may reduce your throughput capacity and/or require more shards to stay within shard limits.

I've not tested this, but based on my understanding it should get you to exactly-once, moving the commit offset in Kafka forward only after the record has been produced to Kinesis.
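Roughly, the idea is to disable KPL aggregation and flush synchronously. A minimal sketch of the sink worker properties this implies (the property names here are my assumption from the connector's README and may differ in your version, so verify before using):

```properties
# Assumed property names -- check the connector's README for your version.
# Disable KPL aggregation so one Kafka record maps to one Kinesis record.
aggregation=false
# Flush synchronously so the task only advances Kafka offsets after
# Kinesis has acknowledged the records.
flushSync=true
```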

@nehalmehta can you confirm this approach ?

nehalmehta commented 7 years ago

Hi @jessecollier,

Exactly-once delivery (guaranteed) is difficult to achieve in any streaming service. Changing aggregation to false and using synchronous flush will absolutely help, but cannot guarantee exactly-once delivery. Let's take one of the edge cases: the connector sends a message to Kinesis using synchronous flush and without aggregation, but due to network lag it does not get a response within the request timeout period. The connector will then assume the delivery failed and attempt to publish the message one more time. In some cases the original request may still have reached the Kinesis endpoint after the timeout expired, and so we will have duplication.
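To make the edge case concrete, here is a toy simulation (not the connector's actual code; `FlakyEndpoint` and `send_with_retry` are hypothetical names) showing how an at-least-once retry after a lost ack lands the same record twice:

```python
class FlakyEndpoint:
    """Toy stand-in for a Kinesis endpoint: the first write lands on the
    server, but its acknowledgement arrives after the client's timeout."""

    def __init__(self):
        self.received = []  # records that actually landed server-side
        self.calls = 0

    def put_record(self, record, ack_lost):
        self.calls += 1
        self.received.append(record)  # the write lands either way
        if ack_lost:
            # client never sees the ack, so from its view the call failed
            raise TimeoutError("ack not received within request timeout")
        return "ok"


def send_with_retry(endpoint, record, retries=1):
    """At-least-once producer: on timeout, retry the same record."""
    for attempt in range(retries + 1):
        try:
            # simulate the ack being lost only on the first attempt
            return endpoint.put_record(record, ack_lost=(attempt == 0))
        except TimeoutError:
            continue
    raise RuntimeError("gave up after retries")


endpoint = FlakyEndpoint()
send_with_retry(endpoint, {"id": "evt-1"})
print(endpoint.received)  # -> [{'id': 'evt-1'}, {'id': 'evt-1'}]
```

The client did the "right" thing both times, yet the endpoint holds two copies, which is why duplication has to be handled downstream.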

In my opinion, it is better to have de-duplication functionality on the consumer side in this use case.

Thanks, Nehal