pradeepcheers opened this issue 7 years ago
Can you advise your use case? Why don't you use Kinesis Firehose as a destination?
@benalta
Thanks for your reply.
I have some events in Kafka topics that I have to publish to Kinesis streams, and Kinesis Firehose is not an option for me, so I'm planning to use this connector. However, in general this connector doesn't guarantee exactly-once delivery: we either end up with duplicates or lose messages if an exception occurs while processing a batch of records.
Thanks. You will need to implement de-dup functionality. KPL has a retry mechanism, but the scenario you describe may still end in duplication. Bear in mind that we use the Kafka Connect library, so you may need to implement it there as well.
I have this concern too. Here's what I'm doing, which may help. Warning: this may reduce your throughput and/or require more shards to stay within shard limits.
1. Use the synchronous flush method (to address fire-and-forget). If/until it is merged you may have to fork, as I am doing: https://github.com/awslabs/kinesis-kafka-connector/pull/12/files#diff-1cefe86e4e91f4a5c9a7637008435f10R84
2. Change the aggregation configuration to false so that each UserRecord is sent in its own Kinesis record.
I've not tested this, but based on my understanding it should get you to exactly-once, moving the Kafka commit offset forward only once the records have been produced to Kinesis. Roughly, something like the sketch below is what I have in mind.
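A minimal sketch of the idea, for discussion only. `setAggregationEnabled(false)`, `addUserRecord`, and `flushSync()` are real KPL calls, but the stream name, partition key, and the batch standing in for a `SinkTask.put()` call are all illustrative, not the connector's actual code:

```java
import com.amazonaws.services.kinesis.producer.KinesisProducer;
import com.amazonaws.services.kinesis.producer.KinesisProducerConfiguration;
import com.amazonaws.services.kinesis.producer.UserRecordResult;
import com.google.common.util.concurrent.ListenableFuture;

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;

public class SyncFlushSketch {

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        // Disable KPL aggregation so each UserRecord becomes its own Kinesis record.
        KinesisProducerConfiguration config = new KinesisProducerConfiguration()
                .setRegion("us-east-1")
                .setAggregationEnabled(false);
        KinesisProducer producer = new KinesisProducer(config);

        // Hypothetical batch standing in for the records of one put() call.
        List<ListenableFuture<UserRecordResult>> futures = new ArrayList<>();
        for (String payload : new String[] {"event-1", "event-2"}) {
            futures.add(producer.addUserRecord(
                    "my-stream",                          // destination stream (illustrative)
                    payload,                              // partition key (illustrative)
                    ByteBuffer.wrap(payload.getBytes())));
        }

        // Block until everything buffered so far is sent, instead of fire-and-forget.
        producer.flushSync();

        // Surface any per-record failure before the Kafka offset is allowed to advance.
        for (ListenableFuture<UserRecordResult> f : futures) {
            UserRecordResult result = f.get();
            if (!result.isSuccessful()) {
                throw new IllegalStateException("Record failed; do not commit the Kafka offset");
            }
        }
        producer.destroy();
    }
}
```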
@nehalmehta can you confirm this approach?
Hi @jessecollier,
Guaranteed exactly-once delivery is difficult to achieve in any streaming service. Changing aggregation to false and using synchronous flush will certainly help, but it cannot guarantee exactly-once delivery. Consider one edge case: the connector sends a message to Kinesis using synchronous flush and without aggregation, but due to network lag it does not get a response within the request timeout period. The connector will then assume delivery failed and attempt to publish again, yet the original request may still reach the Kinesis endpoint after the timeout, in which case we end up with a duplicate.
In my opinion it is better to have de-duplication functionality on the consumer side for this use case.
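For illustration, consumer-side de-duplication could look roughly like the sketch below. It assumes each message carries a producer-assigned unique ID; both that ID scheme and the window size are hypothetical, not part of the connector:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Drops messages whose ID has already been seen, keeping a bounded LRU window.
 * Assumes the producer stamps every message with a unique ID; the ID scheme
 * and the window size below are illustrative only.
 */
public class Deduplicator {

    private static final int MAX_TRACKED_IDS = 100_000;

    // LinkedHashMap in access-order mode evicts the least recently seen ID.
    private final Map<String, Boolean> seen =
            new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > MAX_TRACKED_IDS;
                }
            };

    /** Returns true if this ID is new and the message should be processed. */
    public synchronized boolean firstSeen(String messageId) {
        return seen.put(messageId, Boolean.TRUE) == null;
    }
}
```

Note the trade-off: an in-memory window like this only catches duplicates that arrive close together; a durable store with conditional writes would tighten the guarantee at the cost of latency.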
Thanks, Nehal
Guaranteed delivery - Since the connector uses callbacks, we can't guarantee delivery of the messages. How can we ensure guaranteed delivery of the records? Currently the connector follows fire-and-forget semantics (see the sketch after these points).
Avoid duplication - Since we fetch records from Kafka in batches, if the KPL fails due to a connectivity issue we end up with duplicate records in Kinesis (because we cannot control the Kafka offset).
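For context, the fire-and-forget pattern described above looks roughly like this. It is a sketch of the callback style using real KPL and Guava APIs, not the connector's exact code:

```java
import com.amazonaws.services.kinesis.producer.KinesisProducer;
import com.amazonaws.services.kinesis.producer.UserRecordResult;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.MoreExecutors;

import java.nio.ByteBuffer;

public class FireAndForgetSketch {

    static void send(KinesisProducer producer, String stream, byte[] payload) {
        ListenableFuture<UserRecordResult> future =
                producer.addUserRecord(stream, "key", ByteBuffer.wrap(payload));

        // The callback only observes the outcome; send() has already returned,
        // so Kafka Connect may commit the offset before a failure ever surfaces.
        Futures.addCallback(future, new FutureCallback<UserRecordResult>() {
            @Override
            public void onSuccess(UserRecordResult result) {
                // Success is reported asynchronously.
            }

            @Override
            public void onFailure(Throwable t) {
                // By the time this fires, the offset may already have moved on.
            }
        }, MoreExecutors.directExecutor());
    }
}
```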