NetherlandsForensicInstitute / kafka-spout

Kafka consumer emitting messages as storm tuples
Apache License 2.0
103 stars 73 forks source link

Will HolmesNL/Kafka-Spout do support for Storm Clusters? #13

Closed harishshan closed 9 years ago

harishshan commented 9 years ago

I am already using nathanmarz/storm-contrib/storm-kafka, It is working fine for me as of now. Recently I heard about HolmesNL/Kafka-spout. I wanna clarify, whether Kakfa-Spout will support for Storm Clusters. If It will support, I need a simple document with example as you have already give examples for local cluster. Please do the need full. Thank you in advance :+1:

akaIDIOT commented 9 years ago

With "Storm Cluster", I suppose you mean a collection of worker processes spread out over multiple machines? If so: any spout should support that, including this one. There's no difference in code between configuring things for a local / test cluster or a real one. See the storm docs on submitting a topology to a production cluster, that uses a TopologyBuilder too.

harishshan commented 9 years ago

Thank akaIDIOT, Yes I mean the same for storm cluster. I need the guarantee that the Kafka Topic message should be processed only once on multiple workers over multiple machines. Messages should be processed on all machines workers.

akaIDIOT commented 9 years ago

This kafka spout uses the high level Kafka API (which is what sets it apart from the storm-contrib one). The high level API groups clients into a consumer group and hands out messages among these clients. Multiple instances of a kafka spout will be grouped in such a consumer group, guaranteeing that a message will be emitted from a spout only once in normal conditions. Note that this is a guarantee made by the Kafka high level consumer API. Normal conditions is an important distinction here, as the spout commits information about which messages have been processed to the Kafka brokers. If a node crashes after emitting some messages but before this fact has been committed, another spout instance or a new one created by Storm will emit these messages again. A second condition to note here is failed Storm tuples. The spout allows you to control what happens when a message is failed (like ignoring the failure, replaying the message or replaying it at most x times).

Note that storm itself can't guarantee exactly-once-processing. Using Trident you can achieve something like it, but I believe this spout won't plug and play with that, we might add that in the future.


Long story short: it won't blindly spam messages multiple times, but there's no guarantee that a bolt won't see a message twice in some conditions. These conditions typically apply when you run into massive network lag, broken hardware or other unexpected crashes.

harishshan commented 9 years ago

Thanks akaIDIOT