RuckusWirelessIL / pentaho-kafka-consumer

Apache Kafka consumer step plug-in for Pentaho Kettle
Apache License 2.0

Consumer reads messages more than once #34

Closed: virivigio closed this issue 3 years ago

virivigio commented 3 years ago

Hi, thank you for this plugin. I'm experiencing a behaviour I don't understand, and I'd like to figure out whether there are errors in my business logic. Here is my skeleton: (screenshot: skeleton) And I'm mostly using default values: (screenshot: defaults)

The transformation on the right is called every 1000 rows. Throughput is about 500 rows per second, which matches the message rate on the topic. Sometimes we have to stop the transformation for a while, so upon restart there is a lag, which can be considerable (400k messages). This means that for half an hour Kettle saturates the server resources to catch up. That is fine and it works. But in those cases, and only when the server is at full capacity, the Kafka consumer emits more messages than are actually in the Kafka topic, e.g.: (screenshot)

I understand this is acceptable in an at-least-once scenario, and that's fine: we discard duplicates, no harm done. But I'm just curious how those repetitions are related to performance. Why does this happen under "stress"? Is there any timeout I overlooked which makes the consumer "retry"?
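For context on discarding duplicates: since the pair (partition, offset) uniquely identifies a message within a topic, a check along these lines is enough. This is a minimal Java sketch of the idea; the class and method names are illustrative, not this plugin's API:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: discard re-delivered messages by remembering which
// (partition, offset) pairs were already seen. Names are illustrative.
public class OffsetDeduplicator {
    private final Set<String> seen = new HashSet<>();

    /** Returns true only the first time a (partition, offset) pair appears. */
    public boolean isFirstDelivery(int partition, long offset) {
        return seen.add(partition + ":" + offset);
    }
}
```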

Thank you Virgilio

spektom commented 3 years ago

Hi Virgilio,

auto.commit.interval.ms controls how often offsets are committed to ZooKeeper; the default value is 60 seconds (which is huge).

Of course, the number of duplicates correlates well with the current "stress": the more messages are read, the higher the chance that messages were read after offsets were committed for the last time. You may want to reduce the auto-commit interval to some lower value.
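For reference, these are the relevant knobs on Kafka's old high-level (ZooKeeper-based) consumer that this plugin wraps. A minimal sketch, where the connection values are placeholders:

```java
import java.util.Properties;

import kafka.consumer.ConsumerConfig;

// Sketch: old high-level consumer settings involved in auto-committing
// offsets to ZooKeeper. zookeeper.connect and group.id are placeholders.
public class AutoCommitSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");   // placeholder address
        props.put("group.id", "my-kettle-group");     // placeholder group
        props.put("auto.commit.enable", "true");      // default: true
        // Default is 60000 ms; committing every 5 s shrinks the window of
        // already-processed messages that get re-delivered after a restart.
        props.put("auto.commit.interval.ms", "5000");
        ConsumerConfig config = new ConsumerConfig(props);
    }
}
```

A shorter interval means more frequent writes to ZooKeeper, so there is a trade-off between commit overhead and the size of the duplicate window.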

The best way would be adding support for the Kafka low-level consumer, which would help avoid event duplication (a sketch of what manual offset handling looks like follows the list below), but:

1) I'm not sure what the best method is to store offsets in Kettle, and how to align them with retries, etc.
2) To facilitate exactly-once processing, the whole pipeline must support it.
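For illustration, manual offset handling, shown here with the newer KafkaConsumer API rather than the old low-level SimpleConsumer, would look roughly like this. This is a sketch of the idea, not something the plugin implements; the broker, group, and topic names are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");   // placeholder
        props.put("group.id", "kettle-manual-offsets");  // placeholder
        props.put("enable.auto.commit", "false");        // we commit ourselves
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // hand the row to the pipeline
                }
                // Commit only after the batch is fully processed, so a crash
                // re-delivers at most one uncommitted batch (at-least-once).
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("%d:%d %s%n", record.partition(), record.offset(), record.value());
    }
}
```

Note that committing after processing still only narrows the duplicate window to one uncommitted batch; true exactly-once would additionally require the downstream steps to be idempotent or transactional, which is point 2 above.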

So, the only option I see is really playing with the auto-commit interval.

virivigio commented 3 years ago

Great, now I have a better understanding. As I said, at-least-once is OK for my scenario. My regards, Virgilio