RuckusWirelessIL / pentaho-kafka-consumer

Apache Kafka consumer step plug-in for Pentaho Kettle
Apache License 2.0

Consumer reads messages more than once #34

Closed: virivigio closed this issue 3 years ago

virivigio commented 3 years ago

Hi, thank you for this plugin. I'm experiencing a behaviour I don't understand, and I'd like to figure out whether there are errors in my business logic. Here is my skeleton: (screenshot: skeleton) And I'm mostly using default values: (screenshot: defaults)

The transformation on the right is called every 1000 rows. Throughput is about 500 rows per second, which matches the message rate on the topic. Sometimes we have to stop the transformation for a while, so upon restart there is a lag, which can be considerable (400k messages). This means that for half an hour Kettle saturates the server resources to catch up. That is fine and it works. But in those cases, and only when the server is at full capacity, the Kafka consumer emits more messages than are actually in the Kafka topic, e.g.: (screenshot)

I understand this is acceptable in an at-least-once scenario, and that's fine: we discard duplicates, no harm done. But I'm just curious how those repetitions are related to performance. Why does this happen under "stress"? Is there any timeout I overlooked which makes the consumer "retry"?
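For context on discarding duplicates: since the pair (partition, offset) uniquely identifies a message within a topic, a check along these lines is enough. This is a minimal Java sketch of the idea; the class and method names are illustrative, not this plugin's API:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: discard re-delivered messages by remembering which
// (partition, offset) pairs were already seen. Names are illustrative.
public class OffsetDeduplicator {
    private final Set<String> seen = new HashSet<>();

    /** Returns true only the first time a (partition, offset) pair appears. */
    public boolean isFirstDelivery(int partition, long offset) {
        return seen.add(partition + ":" + offset);
    }
}
```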

Thank you Virgilio

spektom commented 3 years ago

Hi Virgilio,

auto.commit.interval.ms controls how often offsets are committed to ZooKeeper; the default value is 60 seconds (which is huge).

Of course, the number of duplicates correlates well with the current "stress": the more messages are read, the higher the chance that messages were read after offsets were committed for the last time. You may want to reduce the auto-commit interval to some lower value.
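For reference, these are the relevant knobs on Kafka's old high-level (ZooKeeper-based) consumer that this plugin wraps. A minimal sketch, where the connection values are placeholders:

```java
import java.util.Properties;

import kafka.consumer.ConsumerConfig;

// Sketch: old high-level consumer settings involved in auto-committing
// offsets to ZooKeeper. zookeeper.connect and group.id are placeholders.
public class AutoCommitSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");   // placeholder address
        props.put("group.id", "my-kettle-group");     // placeholder group
        props.put("auto.commit.enable", "true");      // default: true
        // Default is 60000 ms; committing every 5 s shrinks the window of
        // already-processed messages that get re-delivered after a restart.
        props.put("auto.commit.interval.ms", "5000");
        ConsumerConfig config = new ConsumerConfig(props);
    }
}
```

A shorter interval means more frequent writes to ZooKeeper, so there is a trade-off between commit overhead and the size of the duplicate window.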

The best way would be adding support for the Kafka low-level consumer, which would help avoid event duplication (a sketch of what manual offset handling looks like follows the list below), but:

1) I'm not sure what the best method is to store offsets in Kettle, and how to align them with retries, etc.
2) To facilitate exactly-once processing, the whole pipeline must support it.
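For illustration, manual offset handling, shown here with the newer KafkaConsumer API rather than the old low-level SimpleConsumer, would look roughly like this. This is a sketch of the idea, not something the plugin implements; the broker, group, and topic names are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");   // placeholder
        props.put("group.id", "kettle-manual-offsets");  // placeholder
        props.put("enable.auto.commit", "false");        // we commit ourselves
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // hand the row to the pipeline
                }
                // Commit only after the batch is fully processed, so a crash
                // re-delivers at most one uncommitted batch (at-least-once).
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("%d:%d %s%n", record.partition(), record.offset(), record.value());
    }
}
```

Note that committing after processing still only narrows the duplicate window to one uncommitted batch; true exactly-once would additionally require the downstream steps to be idempotent or transactional, which is point 2 above.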

So, the only option I see is really playing with the auto-commit interval.

virivigio commented 3 years ago

Great, now I have a better understanding. As I said, at-least-once is OK for my scenario. My regards, Virgilio