Hi Virgilio,
auto.commit.interval.ms
controls how often consumed offsets are committed to ZooKeeper; the default is 60 seconds, which is huge.
Of course, the number of duplicates correlates with the current load: the more messages are read, the higher the chance that additional messages were read after the offsets were last committed. You may want to reduce the auto-commit interval to a lower value.
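To make the relationship between load and duplicates concrete, here is a small sketch (the 500 msg/s figure comes from this thread; the rest is back-of-the-envelope arithmetic, not measured data). In the worst case, everything read since the last offset commit is re-delivered after a restart:

```java
import java.util.Properties;

public class CommitIntervalExample {
    public static void main(String[] args) {
        int messagesPerSecond = 500;       // throughput reported in this thread
        int commitIntervalMs = 60_000;     // old-consumer default for auto.commit.interval.ms

        // Worst case: every message read since the last commit is replayed on restart.
        int worstCaseDuplicates = messagesPerSecond * commitIntervalMs / 1000;
        System.out.println("Worst-case duplicates at 60s: " + worstCaseDuplicates); // 30000

        // Lowering the interval shrinks the replay window proportionally.
        Properties props = new Properties();
        props.put("auto.commit.interval.ms", "5000"); // 5s -> worst case ~2500 duplicates
        System.out.println(props.getProperty("auto.commit.interval.ms"));
    }
}
```

This also explains why duplicates show up mainly under stress: at full capacity more messages sit between two commits, so the replay window contains more events.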
The best solution would be adding support for Kafka's low-level consumer, which would help avoid event duplication, but:
1) I'm not sure what the best method is to store offsets in Kettle, or how to align them with retries, etc.
2) To achieve exactly-once processing, the whole pipeline must support it.
So the only option I see is really playing with the auto-commit interval.
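For reference, the modern Java client (kafka-clients) lets an application take over offset commits entirely, which is essentially what the low-level-consumer idea above would buy. A minimal configuration sketch (the commit call itself is only described in comments, since running it requires a live broker):

```java
import java.util.Properties;

public class ManualCommitConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Disable time-based auto-commit entirely; the application then calls
        // consumer.commitSync() only after the downstream step (e.g. the Kettle
        // transformation) has durably persisted the rows it received.
        props.put("enable.auto.commit", "false");
        // With manual commits, a restart resumes at the last fully processed
        // message, so at-least-once delivery degrades gracefully instead of
        // replaying a whole auto-commit window.
        System.out.println(props.getProperty("enable.auto.commit")); // prints "false"
    }
}
```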
Great, now I have a better understanding. As said, at-least-once is OK for my scenario. My regards, Virgilio
Hi, thank you for this plugin. I'm experiencing a behaviour I don't understand, and I'd like to figure out whether there are errors in my business logic. Here is my skeleton: And I'm using mostly default values:
The transformation on the right is called every 1000 rows. Throughput is about 500 rows per second, which is aligned with the message rate. Sometimes we have to stop it for a while, so upon restart there is a lag that can be substantial (400k messages). This means that for half an hour Kettle saturates the server's resources to catch up. That's fine and it works. But in those cases, and only in those cases where the server is at full capacity, the Kafka consumer emits more messages than are actually in the Kafka topic. E.g.:
I understand this is acceptable in an at-least-once scenario, and that's OK; we discard duplicates, no harm done. But I'm curious how these repetitions relate to load. Why does this happen under stress? Is there a timeout I overlooked that makes the consumer retry?
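Since duplicates are discarded downstream anyway, one simple dedup scheme is to key each message by its (partition, offset) pair, which Kafka guarantees to be unique within a topic. This is an in-memory sketch with a hypothetical class name; a real pipeline would bound the set, e.g. by only tracking offsets above the last durably processed watermark, since an in-memory set does not survive the very restart that produces the duplicates:

```java
import java.util.HashSet;
import java.util.Set;

public class OffsetDeduplicator {
    private final Set<String> seen = new HashSet<>();

    /** Returns true the first time a (partition, offset) pair is observed. */
    public boolean accept(int partition, long offset) {
        return seen.add(partition + ":" + offset);
    }

    public static void main(String[] args) {
        OffsetDeduplicator dedup = new OffsetDeduplicator();
        System.out.println(dedup.accept(0, 100)); // true  (first delivery)
        System.out.println(dedup.accept(0, 101)); // true
        System.out.println(dedup.accept(0, 100)); // false (replayed after a restart)
    }
}
```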
Thank you Virgilio