apache / openwhisk-package-kafka

Apache OpenWhisk package for communicating with Kafka or Message Hub
https://openwhisk.apache.org/
Apache License 2.0

Add configurable "batch.size" property for feed subscriber #208

Open jthomas opened 7 years ago

jthomas commented 7 years ago

Feed subscribers should be allowed to configure the "batch.size" parameter to control how many messages are sent per trigger invocation. This will give users the ability to use Kafka as a more traditional queue by setting the "batch.size" to 1.

Apache Kafka recently introduced the max.poll.records client configuration parameter to support this.
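For illustration, max.poll.records caps how many records a single consumer poll returns, so setting it to 1 yields one message per poll. A minimal sketch of that semantics in plain Python (the `poll` function and `pending` list are illustrative stand-ins, not the actual Kafka client API):

```python
# Sketch of max.poll.records semantics: each poll drains at most
# `max_poll_records` messages from the pending stream. `poll` and
# `pending` are hypothetical, for illustration only.

def poll(pending, max_poll_records=1):
    """Return up to max_poll_records messages and remove them from pending."""
    batch = pending[:max_poll_records]
    del pending[:max_poll_records]
    return batch

pending = ["m1", "m2", "m3"]
print(poll(pending, max_poll_records=1))  # ['m1']
print(poll(pending, max_poll_records=2))  # ['m2', 'm3']
```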

jberstler commented 7 years ago

@jthomas I'm having a hard time imagining this value getting set to something other than 1. I wonder if instead this ought to be a boolean named batchMessages, with a default value of true, that, when set to false, always fires the trigger with exactly one message.

Implementing this should be trivial, but it needs to be excruciatingly clear to the user that because OpenWhisk trigger rate limits will typically be far lower than Kafka's typical rate of producing messages, limiting (or turning off) batching can conceivably result in the trigger falling hopelessly behind the current state of the topic - even to the point of completely missing messages that expire out of the topic before they are ever consumed by the feed.
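A rough sketch of how the proposed batchMessages switch could group polled messages into trigger payloads (the function name, payload shape, and batch_messages parameter are hypothetical, not the provider's actual code): with batching on, all polled messages share one payload; with batching off, each message gets its own trigger fire.

```python
def fire_batches(messages, batch_messages=True):
    """Group polled messages into trigger payloads.

    batch_messages=True  -> one payload carrying the whole batch.
    batch_messages=False -> one payload (one trigger fire) per message.
    """
    if batch_messages:
        return [{"messages": messages}]
    return [{"messages": [m]} for m in messages]

polled = ["a", "b", "c"]
print(len(fire_batches(polled, batch_messages=True)))   # 1 trigger fire
print(len(fire_batches(polled, batch_messages=False)))  # 3 trigger fires
```

The warning above falls out of the arithmetic: with batching off, three polled messages cost three trigger invocations, so a busy topic can exhaust the OpenWhisk trigger rate limit while messages keep accumulating.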

jthomas commented 7 years ago

@bjustin-ibm Using a boolean for toggling message batching would be nicer. I still think supporting a "batch.size" or "max.records" to control the batch size when messages.batch == true is useful.

Being able to control the parallelism when processing large amounts of messages seems really useful. It looks like the batch size defaults to 1MB (fetch.message.max.bytes) https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md which is fairly large.

Alongside making the documentation really clear on this issue, I'm wondering if there's any way we can surface these errors to the user. Thinking out loud: either push missed messages to an "error" queue or fire some kind of user-provided error trigger. There's probably a broader issue here of how feed providers surface errors to clients.

Another issue to consider: is it more efficient to have the provider still receive messages in large batches and fire triggers at the user-specified rate, or to pass those parameters directly through to the consumer client?
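The first option can be sketched as provider-side re-batching: poll large batches from Kafka as usual, then split them into trigger-sized payloads. A minimal sketch, assuming a hypothetical chunking helper (not part of the provider):

```python
def chunk_for_triggers(messages, batch_size):
    """Split one large polled batch into trigger payloads of at most
    batch_size messages each, keeping the consumer fetch size large
    while controlling how many messages each trigger fire carries."""
    return [messages[i:i + batch_size]
            for i in range(0, len(messages), batch_size)]

print(chunk_for_triggers(list(range(5)), 2))  # [[0, 1], [2, 3], [4]]
```

The trade-off is roughly: re-batching keeps fetches efficient but buffers messages in the provider, while setting max.poll.records on the client keeps the provider simple but makes each fetch smaller.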