Parsely / pykafka

Apache Kafka client for Python; high-level & low-level consumer/producer, with great performance.
http://pykafka.readthedocs.org/
Apache License 2.0
1.12k stars 232 forks source link

Tornado #731

Open mrocklin opened 7 years ago

mrocklin commented 7 years ago

I would like to consume and produce messages using PyKafka from within a Tornado application. I am potentially willing to contribute code for this. I have a few questions.

  1. Are the maintainers of this library comfortable adding tornado as an optional dependency? This would bloat your testing stack a bit. I'm also happy to do this externally if preferred.
  2. Is there already a mechanism to run a callback whenever data arrives? Using the non-blocking consume will work in most cases, but when it returns None we'll want to be triggered when data does arrive. Sometimes frameworks like this provide a mechanism for a callback function, which in our case could be setting an Event object. I'm particularly interested in the librdkafka consumer. As far as I can tell looking at the code there is no such mechanism exposed at the Python level, but I thought I would check. If forced we can always use a separate thread for this.

Separately, I noticed that there is no librdkafka solution for balanced consumers. Is this a limitation of librdkafka or is it that pykafka has not wrapped this functionality yet? If the latter then is there any near-term plan to support this?

mrocklin commented 7 years ago

On the producer side is there any back pressure? There doesn't seem to be any block= option when calling the produce(...) method. Is there a way to emit a message in a robustly non-blocking way and err if blocking would be inevitable?

emmettbutler commented 7 years ago

In general I'd like to avoid adding additional dependencies where possible, but it will be helpful to understand exactly what a Tornado dependency would enable. If there's a good tradeoff between tornado-specific code in pykafka and ease of use for users of both, it could be worth the added dependency. Without knowing what this will look like, though, I'm wary of adding even more to our already somewhat bloated test requirements.

Some work was started a while ago by @mikepk on a callback mechanism similar to the one you're describing but for produced messages in https://github.com/Parsely/pykafka/pull/506. If it seems necessary, we can either adapt that work or start from scratch on a callback interface that would meet your needs. I think when we chat tomorrow we'll have a better understanding of the specific requirements around this.

You might be noticing that there's no balanced_consumer.py in the rdkafka directory. You can still use rdkafka with balanced consumers through the use_rdkafka kwarg on BalancedConsumer.

jeffwidman commented 7 years ago

I noticed that there is no librdkafka solution for balanced consumers

That's because those should go the way of the dodo bird. Much better to use Kafka's native Consumer Group API's which are supported by librdkafka and also exposed by pykafka as ManagedBalancedConsumers. Pykafka's BalancedConsumer was built before these native Kafka API's existed, and (IMHO based on our production usage) has fairly serious problems such as https://github.com/Parsely/pykafka/issues/354

emmettbutler commented 7 years ago

To clarify, the callback that would solve this issue would be called here when a message becomes available on one or more partitions.