crowdrec / idomaar

CrowdRec reference framework
Apache License 2.0

Orchestrator - computing environment data streaming #45

Closed andras-sereny closed 9 years ago

andras-sereny commented 9 years ago

This issue is a continuation of an email thread about orchestrator - CE communication:

Hi Andras, I agree with you, all your comments are correct. I would like to explain why we chose to use Kafka. The main idea is to use components that are "production ready" and that can easily be scaled to manage a "big" number of events. Of course this introduces a lot of complexity in the development of the computing env (and also in the orchestrator), but it also gives us the possibility to use very common components (like Kafka and Zookeeper).

I also agree with you that ZMQ is already present and could be used instead of Kafka. One idea is to have a RESTful API that receives events and data (this covers the NewsREEL integration too), so that developers have the option of choosing the web API or Kafka without changing the rest of the orchestrator.

What do you think? Can we continue the discussion in a new GitHub issue so we can track our conclusions?

thanks, Davide

On Tue, Mar 10, 2015 at 9:43 AM, András Serény sereny.andras@gravityrd.com wrote:

Hi All,

On another note, although I don't see the full Idomaar picture yet, I find the Kafka/Flume setup complicated and unnecessary given the current state.

    Kafka is a persistent, highly available, distributed messaging service; the features come at the expense of a certain complexity. I don't think we really need either persistence or high availability for sending/receiving train and test data.
    We now have a handful of services: Kafka, Zookeeper, Flume, recommendation manager. This is just too many moving parts (especially for a small application). A lot of things can go wrong (and they do) in these services that we have to deal with; this slows down development. It's difficult or impossible to install the components locally. It takes a long time to build a VM with all the components.
    It places on computing environment implementations the burden of connecting to Zookeeper and reading from Kafka.

In short, it's complicated and the benefits are unclear. Computing environments already have to be able to talk via ZMQ, so why not just send train/test data and recommendation requests via ZMQ? This would make the whole setup much simpler.
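To make the proposal concrete, here is a minimal sketch in Python with pyzmq; the endpoint name, the PUSH/PULL socket pattern, and the tab-separated message format are illustrative assumptions, not the actual Idomaar protocol:

```python
import zmq

context = zmq.Context()

# Orchestrator side: bind a PUSH socket and stream train/test lines.
sender = context.socket(zmq.PUSH)
sender.bind("inproc://idomaar-data")  # e.g. tcp://*:5557 in a real deployment

# Computing environment side: connect a PULL socket and consume.
receiver = context.socket(zmq.PULL)
receiver.connect("inproc://idomaar-data")

# A hypothetical tab-separated event line with a JSON properties field.
sender.send_string('rating\t1\t1425974400\t{"userId": 42, "itemId": 7}')
print(receiver.recv_string())

context.destroy()
```

A PUSH socket load-balances across all connected PULL workers, so a computing environment could consume in parallel without any extra services.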

Cheers,
András
andras-sereny commented 9 years ago

Thanks Davide. I like Kafka very much and it might be the right tool if we want to send event logs from one component to another. I wonder, though, in our use case, what does it bring? Kafka’s strength is in the fact that data can be consumed multiple times, even at another time; messages are durable. I think we just want to send data immediately, no need to store.

As for scaling and the big number of events, what's the scenario you have in mind? A computing environment running multiple consumers to process data faster? We could easily use multiple ZMQ connections. Generally speaking, ZMQ claims to scale. I guess all of this is a purely theoretical concern, since the orchestrator can push data through a single channel much faster than the answer rate of any reasonably sophisticated recommender system.

How about we implement sending train/test data to the computing environment via ZMQ, leaving the Kafka solution in place as an option for CEs? I expect the ZMQ implementation and usage to be much simpler, especially for development purposes.

Yes, I think it's a good idea to offer computing environments a choice between different communication methods. REST APIs are of course very popular, so they seem a good choice too.
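For illustration, such a web-API entry point can be sketched with nothing but the Python standard library; the `/events` path and the JSON payload shape are hypothetical, since the actual API was never specified in this thread:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # stand-in for the queue feeding the recommender


class EventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Accept JSON events on a hypothetical /events endpoint.
        if self.path != "/events":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        received.append(json.loads(self.rfile.read(length)))
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):  # keep the sketch quiet
        pass


if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), EventHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()

    import urllib.request
    req = urllib.request.Request(
        "http://127.0.0.1:%d/events" % server.server_port,
        data=json.dumps({"type": "rating", "userId": 42}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
    print(received)  # the posted event arrives server-side
    server.shutdown()
```

A CE author could then pick whichever transport is easiest for their stack, with the orchestrator unchanged behind either front end.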

andras-sereny commented 9 years ago

As discussed in the workshop, we'd like to use Kafka/Flume in a production setting. Supporting a development setup is continued in #49.