Tyler-pierce / ElasticFlow

A library to enable Flow to operate in a cluster similarly to EMR + Spark

Learnings #1

Open matreyes opened 4 years ago

matreyes commented 4 years ago

Hi Tyler,

I've been studying stream processing (Kafka & Flink, and in Elixir: Broadway, Flow, GenStage), and this project seems like a natural evolution. Could you share some learnings from this project? What are its strengths and drawbacks? What would need to be done for it to be a production alternative to, say, Flink?

Thanks!

Tyler-pierce commented 4 years ago

Hello Matias. It's an interesting field to study, isn't it? It's been a while since I mentally mapped out this program, but basically I wanted to familiarize myself with the problem space and see how Erlang's VM fits into the picture. I was comparing it to EMR at the time, which was something I used for a few workloads.

There are some strengths here. The programming model is simpler, and as a project grows in complexity I can't help but think something built on Erlang is going to continue to behave predictably. Writing a program for a concurrent workload in the Erlang paradigm has fewer constraints: write as you would for any BEAM program and you're likely to make something workable. So overall the upside is being able to work within the toolset and environment of Erlang/Elixir, which should make it efficient to test, build tooling around, scale and deploy.

On the downside, this project is immature and simply a concept at the moment. I've only tested it in limited scenarios like word counting. It has a few optimizations but isn't truly tuned for real-life workloads. For example, I believe it was set up so that the master node does all the final merging to retrieve answers and also distributes the workloads, which is a bottleneck. Erlang's reputation for chattiness in a distributed environment would have to be looked into, so network activity would need to be compared to Flink's to see what optimizations are possible. In terms of the computations themselves, I haven't really dug into how speedy things will be in practice, but it's something to keep an eye on. I also need to check whether I've accidentally created other bottlenecks in data receiving and output. If I remember correctly, I may need to set up more workers or otherwise optimize that, or else the workload won't be distributed quickly enough. And of course there are guaranteed work and a few other Flink niceties: some work has been done toward that, but I don't think all the logic for retrying or storing unfinished work is there. We have the right tools to get that done, of course, but when business-critical things are running 24/7, more tooling (logging etc.) needs to come out of the box, I think.
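To make the merging bottleneck concrete, here is a rough sketch in Python (not Elixir, and not ElasticFlow's actual code; the function names and the `num_reducers` parameter are made up for illustration). It contrasts sending every partial word count to a single master with routing each word by hash to one of several reducers, so that no single node has to merge everything.

```python
from collections import Counter
from zlib import crc32


def count_words(chunk):
    """Map step: one worker counts the words in its own chunk of lines."""
    return Counter(word for line in chunk for word in line.split())


def merge_on_master(partials):
    """Single-master merge: every partial result is folded into one Counter."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def merge_partitioned(partials, num_reducers):
    """Hash-partitioned merge: each word is owned by exactly one reducer."""
    reducers = [Counter() for _ in range(num_reducers)]
    for partial in partials:
        for word, n in partial.items():
            # crc32 is stable across processes, unlike Python's built-in hash().
            owner = crc32(word.encode("utf-8")) % num_reducers
            reducers[owner][word] += n
    return reducers


# Hypothetical input: two workers, each holding a few lines of text.
chunks = [
    ["the quick brown fox", "the lazy dog"],
    ["the fox jumps", "over the dog"],
]
partials = [count_words(chunk) for chunk in chunks]

master_total = merge_on_master(partials)
reducers = merge_partitioned(partials, num_reducers=3)
```

Because the reducers hold disjoint sets of words, the final answer is just the union of their results, and the merge work is spread across `num_reducers` nodes instead of landing on one. Whether that pays off in a real cluster would still depend on the network cost of shuffling the partial counts.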

So, I'd say there is potential in that Flow is a small program with a lot of power and this experiment was created very quickly. But there is certainly a lot of work left to do to catch up to a production solution.

Feel free to continue discussion and thanks for having a look!

matreyes commented 4 years ago

Amazing, thank you! Would you mind if I post this conversation on Elixir Forum? I think it could be useful for other people.

Tyler-pierce commented 4 years ago

Thanks! I do think my answer could be better with a little more research and a refresher, but if you feel it will start the conversation you'd like, then by all means. I glanced over the code quickly and I'm definitely not happy with a lot of it. It was a pretty early Elixir project for me, and some of the bottlenecks I mentioned do exist. Maybe I'll do a quick pass at it soon.

matreyes commented 4 years ago

Oh, but that's so common. Nobody is happy with code they wrote in the past…

I've just come across a page about Flink internals. I will take a look to understand it better before I get into the discussion.

Best!