Why does this exist - what problems does it solve!?

elsmorian commented 9 years ago

I keep finding cool GitHub projects that look really interesting, and dpark is one of them. I see that you have rewritten Spark in Python, which is cool because I am a Python fan and am also interested in Spark. But I find no mention of why, and this puzzles me: what problems with the original Spark were you trying to solve? Did you find some big bug in the current implementation, or was this instead just a learning exercise, something done for fun (an impressive exercise if so!)?

Basically, it would be nice to have a paragraph in the README saying why this project exists: what limitations it was trying to overcome, what performance it was trying to gain, etc.

Many thanks!

windreamer commented 9 years ago

Hi elsmorian,

DPark is actually a project designed for our in-house needs. At Douban Inc. we use Python for almost everything, so building an analytics platform with Python is a natural choice. Furthermore, most of our data and logs have been stored in MooseFS for quite a long time, so it would be impossible to migrate all of it to HDFS or any other distributed filesystem supported by Spark. And finally, we find that building the whole stack in Python is lightning fast and free of limitations: we can use lambda functions, import whatever packages we want, and use Numba to JIT-compile hot code to speed up mathematical calculations. Maintaining the whole Java ecosystem of Spark is definitely a pain for us, so DPark is our answer. We hope this will give the Python community another choice for parallel computing.

The project exists because it needs to, and this is all based on our situation. So, to be frank, I can't find a reason why we should explain in the README why this project exists. I think (maybe wrongly) that people facing the same problem will find this project and find it useful.

elsmorian commented 9 years ago

Hi windreamer, thanks for the reply. This is a really interesting project for me as I also work in a Python-focused company, and we are looking at using Spark!

I think it would still be useful to add a one-liner to the README, however, just to explain why it came about. If people think it's a hobby or experimental project, they won't use it in production. It just gives a little more information about the project to the people who might end up using it, creating pull requests, etc.

windreamer commented 9 years ago

I will consider this, thank you for the suggestion.

elsmorian commented 9 years ago

Thanks for your reply! I will continue to watch this repo with interest :smile:
