Open codlife opened 7 years ago
Hi Codlife,
Thanks a lot for reaching out to us.
Actually, Spark is a good implementation of an iterative in-memory computing framework. We also implemented an iterative in-memory computing framework called Guagua, which runs on Hadoop YARN (or as a mapper-only job). So at the framework level, Spark and Guagua are both in-memory iterative computing frameworks. You can check our QCon slides for more details on Guagua: http://shifu.ml/docs/guagua/ http://www.slideshare.net/pengshanzhang/guagua-an-iterative-computing-framework-on-hadoop https://github.com/ShifuML/guagua
Here are some reasons why we still use our own framework:
You mentioned 'Spark is better than Hadoop', but Spark can also run on Hadoop YARN. In the model training step, we do in-memory iterative computing just like Spark on YARN, with Hadoop leveraged only as the resource manager. Although Guagua is based on a mapper-only job on YARN, that is not real MapReduce. In this sense, Spark and Guagua are the same kind of framework.
'Spark is better than Hadoop' not only in iterative computing but also for DAG jobs. For other steps in our pipeline, such as feature statistics and feature transformation, we still use MapReduce/Pig. We evaluated Spark there and found only a small performance gain (about 10%), because our MapReduce/Pig jobs are not complex DAG jobs and caching is not effective for them. Migrating those features to Spark is on our issue list, but they are not our bottleneck; our bottleneck is still in training.
The last reason is that we would like to keep Shifu lightweight (a ~30 MB tar package). If Spark were introduced, it would become a big dependency, so we would prefer to keep Spark-based features outside of Shifu, like Spark packages.
BTW, Spark without dynamic resource allocation (DRA) is a nightmare: idle Spark jobs will eat up the whole cluster without releasing resources in time. Although DRA is enabled on our cluster, some users still run their own Spark builds and do not release resources after use. And with Spark it is very hard to estimate how many resources you actually need, especially for in-memory computing.
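For readers unfamiliar with DRA, this is roughly what enabling it looks like in `spark-defaults.conf` (a minimal sketch; the exact values are illustrative assumptions, not our cluster's settings, and the external shuffle service must be set up on each node for this to work):

```
# Let Spark request and release executors based on workload
spark.dynamicAllocation.enabled              true
# Required so shuffle data survives executor removal
spark.shuffle.service.enabled                true
# Release executors that have been idle for this long
spark.dynamicAllocation.executorIdleTimeout  60s
# Bound how many executors a single job can grab
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         50
```

Without `executorIdleTimeout` and a sane `maxExecutors` cap, a single interactive Spark session can hold executors indefinitely, which is exactly the cluster-hogging behavior described above.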
Not sure if this is clear now, but please feel free to comment if you still have further questions.
Thanks, Zhang Pengshan
Hi devs: As we all know, Spark is better than Hadoop in many cases, especially for iterative machine learning, so my question is: why do we continue with this approach? Why don't we migrate this work to Spark? Thank you very much! Best wishes!