ShifuML / shifu

An end-to-end machine learning and data mining framework on Hadoop
https://github.com/ShifuML/shifu/wiki
Apache License 2.0

Some doubts about this project #257

Open codlife opened 7 years ago

codlife commented 7 years ago

Hi devs: As we all know, Spark is better than Hadoop in many cases, especially for iterative machine learning. So my question is: why do we continue down this path, and why don't we migrate this work to Spark? Thank you very much! Best wishes!

zhangpengshan commented 7 years ago

Hi Codlife,

Thanks a lot for reaching out to us.

Actually, Spark is a good implementation of an iterative in-memory computing framework. We also implemented an iterative in-memory computing framework called Guagua, which runs on Hadoop YARN or as a mapper-only job. So at the framework level, Spark and Guagua are both in-memory iterative computing frameworks. You can check our QCon slides for more Guagua details: http://shifu.ml/docs/guagua/ http://www.slideshare.net/pengshanzhang/guagua-an-iterative-computing-framework-on-hadoop https://github.com/ShifuML/guagua

Here are some reasons why we still use our own framework:

  1. Our framework predates Spark, and our Hadoop cluster ran Hadoop 1.x for several years. Guagua can run either as a mapper-only job or on YARN, which means it runs on Hadoop 1.x very well.
  2. Straggler issue: Spark is very good, but we still hit straggler tasks even with speculative execution, which makes it very hard to run computations over thousands of iterations. Guagua handles such issues well: current Guagua supports partial completion, which means you can set a threshold like 95% per iteration, and once 95% of workers finish, that iteration is considered complete. At the idea level this is like the async Spark in your repo.
  3. Spark doesn't support our core algorithms, like neural networks and GBDT, very well. For neural networks, Spark MLlib only has some initial commits and is not mature. GBDT in Spark MLlib is good, but not with thousands of trees (checkpointing helps), and weighted training is not supported in MLlib GBDT.
  4. A mature algorithm is very important, but if you are familiar with Shifu, what we care about is not only good algorithms but also the end-to-end pipeline. The Spark MLlib pipeline is good, but it lacks features we need like binning, feature selection, WOE transform, KS & IV computing, a bagging framework...
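The partial-completion idea in point 2 can be sketched roughly as follows. This is a minimal, hypothetical Python simulation, not Guagua's actual API: a master advances to the next iteration once a configurable fraction of workers (e.g. 95%) has reported, so a few stragglers cannot stall the whole job.

```python
import random

def run_iteration(num_workers, completion_ratio=0.95):
    """Collect worker results until the partial-completion threshold is met.

    Workers finish in a random order; stragglers past the threshold are
    simply skipped for this iteration (a simplification of Guagua's
    partial-completion setting, not its real implementation).
    """
    needed = int(num_workers * completion_ratio)
    finished = []
    # Simulate workers finishing in random order.
    order = list(range(num_workers))
    random.shuffle(order)
    for worker_id in order:
        finished.append(worker_id)
        if len(finished) >= needed:
            break  # barrier released: enough workers have reported
    return finished

def train(num_workers=100, iterations=3, completion_ratio=0.95):
    for it in range(iterations):
        results = run_iteration(num_workers, completion_ratio)
        # The master aggregates partial results (e.g. averages gradients)
        # from the workers that made the cut, then broadcasts the next
        # model; stragglers simply rejoin in the following iteration.
        print(f"iteration {it}: aggregated {len(results)}/{num_workers} workers")

train()
```

With a 95% threshold and 100 workers, each iteration completes as soon as 95 workers report, instead of waiting on the slowest 5.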

You mentioned 'Spark is better than Hadoop', but Spark can also run on Hadoop YARN. In the model training step we do in-memory iterative computing just like Spark on YARN, with Hadoop leveraged only as the resource manager. Although Guagua is based on a mapper-only job or YARN, that's not real MapReduce. In this sense, Spark and Guagua are the same kind of framework.

'Spark is better than Hadoop' applies not only to iterative computing but also to DAG jobs. For some other steps in our pipeline, like feature statistics and feature transformation, we are still on MapReduce/Pig. We evaluated Spark and did not find a very big performance gain (about a 10% improvement): our MapReduce/Pig jobs are not complicated DAG jobs, so caching is not effective. Migrating those features to Spark is on our issue list, but that is not our bottleneck. Our bottleneck is still training.

The last reason is that we would like to keep Shifu lightweight (a 30MB tar package). If Spark were introduced, it would be a big dependency, so we would prefer to keep Spark features outside of Shifu, like Spark packages.

BTW, Spark without DRA (dynamic resource allocation) is a nightmare: idle Spark jobs will eat the whole cluster without releasing resources in time. Although DRA is enabled on our cluster, some people still run their own Spark and don't release resources after use. And with Spark it is very hard to estimate how many resources you actually need, especially for in-memory computing.
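For reference, the DRA mentioned here is Spark's dynamic resource allocation. A typical configuration (values illustrative; property names are from the standard Spark configuration, and DRA also requires the external shuffle service) looks like:

```properties
spark.dynamicAllocation.enabled            true
spark.shuffle.service.enabled              true
spark.dynamicAllocation.minExecutors       1
spark.dynamicAllocation.maxExecutors       50
spark.dynamicAllocation.executorIdleTimeout 60s
```

With this, executors idle longer than the timeout are released back to YARN instead of being held for the lifetime of the job.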

Hope this clears things up, but please feel free to comment if you still have further questions.

Thanks, Zhang Pengshan