cloudml / zen

Zen aims to provide the largest-scale and most efficient machine learning platform on top of Spark, including but not limited to logistic regression, latent Dirichlet allocation, factorization machines and DNN.
Apache License 2.0

Constrained ALS and ALM #16

Open debasish83 opened 9 years ago

debasish83 commented 9 years ago

@witgo

I have a package for factorization that's based on ml.recommendation.ALS but with several major changes:

  1. For ALS, user and product constraints can be specified. This allows us to add column-wise L2 regularization for words and L1 regularization for documents (through Breeze QuadraticMinimizer) to run sparse coding.
  2. In place of L1 regularization, a probability simplex constraint can be put on documents and positivity constraints on words to get PLSA constraints with least-squares loss.
  3. Alternating minimization supports KL divergence and likelihood loss with positivity constraints in matrix factorization, to run the PLSA formulation and generate LDA results through factorization.
  4. Alternating minimization shuffles sparse vectors and is designed to scale to large-rank matrix factorization, like Petuum.
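
To make the constraint types in the list above concrete, here is a minimal pure-Python sketch of the per-row subproblem that a constrained ALS step has to solve. It is not the code from the Spark PRs: the helper names (`soft_threshold`, `project_nonnegative`, `project_simplex`, `solve_row`) are hypothetical, and the solver is plain projected gradient descent, swapped in for illustration in place of Breeze QuadraticMinimizer.

```python
def soft_threshold(v, lam):
    """Proximal operator of lam * ||v||_1 (the L1 / sparse coding case)."""
    out = []
    for x in v:
        if x > lam:
            out.append(x - lam)
        elif x < -lam:
            out.append(x + lam)
        else:
            out.append(0.0)
    return out


def project_nonnegative(v):
    """Euclidean projection onto the positivity constraint x >= 0."""
    return [max(x, 0.0) for x in v]


def project_simplex(v, total=1.0):
    """Euclidean projection onto {x : x >= 0, sum(x) = total} (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - total) / k
        if u_k - t > 0:
            theta = t  # keep the shift from the largest k that still qualifies
    return [max(x - theta, 0.0) for x in v]


def solve_row(A, b, project, steps=200, lr=0.1):
    """Projected gradient for min_x 0.5 * ||A x - b||^2 with x in a convex set.

    A is a list of rows (the fixed factor), b the observed ratings/counts,
    and `project` maps a vector onto the constraint set. Illustration only;
    lr must be small enough for the given A.
    """
    n = len(A[0])
    x = project([1.0 / n] * n)
    for _ in range(steps):
        residual = [sum(a * x_j for a, x_j in zip(row, x)) - b_i
                    for row, b_i in zip(A, b)]
        grad = [sum(A[i][j] * residual[i] for i in range(len(A)))
                for j in range(n)]
        x = project([x_j - lr * g_j for x_j, g_j in zip(x, grad)])
    return x


# Example: one document row under the simplex (PLSA-style) constraint.
identity = [[1.0, 0.0], [0.0, 1.0]]
doc_row = solve_row(identity, [2.0, 0.0], project_simplex)
```

In this sketch, swapping `project_simplex` for `project_nonnegative` gives the positivity-constrained variant; an L1 variant would apply `soft_threshold` with threshold `lr * lam` after each gradient step instead of a projection.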

Details are on the following JIRAs:

  1. https://issues.apache.org/jira/browse/SPARK-2426
  2. https://issues.apache.org/jira/browse/SPARK-6323

If it looks useful, I can add a factorization package in zen and bring the code over from the Spark PRs. zen is already in spark-packages, so I don't have to introduce another new package. If users find it useful, maybe later we can move it back to ml. It changes the user-facing API significantly.

Next I want to move these algorithms to the GraphX API and compare the runtime and efficiency. Since zen is focused on optimizing GraphX for ML, I feel zen is an ideal package for these factorization algorithms.

The factorization outputs are large distributed models, and a natural extension is to add a few hidden layers between user/word and item/document and develop a distributed neural net formulation. That should use an optimized GraphX API, and I think you have already built many of these optimizations in zen.

hucheng commented 9 years ago

@debasish83 It would be great if you could contribute to Zen. We agree that GraphX is suitable for factorization. Please feel free to propose PRs and we can discuss them one by one.

Thanks.