dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

sklearn-Ensemble_Tree Algo distributed #524

Open xiaozhongtian opened 5 years ago

xiaozhongtian commented 5 years ago

Hello, I'm working on a project that needs the dask-ml library to handle a large dataset. I couldn't find distributed versions of the basic tree algorithms, such as DecisionTree or RandomForest, in dask-ml. If I use the sklearn tree algorithms instead, I will probably run into memory problems.

TomAugspurger commented 5 years ago

Right, Dask-ML doesn't have any distributed tree-based estimators at the moment.

https://github.com/dask/dask-ml/issues/299 may be interesting. Scikit-Learn has since expanded Olivier's prototype into https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier.

xiaozhongtian commented 5 years ago

OK. For now I will probably look for something in dask.xgboost and dask.lightgbm instead.

TomAugspurger commented 4 years ago

Collecting links from https://github.com/dask/dask-ml/issues/299

http://papers.nips.cc/paper/6380-a-communication-efficient-parallel-algorithm-for-decision-tree describes a basic algorithm for distributed gradient boosting, and then a more efficient, but much more complicated algorithm.
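To make the "basic algorithm" idea concrete, here is a minimal sketch of data-parallel split finding: each worker builds a gradient histogram over its own partition, the histograms are summed, and the best split is chosen from the merged result. This is only an illustration under simplifying assumptions (a single pre-binned feature, a squared-error style gain, plain Python lists standing in for workers). The function names are hypothetical, and it is not the paper's exact procedure or its more complicated, communication-efficient variant.

```python
# Illustrative sketch only: data-parallel split finding with merged histograms.
# All function names below are hypothetical, not part of dask-ml or scikit-learn.

def local_histogram(bins, gradients, n_bins):
    """Per-bin gradient sums and sample counts for one worker's partition."""
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for b, g in zip(bins, gradients):
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


def merge_histograms(histograms):
    """Element-wise sum of per-worker histograms (the only data exchanged)."""
    n_bins = len(histograms[0][0])
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for g, c in histograms:
        for i in range(n_bins):
            grad_sum[i] += g[i]
            count[i] += c[i]
    return grad_sum, count


def best_split(grad_sum, count):
    """Return (bin_index, gain) for the split 'bin <= bin_index' with largest gain.

    Gain used here (an assumption): G_L^2/n_L + G_R^2/n_R - G^2/n.
    """
    total_g, total_n = sum(grad_sum), sum(count)
    best = (None, 0.0)
    left_g, left_n = 0.0, 0
    for i in range(len(count) - 1):
        left_g += grad_sum[i]
        left_n += count[i]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue  # a split must leave samples on both sides
        gain = left_g**2 / left_n + right_g**2 / right_n - total_g**2 / total_n
        if gain > best[1]:
            best = (i, gain)
    return best


# Two simulated workers, each holding part of one feature binned into 4 bins.
partitions = [
    ([0, 0, 1, 3], [-1.0, -1.0, -1.0, 1.0]),
    ([1, 2, 2, 3], [-1.0, 1.0, 1.0, 1.0]),
]
local = [local_histogram(b, g, n_bins=4) for b, g in partitions]
merged = merge_histograms(local)
split_bin, gain = best_split(*merged)
```

Only the fixed-size histograms cross worker boundaries, never the raw rows, so the communication cost in this sketch depends on the number of bins and features rather than on the number of samples.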

cc @nicolashug. It seems like we won't be able to reuse much, if any, of the scikit-learn implementation if we want a distributed implementation.