dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License
893 stars 255 forks

distributed TF IDF with Multinomial Naive Bayes #115

Closed DavisTownsend closed 1 year ago

DavisTownsend commented 6 years ago

It would be state of the art and very useful if Dask could natively handle distributed TF-IDF matrices as input to a multinomial naive Bayes model. I know this is a difficult problem because most TF-IDF implementations need the entire term-document matrix in memory, so I'm not sure how to solve it, tbh. The problem is referenced here: https://stackoverflow.com/questions/25145552/tfidf-for-large-dataset

TomAugspurger commented 6 years ago

https://github.com/dask/dask-ml/issues/5 may be of interest.

DavisTownsend commented 6 years ago

Yeah, I've seen that; it didn't seem like a final answer was shown there. It'd be nice to have it natively supported, though, so I can make my business process depend on it without worrying too much about supporting it myself going forward.
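
For anyone landing here later, a rough sketch of why the memory concern raised in the first comment may be avoidable. The only corpus-wide quantities TF-IDF needs are the per-term document frequencies and the total document count, and both can be accumulated chunk by chunk and merged. This is plain Python, not a dask-ml API; `tokenize`, `chunk_doc_freq`, `merge`, and `chunk_tfidf` are hypothetical helper names, the whitespace tokenizer and the smoothed IDF form are assumptions, and no row normalization is applied.

```python
import math
from collections import Counter
from functools import reduce


def tokenize(doc):
    # Hypothetical whitespace tokenizer; a real pipeline would use something richer.
    return doc.lower().split()


def chunk_doc_freq(chunk):
    """Pass 1, per chunk: count how many documents in the chunk contain each term."""
    df = Counter()
    for doc in chunk:
        df.update(set(tokenize(doc)))  # set() so each document counts a term once
    return df, len(chunk)


def merge(a, b):
    """Combine two partial (document-frequency, document-count) results."""
    return a[0] + b[0], a[1] + b[1]


def chunk_tfidf(chunk, df, n_docs):
    """Pass 2, per chunk: weight raw term counts by a smoothed IDF (an assumed form)."""
    rows = []
    for doc in chunk:
        tf = Counter(tokenize(doc))
        rows.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return rows


# Two chunks standing in for partitions that never sit in memory together.
chunks = [
    ["dask scales python", "python is fun"],
    ["naive bayes with dask"],
]

# Pass 1: small per-chunk summaries, merged into global statistics.
df, n_docs = reduce(merge, (chunk_doc_freq(c) for c in chunks))

# Pass 2: each chunk is transformed independently using the global statistics.
tfidf_chunks = [chunk_tfidf(c, df, n_docs) for c in chunks]
```

Since both passes touch one chunk at a time, each could in principle be mapped over partitions of a distributed collection. The per-chunk rows could then feed an incrementally trained classifier; scikit-learn's `MultinomialNB`, for instance, has a `partial_fit` method, though wiring that up is left out here.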