dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org

TFIDF vectorizer #744

Closed dfernan closed 3 years ago

dfernan commented 3 years ago

I'd like to be able to do TF-IDF calculations in Dask, in a similar fashion to Apache Spark: https://spark.apache.org/docs/latest/mllib-feature-extraction.html
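
For concreteness, the pattern from the linked Spark docs looks roughly like this (the file path is illustrative):

```python
# Spark's two-stage TF-IDF: hash terms to counts, then fit a stateful IDF
# model and apply it. Based on the MLlib docs; "docs.txt" is illustrative.
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF

sc = SparkContext()
# One document per line, tokenized on whitespace.
documents = sc.textFile("docs.txt").map(lambda line: line.split(" "))

tf = HashingTF().transform(documents)  # term-frequency vectors
tf.cache()                             # reused by both fit and transform
idf = IDF().fit(tf)                    # stateful: learns document frequencies
tfidf = idf.transform(tf)              # IDF-weighted vectors
```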

Right now, something similar does exist: Dask-ML's counting/hashing vectorizers are essentially TF-IDF without the IDF weighting/normalization step.

Another implementation exists in RAPIDS cuML, but I'm not sure how easy it would be to translate to the Dask framework: https://github.com/rapidsai/cuml/blob/cdb14e7de6a40d8d707d29b2889a89aa553125ee/python/cuml/feature_extraction/_tfidf_vectorizer.py
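
For reference, a minimal sketch of what the existing Dask-ML vectorizers already cover (the sample documents are illustrative):

```python
# What Dask-ML offers today: hashing-based term counts over a dask.bag of
# raw text documents, with no IDF weighting applied.
import dask.bag as db
from dask_ml.feature_extraction.text import HashingVectorizer

docs = db.from_sequence(
    ["the quick brown fox", "jumps over the lazy dog", "the dog barks"],
    npartitions=2,
)

# norm=None and alternate_sign=False yield plain nonnegative term counts
# rather than the default L2-normalized, signed hashed features.
vectorizer = HashingVectorizer(norm=None, alternate_sign=False)
X = vectorizer.fit_transform(docs)  # lazy dask.array with sparse chunks
```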

stsievert commented 3 years ago

I think this issue belongs in Dask-ML. @TomAugspurger is it possible to move this issue over there?

I'm not seeing TF-IDF normalization in CountVectorizer or HashingVectorizer. The documentation for HashingVectorizer specifically mentions the following:

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

  • ...
  • no IDF weighting as this would render the transformer stateful.

One way to resolve this issue would be to implement TfidfTransformer.fit, sketched below.
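
A minimal sketch of what that could look like, assuming the input is a dask.array of term counts with dense chunks (this class is hypothetical, not an existing Dask-ML API):

```python
# Hypothetical sketch only; Dask-ML does not ship this class. It mirrors
# scikit-learn's TfidfTransformer defaults (smooth_idf=True, norm="l2")
# and operates lazily on a dask.array of term counts.
import dask.array as da

class DaskTfidfTransformer:
    def fit(self, X):
        # X: (n_samples, n_features) term-count matrix as a dask.array.
        n_samples = X.shape[0]
        # Document frequency: in how many documents does each term appear?
        df = (X > 0).sum(axis=0)
        # Smoothed IDF, matching scikit-learn's smooth_idf=True.
        self.idf_ = da.log((1 + n_samples) / (1 + df)) + 1
        return self

    def transform(self, X):
        tfidf = X * self.idf_
        # L2-normalize each row, matching scikit-learn's norm="l2".
        norms = da.sqrt((tfidf ** 2).sum(axis=1, keepdims=True))
        return tfidf / norms

    def fit_transform(self, X):
        return self.fit(X).transform(X)
```

The stateful part is `idf_`, which is exactly what HashingVectorizer avoids; in practice one would probably persist or compute it after `fit` so that `transform` doesn't retraverse the data to recover the document frequencies.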

TomAugspurger commented 3 years ago

Thanks. We have https://github.com/dask/dask-ml/issues/5 for this already. Let us know if you're interested in working on it.

stsievert commented 3 years ago

@dfernan I also pointed to the cuML implementation in https://github.com/dask/dask-ml/issues/5#issuecomment-698427819, which explicitly supports Dask.