**Closed** — dfernan closed this issue 3 years ago
I think this issue belongs in Dask-ML. @TomAugspurger is it possible to move this issue over there?
I'm not seeing TF-IDF normalization in CountVectorizer or HashingVectorizer. The documentation of HashingVectorizer specifically mentions the following:

> There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):
> - ...
> - no IDF weighting as this would render the transformer stateful.
One way to resolve this issue would be to implement TfidfTransformer.fit.
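For reference, here is a minimal sketch of what such a fit step would compute, using NumPy and scikit-learn's smoothed-IDF formula (`idf = log((1 + n_docs) / (1 + df)) + 1`); the function names `fit_idf` and `transform_tfidf` are hypothetical, and a real Dask-ML version would operate on Dask arrays instead of in-memory NumPy arrays:

```python
import numpy as np

def fit_idf(counts):
    """Compute smoothed IDF weights from a document-term count matrix.

    Mirrors scikit-learn's TfidfTransformer with smooth_idf=True:
    idf = log((1 + n_docs) / (1 + df)) + 1
    The fit is stateful only in this learned idf vector, which is the
    part HashingVectorizer alone cannot provide.
    """
    n_docs = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)  # document frequency per term
    return np.log((1 + n_docs) / (1 + df)) + 1

def transform_tfidf(counts, idf):
    """Apply IDF weighting and L2-normalize each document row."""
    tfidf = counts * idf
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    norms[norms == 0] = 1  # avoid dividing empty documents by zero
    return tfidf / norms
```

Both the document-frequency count and the row normalization reduce over independent axes, so they should map naturally onto chunked Dask arrays.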
Thanks. We have https://github.com/dask/dask-ml/issues/5 for this already. Let us know if you're interested in working on it.
@dfernan I'd also point to the cuML implementation in https://github.com/dask/dask-ml/issues/5#issuecomment-698427819, which explicitly supports Dask.
I'd want to be able to do TF-IDF calculations in Dask in a similar fashion to Apache Spark's MLlib: https://spark.apache.org/docs/latest/mllib-feature-extraction.html
Right now, the counting/hashing vectorizers do exist in Dask-ML; they are essentially TF-IDF without the IDF weighting and normalization step.
Another implementation exists in RAPIDS cuML, but I'm not sure how easy it would be to translate to the Dask framework: https://github.com/rapidsai/cuml/blob/cdb14e7de6a40d8d707d29b2889a89aa553125ee/python/cuml/feature_extraction/_tfidf_vectorizer.py