dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

CountVectorizer.fit_transform fails with remote vocabulary #712

Closed (TomAugspurger closed this 4 years ago)

TomAugspurger commented 4 years ago
In [1]: import dask.bag as db

In [2]: import dask_ml.feature_extraction.text

In [3]: from dask.distributed import Client
   ...: client = Client()

In [4]: vocab = {"foo": 0, "bar": 1}

In [6]: remote_vocab, = client.scatter((vocab,), broadcast=True)

In [7]: vect = dask_ml.feature_extraction.text.CountVectorizer(vocabulary=remote_vocab)

In [8]: bag = db.from_sequence(['foo bar', 'foo', 'bar'], npartitions=2)

In [9]: vect.fit_transform(bag)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-f0e608066e3c> in <module>
----> 1 vect.fit_transform(bag)

~/sandbox/dask-ml/dask_ml/feature_extraction/text.py in fit_transform(self, raw_documents, y)
    188             vocabulary_ = vocabulary.compute()
    189
--> 190         n_features = len(vocabulary_)
    191         result = raw_documents.map_partitions(
    192             _count_vectorizer_transform, vocabulary_for_transform, params

TypeError: object of type 'Future' has no len()

Calling transform on its own works fine:

In [10]: vect.transform(bag)
Out[10]: dask.array<from-bag-_count_vectorizer_transform, shape=(nan, 2), dtype=int64, chunksize=(nan, 2), chunktype=scipy.csr_matrix>

mrocklin commented 4 years ago

What is the type of raw_documents? Most collections should support futures as inputs. Bags may not be hip enough yet? The easy solution may be to wrap it in a delayed?
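
For readers following along, the failure in the traceback can be reproduced without a cluster: a future is only a handle to a value, so len() fails until the result is fetched. The sketch below is an illustration under stated assumptions, not the dask-ml code or the fix that was eventually applied. It uses the standard library's concurrent.futures.Future as a stand-in for the distributed Future, and vocabulary_length is a hypothetical helper name.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def vocabulary_length(vocabulary):
    """Return the number of terms in a vocabulary that may be a future.

    Hypothetical helper for illustration; not dask-ml's implementation.
    """
    if isinstance(vocabulary, Future):
        # A future is only a handle to the value, so fetch the value first.
        vocabulary = vocabulary.result()
    return len(vocabulary)


vocab = {"foo": 0, "bar": 1}

with ThreadPoolExecutor(max_workers=1) as pool:
    # Stand-in for client.scatter: the dict now sits behind a future.
    remote_vocab = pool.submit(dict, vocab)

    try:
        len(remote_vocab)
    except TypeError as exc:
        # Same failure mode as in the traceback above.
        error_message = str(exc)

    n_features = vocabulary_length(remote_vocab)
    local_n_features = vocabulary_length(vocab)
```

With dask.distributed, calling .result() on the client copies the vocabulary back to the local process. If that is undesirable, the length could instead be computed on the cluster, for example with client.submit(len, remote_vocab).result(), so that only the integer travels back.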

(Sent as an email reply to Tom Augspurger's report of Fri, Jul 24, 2020.)