CountVectorizer for text preprocessing

jrdzha commented 4 years ago

I understand why hash seems to be a better solution for distributed text preprocessing, but I also need a way to make my features human-readable. It seems like spark has a CountVectorizer. Would it be possible to implement one for dask-ml?

TomAugspurger commented 4 years ago

Yeah, that should be relatively straightforward, especially if you provide a vocabulary: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html, which will tell us the output shape without having to look at the data.

If a vocabulary isn't provided then I think we'd need to make two passes over the data

Determine all the unique terms in the documents
Transform the values into the appropriate place in the discovered vocabulary

Are you interested in working on this?

mrocklin commented 4 years ago

@lesteve you might find this issue interesting.

jrdzha commented 4 years ago

@TomAugspurger Yes this is something I'd be interested in contributing. I'm not familiar with dask internals, so I'll need to see...

TomAugspurger commented 4 years ago

Thanks. Let me know if you run into any stumbling blocks.

My guess is that the end result will

Somehow determine the vocabulary (the trickiest part I think)
Use input.map_partitions(est.transform, ...) with an sklearn.feature_extreaction.text.CountVectorizer with that vocabulary

On Thu, Jul 2, 2020 at 10:26 PM Jared Zhao notifications@github.com wrote:

@TomAugspurger https://github.com/TomAugspurger Yes this is something I'd be interested in contributing. I'm not familiar with dask internals, so I'll need to see...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dask/dask-ml/issues/689#issuecomment-653320239, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKAOIXZS76LXLBVOHRKWTTRZVFVJANCNFSM4ONACIQA .

stsievert commented 4 years ago

Somehow determine the vocabulary (the trickiest part I think)

Why won't the naive solution work?

>>> import dask.array as da
>>> def get_vocab(dask_array):
...     return da.unique(y)
>>> x = [["this", "is"], ["a", "sentence"]]
>>> y = da.from_array(x, chunks=(1, -1))
>>> vocab = get_vocab(y)
>>> vocab.compute()
array(['a', 'is', 'sentence', 'this'], dtype='<U8')

Doesn't compute have to be called to use Scikit-learn's CountVectorizer?

jrdzha commented 4 years ago

@stsievert I believe that unique is not good enough, since that just counts unique strings. CountVectorizer is to break strings into n-gram words. For an example, "this is a sentence" can be broken into "this", "is", "a", "sentence" or "this is" "a sentence". So there is some non-trivial computation that needs to be done on each string record in a array.

stsievert commented 4 years ago

@jrdzha I think this solution would work to compute the vocabulary:

# input
docs = da.from_array(["this is one document", "this is a second document"], chunks=1)

# Calculate vocab
est = CountVectorizer(...)
vocabs = docs.map_blocks(lambda x: est._count_vocab(x, est.fixed_vocabulary_)[0], dtype=dict)
vocabs = [set(vi) for vi in vocabs.compute()]
vocab = {w: i for i, w in enumerate(set().union(*vocabs))}

.compute still needs to be called, which was my main question in https://github.com/dask/dask-ml/issues/689#issuecomment-653614496. Is there any way to avoid that?

jrdzha commented 4 years ago

@stsievert Well we just need to have a shared vocabulary across the workers, so I think maybe an Actor could work too?

stsievert commented 4 years ago

@jrdzha I would run with the workflow in https://github.com/dask/dask-ml/issues/689#issuecomment-653814095. Avoiding a .compute call would be nice but I don't think it's necessary. I think actors would get around the issue, but I'm not sure that complexity is warranted.

TomAugspurger commented 4 years ago

I don't think actors are necessary. They're designed for mutating shared state across workers. But for the no user-provided vocabulary case, there's no shared state to mutate. We don't know the shape of the output until we examine the raw documents.

Avoiding a .compute call would be nice but I don't think it's necessary.

I don't see a way to avoid it (other than users providing it ahead of time).

jrdzha commented 4 years ago

@TomAugspurger That makes sense to me. One question I have though is: wouldn't calling compute make Dask behave in a non-lazy way? Users may not expect that behavior?

TomAugspurger commented 4 years ago

In dask-ml, fit always computes. The only question here is whether or not we have one or two passes over the data.

jrdzha commented 4 years ago

@TomAugspurger That makes sense to me. I think we do need 2 calls to compute. This is my understanding:

We would need partition-specific CountVectorizers per worker.
We need to fit these CountVectorizers on each partition to get partition-specific vocabulary.
The first compute is to execute this, and get each partition's vocabulary
The total vocabulary can be constructed on the client, then applied back to each worker.
New CountVectorizers with the total vocabulary are redefined on each worker for each partition.
The second compute is to fit this new CountVectorizer for each partition.

One question I have (I'm not too familiar with dask internals) is, if each partition blows up in size, how can I "repartition" from each worker? aka split a large partition into n partitions, and potentially requesting more workers from the scheduler?

TomAugspurger commented 4 years ago

if each partition blows up in size, how can I "repartition" from each worker?

Generally, I think this falls on the user rather than the algorithm. If the transformed version is much larger than the original then it's easiest for them to rechunk prior to using it.

stsievert commented 4 years ago

@jrdzha what progress have you made on this issue? I can help out if you'd like. I'm excited for this PR because I've handled a fair number of requests for CountVectorizer.

jrdzha commented 4 years ago

@stsievert I've been experimenting in my down time on a jupyter notebook -- I'm quite busy at the moment, so any help would be fantastic...

TomAugspurger commented 4 years ago

I'm looking into this today.

dask / dask-ml

CountVectorizer for text preprocessing #689