Closed jrdzha closed 4 years ago
Yeah, that should be relatively straightforward, especially if you provide a vocabulary
: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html, which will tell us the output shape without having to look at the data.
If a vocabulary
isn't provided then I think we'd need to make two passes over the data
Are you interested in working on this?
@lesteve you might find this issue interesting.
@TomAugspurger Yes this is something I'd be interested in contributing. I'm not familiar with dask internals, so I'll need to see...
Thanks. Let me know if you run into any stumbling blocks.
My guess is that the end result will
vocabulary
(the trickiest part I think)On Thu, Jul 2, 2020 at 10:26 PM Jared Zhao notifications@github.com wrote:
@TomAugspurger https://github.com/TomAugspurger Yes this is something I'd be interested in contributing. I'm not familiar with dask internals, so I'll need to see...
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dask/dask-ml/issues/689#issuecomment-653320239, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKAOIXZS76LXLBVOHRKWTTRZVFVJANCNFSM4ONACIQA .
Somehow determine the
vocabulary
(the trickiest part I think)
Why won't the naive solution work?
>>> import dask.array as da
>>> def get_vocab(dask_array):
... return da.unique(y)
>>> x = [["this", "is"], ["a", "sentence"]]
>>> y = da.from_array(x, chunks=(1, -1))
>>> vocab = get_vocab(y)
>>> vocab.compute()
array(['a', 'is', 'sentence', 'this'], dtype='<U8')
Doesn't compute
have to be called to use Scikit-learn's CountVectorizer?
@stsievert I believe that unique is not good enough, since that just counts unique strings. CountVectorizer is to break strings into n-gram words. For an example, "this is a sentence" can be broken into "this", "is", "a", "sentence" or "this is" "a sentence". So there is some non-trivial computation that needs to be done on each string record in a array.
@jrdzha I think this solution would work to compute the vocabulary:
# input
docs = da.from_array(["this is one document", "this is a second document"], chunks=1)
# Calculate vocab
est = CountVectorizer(...)
vocabs = docs.map_blocks(lambda x: est._count_vocab(x, est.fixed_vocabulary_)[0], dtype=dict)
vocabs = [set(vi) for vi in vocabs.compute()]
vocab = {w: i for i, w in enumerate(set().union(*vocabs))}
.compute
still needs to be called, which was my main question in https://github.com/dask/dask-ml/issues/689#issuecomment-653614496. Is there any way to avoid that?
@stsievert Well we just need to have a shared vocabulary across the workers, so I think maybe an Actor could work too?
@jrdzha I would run with the workflow in https://github.com/dask/dask-ml/issues/689#issuecomment-653814095. Avoiding a .compute
call would be nice but I don't think it's necessary. I think actors would get around the issue, but I'm not sure that complexity is warranted.
I don't think actors are necessary. They're designed for mutating shared state across workers. But for the no user-provided vocabulary case, there's no shared state to mutate. We don't know the shape of the output until we examine the raw documents.
Avoiding a .compute call would be nice but I don't think it's necessary.
I don't see a way to avoid it (other than users providing it ahead of time).
@TomAugspurger That makes sense to me. One question I have though is: wouldn't calling compute make Dask behave in a non-lazy way? Users may not expect that behavior?
In dask-ml, fit
always computes. The only question here is whether or not we have one or two passes over the data.
@TomAugspurger That makes sense to me. I think we do need 2 calls to compute. This is my understanding:
One question I have (I'm not too familiar with dask internals) is, if each partition blows up in size, how can I "repartition" from each worker? aka split a large partition into n partitions, and potentially requesting more workers from the scheduler?
if each partition blows up in size, how can I "repartition" from each worker?
Generally, I think this falls on the user rather than the algorithm. If the transformed version is much larger than the original then it's easiest for them to rechunk prior to using it.
@jrdzha what progress have you made on this issue? I can help out if you'd like. I'm excited for this PR because I've handled a fair number of requests for CountVectorizer.
@stsievert I've been experimenting in my down time on a jupyter notebook -- I'm quite busy at the moment, so any help would be fantastic...
I'm looking into this today.
I understand why hash seems to be a better solution for distributed text preprocessing, but I also need a way to make my features human-readable. It seems like spark has a CountVectorizer. Would it be possible to implement one for dask-ml?