dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Distributed TFIDF #5

Open dsevero opened 6 years ago

dsevero commented 6 years ago

Greetings!

I recently used dask to implement a distributed version of TF-IDF, and I'd like to contribute it to the dask project.

Would this be the correct repo?

I thought maybe a feature_extraction directory would be appropriate.

TomAugspurger commented 6 years ago

Things are still a bit in flux, but I think this would be a good home. This would be a great addition!

I thought maybe a feature_extraction directory would be appropriate.

Perfect. I'm trying to follow scikit-learn's layout as closely as possible.

dsevero commented 6 years ago

I see.

What's the difference between this repo and dask-glm? Can't the latter be a subset of the former?

TomAugspurger commented 6 years ago

What's the difference between this repo and dask-glm?

I'm probably going to just import the dask-glm estimators into dask-ml namespace (likewise with dask-searchcv, dask-patternsearch). For the user, it'd be nice to have a single place to go for all dask-related ML things.

Development will probably still continue in those other repositories.

bendruitt commented 6 years ago

I'd be interested in seeing this code. How was this achieved?

dsevero commented 6 years ago

My first implementation looked something like the following. (I've reworked it to remove company-private code, and this was well before I started contributing to dask, so my knowledge was far more limited than it is today.)

import numpy as np
import dask.bag as db

from toolz.curried import frequencies, valmap, merge_with, unique

def normalize(_dict):
    # Divide each value by the total so the values sum to 1.
    total = sum(_dict.values())
    return {k: v / total for k, v in _dict.items()}

corpus = 2*[
    "a a b b b",
    "a a a a c",
    "c b a d e f g",
    "a b b b c c c d d d a b",
    "c b a d e f g",
    "a b b b c c c d d d a b"
]

base = (
    db
    .from_sequence(corpus, npartitions=4)
    .map(str.split)
)

tf = (
    base
    .map(frequencies)  # term counts per document
    .map(normalize)    # term frequencies per document
)

idf = (
    base
    .map(unique)       # distinct terms per document
    .map(frequencies)  # 1 per distinct term per document
    .reduction(merge_with(sum), merge_with(sum), split_every=2)  # document frequencies
    .apply(normalize)
    .apply(valmap(np.log10))
    .apply(valmap(np.negative))  # -log10 of the document-frequency fraction
)

This has a drawback: it must make a full pass over the entire dataset just to compute the idf step.

To me, a better solution (and what I want to implement here) would be to first implement the hashing trick and then compute the tfidf on top of it.
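The hashing trick mentioned above maps tokens directly into a fixed-size index space, so term counts can be computed per document without first building a global vocabulary. A minimal sketch (not the author's implementation; the hash function and feature count here are illustrative choices):

```python
import dask.bag as db
from zlib import crc32

N_FEATURES = 16  # small hash space for illustration; real uses are much larger

def hash_counts(doc):
    # Map each token to a fixed-size index space via a stable hash.
    # Collisions are the price paid for not needing a vocabulary pass.
    counts = [0] * N_FEATURES
    for token in doc.split():
        counts[crc32(token.encode()) % N_FEATURES] += 1
    return counts

corpus = ["a a b b b", "a a a a c", "c b a d e f g"]
tf = db.from_sequence(corpus, npartitions=2).map(hash_counts)
```

Because hashing is stateless, each partition can be processed independently with no shared vocabulary, which is exactly what avoids the extra pass.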

bendruitt commented 6 years ago

Thanks Daniel,

Looks good. My end goal is to map multiple text columns in a dask dataframe to n-gram vectors suitable for supervised machine learning, in combination with other non-text columns. I'll have a play around with what you've done here and post back my results. I don't think the hashing trick will work for me, as I'll need feature mapping at the end of all this.

dsevero commented 6 years ago

Cool.

The problem I faced when I decided to implement this was similar to yours. Our implementation originally used the get_dummies function in pandas on the text columns.

bendruitt commented 6 years ago

Yep. I've used the get_dummies function to process my categorical columns via map_partitions on a dask dataframe; marking the input columns as categorical handles the schema merging per partition. What strategy does your original implementation use to keep partition schemas consistent?

rth commented 6 years ago

An alternative approach to using dask bag could be to apply scikit-learn CountVectorizer or HashingVectorizer on chunks of the dataset, merge the results and then apply IDF weighting (see https://github.com/FreeDiscovery/FreeDiscovery/issues/152). It might require somewhat less work since different vectorization options are already implemented in scikit-learn, and it should be possible to keep a fairly compatible API.

mrocklin commented 6 years ago

+1 on avoiding bag in performance sensitive code :)


stsievert commented 3 years ago

RAPIDS already has an implementation of distributed TFIDF: rapidsai/cuml/python/cuml/dask/feature_extraction/text/tfidf_transformer.py#L31.

They explicitly support Dask Arrays: https://github.com/rapidsai/cuml/blob/147f795e03a1b3f3b53aa545385cc5025f86a2f0/python/cuml/dask/feature_extraction/text/tfidf_transformer.py#L140

mrocklin commented 3 years ago

Beware that cuml typically does relatively GPU-centric things. Their algorithms are rarely generalizable to non-RAPIDS use cases. I would be curious to see what a CPU implementation would look like with more traditional dataframe/array operations.
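One possible shape for the CPU implementation asked about above, using plain dask array operations on a (documents × terms) count matrix (the counts are made up for illustration, and the smoothing follows scikit-learn's convention rather than anything in this thread):

```python
import numpy as np
import dask.array as da

# Assume a (documents x terms) count matrix already exists as a dask array.
counts = da.from_array(
    np.array([[2, 3, 0, 0],
              [4, 0, 1, 0],
              [1, 1, 1, 1]]),
    chunks=(2, 4),
)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)              # document frequency per term
idf = da.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf
tfidf = counts * idf                       # broadcast idf across rows
```

Everything here is an ordinary reduction or elementwise broadcast, so it stays generalizable across dask's schedulers rather than depending on GPU-specific kernels.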


cakiki commented 2 years ago

Sorry to revive an old thread, but was this implemented anywhere?