LIAAD / yake

Single-document unsupervised keyword extraction
https://liaad.github.io/yake

How to speed up the application to 100k documents? #49

Closed. skwolvie closed this issue 3 years ago.

skwolvie commented 3 years ago

Hi, it works well with one document. However, if I want to apply this kw_extractor to 100k rows of documents with pandas apply, it takes more than 2 days to complete. Is there any way to speed up this process?

CODE:
from nltk.corpus import stopwords
import yake

st = set(stopwords.words('japanese'))

def keywords_yake(sample_post):
    # extract keywords for each post & turn them into a text string "sentence"
    simple_kwextractor = yake.KeywordExtractor(n=3,
                                               lan='ja',
                                               dedupLim=.99,
                                               dedupFunc='seqm',
                                               windowsSize=1,
                                               top=1000,
                                               features=None,
                                               stopwords=st)

    post_keywords = simple_kwextractor.extract_keywords(sample_post)

    # join the extracted keyword strings into one space-separated string (scores are discarded)
    return " ".join(word for word, score in post_keywords)

df['keywords'] = df['docs'].apply(keywords_yake)
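
A side observation on the snippet above (not something discussed in this thread): the KeywordExtractor is rebuilt for every row, and pandas apply processes rows one at a time in a single process. Below is a minimal sketch of reusing one extractor and spreading the rows over worker processes with the standard library's multiprocessing; the worker count is an illustrative value, and df is assumed to be the DataFrame with a 'docs' column from the snippet above.

from multiprocessing import Pool

from nltk.corpus import stopwords
import yake

st = set(stopwords.words('japanese'))

# build the extractor once instead of once per row
kw_extractor = yake.KeywordExtractor(n=3, lan='ja', dedupLim=.99,
                                     dedupFunc='seqm', windowsSize=1,
                                     top=1000, features=None, stopwords=st)

def keywords_yake(sample_post):
    post_keywords = kw_extractor.extract_keywords(sample_post)
    return " ".join(word for word, score in post_keywords)

if __name__ == "__main__":
    # df is the DataFrame with a 'docs' column from the snippet above;
    # processes=8 is only an example value, tune it to your machine
    with Pool(processes=8) as pool:
        df['keywords'] = pool.map(keywords_yake, df['docs'].tolist())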
arianpasquali commented 3 years ago

Hi @skwolvie, for such a volume I would recommend using Spark NLP's YAKE implementation.

You can find more about it here https://nlp.johnsnowlabs.com/docs/en/annotators#yake

Since you can distribute Spark processing I think it could be a good fit for your use case.
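
For anyone looking for a starting point, below is a rough sketch of what a Spark NLP pipeline around the linked YAKE annotator can look like. It is only an illustration: the annotator class name (YakeKeywordExtraction) and its setters follow the linked documentation for recent Spark NLP versions and may differ in older releases, and the column names ('docs', 'keywords') are assumptions.

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, YakeKeywordExtraction
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Spark DataFrame with one document per row in a "docs" column (assumed name)
sdf = spark.createDataFrame([("some document text here",)], ["docs"])

document_assembler = DocumentAssembler() \
    .setInputCol("docs") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# YAKE annotator, roughly mirroring the n=3 / top=1000 settings from the pandas version
keywords = YakeKeywordExtraction() \
    .setInputCols(["token"]) \
    .setOutputCol("keywords") \
    .setMinNGrams(1) \
    .setMaxNGrams(3) \
    .setNKeywords(1000)

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, keywords])
result = pipeline.fit(sdf).transform(sdf)

# each row now carries a "keywords" annotation column; flatten it as needed
result.selectExpr("explode(keywords.result) as keyword").show(truncate=False)

Because the transform runs as a distributed Spark job, the same pipeline scales to 100k documents by partitioning the DataFrame across executors.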

skwolvie commented 3 years ago

Can you please provide a kickstart on how to do it with Spark NLP? Also, if that is not possible, I would like to understand why it takes so long with the pandas apply method. It takes less than a second to apply YAKE to one document, but the time taken increases exponentially as the number of documents increases. 100k rows is not a huge dataset.

mrigankgupta commented 2 years ago

I felt that deduplication is where it spends more time than anything else.
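
If deduplication does dominate, one cheap way to check on your own data (a suggestion, not something from this thread) is to time a single extraction while shrinking top: deduplication compares candidate keywords against each other, so the top=1000 setting used above keeps a lot of candidates in play. A minimal sketch, assuming sample_post holds one of your own documents:

import time

import yake

sample_post = "replace this with one of your own documents"  # placeholder text

# time a single extraction for different values of `top`; the other
# parameters mirror the snippet earlier in the thread
for top in (1000, 100, 20):
    extractor = yake.KeywordExtractor(n=3, lan='ja', dedupLim=.99,
                                      dedupFunc='seqm', windowsSize=1, top=top)
    start = time.perf_counter()
    extractor.extract_keywords(sample_post)
    print(f"top={top}: {time.perf_counter() - start:.3f}s")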