amueller / word_cloud

A little word cloud generator in Python
https://amueller.github.io/word_cloud
MIT License

filter phrases in the wordcloud #560

Open junxu-ai opened 4 years ago

junxu-ai commented 4 years ago

Description

It seems that the current code cannot filter out phrases of more than one word.

Steps/Code to Reproduce

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt

strs=""" One possible pattern possible pattern possible pattern possible pattern possible pattern is to run the long callback as a background task which records its process to a shared file system or database, and then have the app periodically check the progress to update the progress bar. Here’s 126 a little demo of that. How would this implementation work? In my situation, I click a button which runs a callback function. The callback function does a long process with a big for loop. How can I simultaneously extract the loop number for the progress bar given that my function is running inside a callback? Ideally the background process would notify the app directly rather than the app polling for updates, but I think for that type of thing we’d need websocket support in Dash? Not actually totally sure. The paper named “Attention is All You Need” by Vaswani et al is one of the most important contributions to Attention so far. They have redefined Attention by providing a very generic and broad definition of Attention based on key, query, and values. They have referenced another concept called multi-headed Attention. Let’s discuss this briefly.

First, let’s define what “self-Attention” is. Cheng et al, in their paper named “Long Short-Term Memory-Networks for Machine Reading”, defined self-Attention as the mechanism of relating different positions of a single sequence or sentence in order to gain a more vivid representation.

Machine reader is an algorithm that can automatically understand the text given to it. We have taken the below picture from the paper. The red words are read or processed at the current instant, and the blue words are the memories. The different shades represent the degree of memory activation.

When we are reading or processing the sentence word by word, where previously seen words are also emphasized on, is inferred from the shades, and this is exactly what self-Attention in a machine reader does. """

STOPWORDS.add('possible pattern')

wordclouds=WordCloud(stopwords=STOPWORDS, collocations=True).generate(strs)
plt.figure()
plt.imshow(wordclouds)

After looking into the code, I wonder whether an additional filter could be added in "unigrams_and_bigrams", e.g.,

bigrams = list(p for p in pairwise(words) if not any(w in stopwords for w in p))
# --> insert a further check here to drop any bigram that is itself listed as a stopword phrase
n_words = len(words)

Below is some sample code:

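def bigram_stop(stopwords):
    # Collect the two-word entries of the stopword list as (first, second) tuples.
    bigram_tup_stop = []
    for w in stopwords:
        # split() without arguments also copes with repeated or stray whitespace.
        s = w.strip().split()
        if len(s) == 2:
            bigram_tup_stop.append((s[0], s[1]))
    return bigram_tup_stop

def remove_bigram(bigrams, bigram_tup_stop):
    # Keep only the bigrams that are not listed as stopword phrases.
    blocked = set(bigram_tup_stop)
    return [b for b in bigrams if b not in blocked]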

bigrams=remove_bigram(bigrams, bigram_stop(STOPWORDS))

amueller commented 4 years ago

I think this is a duplicate of #558 and fixed in the current master branch; do you want to check? I'm working on getting out a new release.

junxu-ai commented 4 years ago

Sounds great! Thanks!

The code above is intended to remove two-word phrases (it can be extended to longer phrases). The current implementation seems to remove only single words.

amueller commented 4 years ago

Indeed, the fix will still only filter out bigrams that contain a particular word, not a particular combination of words. Can you give me an example of where that would be useful? You can always do custom preprocessing and pass the result to generate_from_frequencies. It's really hard to bake in options for all the things that people might want.
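The custom-preprocessing route mentioned above could look roughly like the sketch below. It is only an illustration: `phrase_frequencies` is a hypothetical helper, not part of wordcloud, and its plain counting of every word and every adjacent pair is an assumption that does not reproduce the library's own collocation handling or its single-word stopword filtering.

```python
import re
from collections import Counter


def phrase_frequencies(text, phrase_stopwords):
    """Count words and adjacent word pairs, skipping the listed two-word phrases.

    Hypothetical helper for illustration; not part of the wordcloud package.
    """
    # Lowercase and tokenize into simple word tokens (apostrophes kept inside words).
    words = re.findall(r"\w[\w']*", text.lower())
    # Normalize each stopword phrase to a tuple of lowercase words.
    blocked = {tuple(p.lower().split()) for p in phrase_stopwords}
    counts = Counter(words)
    for pair in zip(words, words[1:]):
        if pair not in blocked:
            counts[" ".join(pair)] += 1
    return dict(counts)


freqs = phrase_frequencies(
    "possible pattern possible pattern is a good example",
    ["possible pattern"],
)
# freqs is a plain {string: count} dict and could then be handed to
# WordCloud().generate_from_frequencies(freqs).
```

The individual words "possible" and "pattern" stay in the result; only the listed pair is dropped.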

junxu-ai commented 4 years ago

Sorry, maybe I made it more confusing. Actually, I only intend to filter out some particular bigrams, not the combinations, e.g., STOPWORDS.add('possible pattern') in the example above. Thanks.

amueller commented 4 years ago

Ok, got it. Can you give an example of where or why you'd do that?

junxu-ai commented 4 years ago

In one of our applications, we need to filter out some very common bigrams from the word cloud, as those bigrams appear in almost every document, and thus it makes no sense to include them.

amueller commented 4 years ago

To make sure I understand what you're looking for: you want the individual words to be counted unless they appear together? So "possible pattern pattern" would count "possible" zero times and "pattern" once?

You can always do doc.replace("my pattern", "") btw, does that not solve the issue?
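A slightly more forgiving variant of the `doc.replace` idea is sketched below. `strip_phrases` is a hypothetical name, and the choice to ignore case and to tolerate extra whitespace between the words is an assumption, not something the thread specifies.

```python
import re


def strip_phrases(text, phrases):
    """Remove each phrase from text, matching whole words and ignoring case.

    Hypothetical helper for illustration; not part of the wordcloud package.
    """
    for phrase in phrases:
        parts = phrase.split()
        if not parts:
            continue  # skip empty entries
        # Allow any run of whitespace between the words of the phrase.
        body = r"\s+".join(re.escape(w) for w in parts)
        # Replace with a space so the neighbouring words do not get glued together.
        text = re.sub(r"\b" + body + r"\b", " ", text, flags=re.IGNORECASE)
    return text


cleaned = strip_phrases("possible pattern pattern", ["possible pattern"])
mixed = strip_phrases("One possible pattern is a Possible  Pattern here", ["possible pattern"])
```

For the "possible pattern pattern" case asked about above, only the trailing "pattern" survives. One caveat: after removal, the words on either side of the phrase become adjacent and may themselves be counted as a bigram.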

junxu-ai commented 4 years ago

In my experience, the current implementation cannot directly filter out bigram stopwords, e.g. "good example", "bigram words", "network cloud", etc. That is, if I put these phrases into the stopword list directly, they will still appear in the word cloud. Note that I don't want to filter out the single words, e.g., "example", "good", "network". Hope that clarifies it.

amueller commented 4 years ago

Ok yes that's clear. But you could do doc.replace("my pattern", "") right?

junxu-ai commented 4 years ago

Thanks! Yes, you are right. However, the stopword list is used for several purposes. I didn't separate the single words from the bigram phrases, to keep the list consistent and readable.
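If the list has to stay mixed, one option is to split it at the point of use. The sketch below assumes entries are separated into words by whitespace; `split_stopwords` is a hypothetical helper, not a wordcloud function.

```python
def split_stopwords(stopwords):
    """Split a mixed stopword collection into single words and multi-word phrases.

    Hypothetical helper for illustration; not part of the wordcloud package.
    """
    singles, phrases = set(), set()
    for entry in stopwords:
        parts = entry.lower().split()
        if len(parts) == 1:
            singles.add(parts[0])
        elif len(parts) > 1:
            # Normalize inner whitespace so "good   example" and "good example" match.
            phrases.add(" ".join(parts))
    return singles, phrases


singles, phrases = split_stopwords({"the", "possible pattern", "  good   example "})
```

The single words could then go to `WordCloud(stopwords=singles)`, while the phrases are handled in a preprocessing step before the text reaches `generate`.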

amueller commented 4 years ago

You could remove the single word stopwords the same way though if you want to send a PR.