amueller / word_cloud

A little word cloud generator in Python
https://amueller.github.io/word_cloud
MIT License
10.19k stars, 2.32k forks

How many top % of words are taken for word cloud? #713

Closed (Abe410 closed 1 year ago)

Abe410 commented 1 year ago

Just curious, if we are generating words from text, then how many top words does the cloud use?

And if we generate the cloud using generate_from_frequencies, then how many top frequencies does it use?

amueller commented 1 year ago

Hi Abe. There's the "max_words" parameter that's 200 by default, so it'll use at most the 200 most frequent words. It's a fixed count, not a percentage. If you set it higher, it might not actually show all of them if the font gets too small, which depends on the font settings.
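To make the "top N by count" idea concrete, here is a minimal standard-library sketch. It is not wordcloud's own code: `top_words` is a hypothetical helper that only mimics the truncation step, on the assumption that words are ranked by frequency and everything past the cutoff is dropped.

```python
from collections import Counter


def top_words(frequencies, max_words=200):
    """Keep only the max_words most frequent entries.

    Hypothetical helper for illustration; it stands in for the
    truncation that a max_words-style cutoff implies.
    """
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:max_words])


counts = Counter("python data python cloud data python word".split())
kept = top_words(counts, max_words=2)
print(kept)  # {'python': 3, 'data': 2}

# With the library itself (not run here), the same cutoff would be set as:
#   WordCloud(max_words=2).generate_from_frequencies(counts)
```

The same limit applies whether the counts come from generate or from generate_from_frequencies. Words can still go missing below the limit if they no longer fit at the smallest allowed font size.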

Abe410 commented 1 year ago

Thank you.

One more question. If we make a wordcloud with generate, and separately build one with generate_from_frequencies from the output of a CountVectorizer with bigrams, is it the same thing?

amueller commented 1 year ago

Not entirely, since wordcloud uses collocation statistics to figure out which bigrams to use, so it doesn't just use the most frequent ones. The regex and normalization in wordcloud are also slightly different from those in CountVectorizer, but the main difference is the collocation statistics. Basically, generate_from_frequencies bypasses any tokenization logic: it assumes that you do all that yourself and will just plot whatever tokens you give it.
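A rough sketch of why a collocation statistic and a raw bigram count can disagree. It assumes the statistic is a Dunning-style likelihood ratio, which is how the library documentation describes the collocation_threshold parameter (default 30). The function names here are hypothetical, and this is the textbook formula rather than a copy of wordcloud's implementation.

```python
from math import log


def _log_likelihood(k, n, p):
    # Binomial log-likelihood of k hits in n trials; p is clamped to avoid log(0).
    return k * log(max(p, 1e-10)) + (n - k) * log(max(1 - p, 1e-10))


def dunning_score(c12, c1, c2, n):
    """Likelihood-ratio score for a bigram (w1, w2).

    c12: count of the bigram, c1: count of w1, c2: count of w2,
    n: total number of tokens (must be larger than c1).
    """
    p = c2 / n                    # chance of w2 anywhere
    p1 = c12 / c1                 # chance of w2 right after w1
    p2 = (c2 - c12) / (n - c1)    # chance of w2 after any other word
    null = _log_likelihood(c12, c1, p) + _log_likelihood(c2 - c12, n - c1, p)
    alt = _log_likelihood(c12, c1, p1) + _log_likelihood(c2 - c12, n - c1, p2)
    return -2 * (null - alt)


# Both bigrams occur 10 times in 1000 tokens, so a raw count ranks them equally.
strong = dunning_score(c12=10, c1=10, c2=10, n=1000)    # the words always co-occur
weak = dunning_score(c12=10, c1=100, c2=100, n=1000)    # co-occurrence is what chance predicts

print(strong > 30, abs(weak) < 1e-6)
```

A plain bigram count from CountVectorizer would treat the two pairs the same, while a thresholded score keeps only the first. If you prefer your own counts, passing them as a dict to generate_from_frequencies skips this filtering entirely.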

Abe410 commented 1 year ago

Thank you!