amueller / word_cloud

A little word cloud generator in Python
https://amueller.github.io/word_cloud
MIT License
10.07k stars 2.31k forks

Recognizing chunks? #145

Closed fedefilo closed 8 years ago

fedefilo commented 8 years ago

Is there a way to configure the library to recognize text chunks such as 'World War II' or 'Federal Republic of Germany'? Thank you! It's really a great library!

amueller commented 8 years ago

No. You can use a custom tokenizer to check for bigrams or trigrams (i.e. pairs of two or three words; check out scikit-learn, nltk or spacy). But it will find all kinds of words that simply appear next to each other often. If you're lucky, the ones that are frequent are meaningful; it depends on your text. If you are particularly interested in names like the examples you gave, you need what is called "named entity recognition". Maybe check whether nltk or spacy has a ready-made solution for that.
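
For example, here is a minimal sketch of counting bigrams and trigrams with scikit-learn's CountVectorizer (the sample text is just a placeholder, not from your data):

```python
# Rough sketch only: count how often each bigram/trigram appears using
# scikit-learn's CountVectorizer. The sample document is a placeholder.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the Federal Republic of Germany was founded after World War II"]

# ngram_range=(2, 3) collects pairs and triples of adjacent words
vectorizer = CountVectorizer(ngram_range=(2, 3))
counts = vectorizer.fit_transform(docs).sum(axis=0).A1

# map each n-gram to its total count across the corpus
ngram_counts = {term: counts[idx] for term, idx in vectorizer.vocabulary_.items()}
print(sorted(ngram_counts.items(), key=lambda kv: -kv[1])[:10])
```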

It's relatively easy to do just pairs of words by adjusting the regexp, but if you want both pairs of words and single words, you need to build a dictionary mapping words to frequencies yourself and call the from_frequencies function.
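
A rough sketch of that route (in current wordcloud versions the method is called generate_from_frequencies; the text here is illustrative only):

```python
# Sketch: build a word-to-frequency dict that mixes single words and word
# pairs, then hand it to wordcloud directly. The text is a placeholder.
from collections import Counter
from wordcloud import WordCloud

text = "World War II reshaped the Federal Republic of Germany"
tokens = text.lower().split()

unigrams = Counter(tokens)
bigrams = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
frequencies = dict(unigrams + bigrams)  # merge both counters into one dict

wc = WordCloud().generate_from_frequencies(frequencies)
wc.to_file("combined_cloud.png")
```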

fedefilo commented 8 years ago

Thank you very much for your answer!

I am working with NLTK to isolate the chunks. I made a couple of tries with the from_frequencies method and it worked fine.
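
(A rough sketch of such a pipeline using NLTK's named-entity chunker; the exact code isn't shown in this thread, and the snippet assumes the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words data packages have been downloaded:)

```python
# Sketch: extract named-entity chunks with NLTK and feed their counts to
# wordcloud. Placeholder text; requires several nltk.download(...) packages.
from collections import Counter
import nltk
from wordcloud import WordCloud

text = "The Federal Republic of Germany rebuilt after World War II."

# tokenize, POS-tag, then chunk named entities into small subtrees
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))

# keep only the entity subtrees, joined back into multi-word phrases
chunks = [" ".join(word for word, tag in subtree.leaves())
          for subtree in tree if isinstance(subtree, nltk.Tree)]

wc = WordCloud().generate_from_frequencies(Counter(chunks))
wc.to_file("chunk_cloud.png")
```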

An extra question, not directly related to wordcloud, but maybe you can point me in the right direction. I am researching the evolution of keyword usage (thematic agendas) in an academic journal. I can make a word cloud of the whole journal or of selected issues, which gives me a static picture of the most frequently used keywords. My question is whether you can imagine a good way to visualize the evolution of keyword usage over time. I haven't found a graph that easily communicates the changes and trends in keyword usage. Thanks in advance, and sorry if the question is too far off-topic! Federico

amueller commented 8 years ago

You could track the evolution of topics over time: http://www.cs.columbia.edu/~blei/papers/WangBleiHeckerman2008.pdf

You could do a hacky version of that by fitting the latent Dirichlet allocation model from scikit-learn and using partial_fit to update it for each time period (if you refit it from scratch for each period, you probably won't be able to match the topics across periods).
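
A very rough sketch of that hack (docs_by_period is a made-up {period: [documents]} mapping, and the vocabulary is fixed up front so every partial_fit call sees the same feature space):

```python
# Sketch: one LDA model updated per time period with partial_fit, so topics
# stay aligned across periods. docs_by_period is illustrative data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs_by_period = {
    2014: ["crisis economica en america latina", "cultura politica y democracia"],
    2015: ["nuevos estudios de cultura politica", "america latina y la crisis"],
}

# fit the vocabulary once on everything so every period maps to the same columns
vectorizer = CountVectorizer()
vectorizer.fit(doc for docs in docs_by_period.values() for doc in docs)
vocab = [t for t, i in sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1])]

lda = LatentDirichletAllocation(n_components=5, random_state=0)

for period, docs in sorted(docs_by_period.items()):
    lda.partial_fit(vectorizer.transform(docs))
    # print the top words per topic after each update to watch them drift
    for topic_idx, weights in enumerate(lda.components_):
        top_words = [vocab[i] for i in weights.argsort()[::-1][:3]]
        print(period, topic_idx, top_words)
```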

ps: I'd be interested in seeing your results with nltk ;)

fedefilo commented 8 years ago

Thank you very much for the paper; it is a bit above my math level, but I will ask a friend for help.

It wasn't that easy with nltk because I realized I needed to train a POS tagger for Spanish first. However, I was able to extract keywords directly from the journal (instead of working with abstracts and titles) that are already recognized as chunks. Using the from_frequencies method I got this cloud, for example: http://imgur.com/GB91e2i
Some chunks such as America Latina, crisis económica, and cultura política can be seen.

amueller commented 8 years ago

nice :)

amueller commented 8 years ago

This example might help: https://github.com/amueller/word_cloud/blob/master/examples/bigrams.py