amueller / word_cloud

A little word cloud generator in Python
https://amueller.github.io/word_cloud
MIT License
10.08k stars 2.31k forks

Detect phrases #283

Open butlerm1977 opened 7 years ago

butlerm1977 commented 7 years ago

Feature request here. I would like wordcloud to be able to detect commonly occurring phrases. For instance, rather than "hole in one" being counted as 3 different words, if the words appear in that order multiple times then the phrase could carry the weight, as opposed to the individual words. Other examples would be "toll road" or "run of the mill".

This feature could be turned on and off with a --phrase flag at the command line.

As an added bonus, the maximum phrase length could be tuned at the command line. For instance, --phrase=2 could look for phrases of at most 2 words.

amueller commented 7 years ago

This is actually implemented and turned on by default for phrases of length two. Run the "new hope" example and you'll see "death star" as a single phrase, or "United States" for the Constitution. This is controlled by the collocations parameter ("collocation" is the technical term for phrases detected this way). This could be expanded to longer phrases, but I haven't done that, mostly because I didn't need it and didn't have the time. Pull request welcome. The code includes a reference to the paper, or you can check out nltk. I'm using a heuristic to discount the words that make up the phrase. That would need to be extended to larger phrases, too, but that shouldn't be a problem.
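
To make the idea of longer phrases plus word discounting concrete, here is a minimal standard-library sketch. It is not wordcloud's code: the helper name count_phrases and its max_len and min_count parameters are hypothetical, and it uses a plain frequency threshold instead of the statistical test from the paper referenced in the code. Each accepted phrase takes its occurrences away from the counts of the words it contains.

```python
from collections import Counter
import re


def count_phrases(text, max_len=3, min_count=2):
    """Count words and repeated phrases of up to max_len words.

    Hypothetical helper, not part of wordcloud. A phrase is kept when it
    occurs at least min_count times; its occurrences are then subtracted
    from the counts of the words that make it up.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Longest phrases first, so "hole in one" wins over "hole in".
    for n in range(max_len, 1, -1):
        ngrams = Counter(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
        for phrase, freq in ngrams.items():
            if freq < min_count:
                continue
            needed = Counter(phrase.split())
            # Skip if earlier phrases already used up these words.
            if any(counts[w] < freq * k for w, k in needed.items()):
                continue
            counts[phrase] = freq
            for w, k in needed.items():
                counts[w] -= freq * k
    return +counts  # unary plus drops entries whose count fell to zero


if __name__ == "__main__":
    sample = "A hole in one! Another hole in one today. One more hole."
    print(count_phrases(sample).most_common(3))
```

The resulting mapping of words and phrases to counts could, under these assumptions, be passed on as frequencies for drawing the cloud. Processing the longest phrases first is a design choice that keeps a three-word phrase from being split into overlapping two-word phrases.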

amueller commented 6 years ago

Longer collocations might still be interesting.