Open butlerm1977 opened 7 years ago
This is actually implemented and turned on by default for phrases of length two. Run the "new hope" example and you'll see "death star" as a single phrase, or "United States" for the Constitution. This is controlled by the collocations parameter ("collocation" is the technical term for phrases detected this way). This could be expanded to longer phrases, but I haven't done that, mostly because I didn't need it and didn't have the time. Pull request welcome. There's a reference in the code for the paper, or you can check out nltk. I'm using a heuristic to discount the words that make up the phrase. That would need to be extended to larger phrases, too, but that shouldn't be a problem.
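To make the discounting idea concrete, here is a minimal sketch in plain Python. It is not the library's code: the function name `count_with_bigrams` and the `min_count` threshold are made up for illustration, and a simple repeat count stands in for whatever statistical scoring the referenced paper uses.

```python
from collections import Counter

def count_with_bigrams(words, min_count=2):
    """Count words, promoting repeated adjacent pairs to two-word phrases.

    Hypothetical helper: a raw frequency threshold replaces real
    collocation scoring.
    """
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    counts = Counter(unigrams)
    for (first, second), n in bigrams.items():
        if n < min_count:
            continue
        counts[f"{first} {second}"] = n
        # Discount the component words so the phrase is not counted twice.
        counts[first] -= n
        counts[second] -= n
    # Drop words that were fully absorbed into phrases.
    return Counter({w: c for w, c in counts.items() if c > 0})

words = "death star plans death star base rebel base".split()
counts = count_with_bigrams(words)
print(counts)
```

In this toy input "death star" ends up as one entry, and "death" and "star" disappear on their own because every occurrence was part of the phrase. Extending this to longer phrases would mean discounting shorter phrases as well as single words.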
Having longer collocations might be interesting still.
Feature request here. I would like wordcloud to be able to detect commonly occurring phrases. For instance, rather than "hole in one" being counted as 3 separate words, if the words appear in that order multiple times then the phrase itself could carry the weight, as opposed to the individual words. Other examples would be "toll road" or "run of the mill".
This feature could be turned on and off with a --phrase flag at the command line.
As an added bonus, the phrase detection length could be fine-tuned at the command line. For instance, --phrase=2 could parse for phrases with a maximum length of 2 words.
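As a rough illustration of what a maximum phrase length could mean, here is a small sketch in plain Python. The `--phrase` option is only the flag proposed above, not an existing wordcloud option, and `count_phrases` and `min_count` are hypothetical names chosen for this example.

```python
import argparse
from collections import Counter

def count_phrases(words, max_len=2, min_count=2):
    """Count n-grams of length 2..max_len that occur at least min_count times."""
    phrases = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            phrases[" ".join(words[i:i + n])] += 1
    return Counter({p: c for p, c in phrases.items() if c >= min_count})

# Hypothetical CLI: --phrase=N sets the maximum phrase length in words.
parser = argparse.ArgumentParser()
parser.add_argument("--phrase", type=int, default=2)
args = parser.parse_args(["--phrase=3"])

words = "a hole in one is rare but a hole in one happens".split()
phrases = count_phrases(words, max_len=args.phrase)
print(phrases)
```

With a maximum length of 3 the repeated "hole in one" is picked up, while a maximum of 2 would only find pairs such as "hole in". A real implementation would also have to decide how to discount the shorter phrases and single words that a longer phrase contains.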