amueller / word_cloud

A little word cloud generator in Python
https://amueller.github.io/word_cloud
MIT License

Keeping together multi-word tokens #761

Open rjalexa opened 6 months ago

rjalexa commented 6 months ago

I want to build a word cloud from news articles, in which people and places are often described by more than one word (e.g. "New York" or "Ursula Von Der Leyen").

With the current tokenization, "New" becomes independent from "York", and the same happens with the parts of a name.

Is there a way to represent these "groups" so they stay together?

For example, I could imagine an input format such as: "Rome London 'New York' Biden 'Rishi Sunak' Mumbai", where the single quotes would mean "keep this string together".
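
To make the idea concrete, here is a minimal sketch of how such a quoted string could be split into tokens and counted. It uses only the standard library (`shlex` and `collections.Counter`); the helper name `phrase_frequencies` is made up for illustration and is not part of word_cloud.

```python
import shlex
from collections import Counter


def phrase_frequencies(text):
    """Count tokens, treating each single-quoted group as one token.

    Hypothetical helper, not part of word_cloud.
    """
    # shlex.split follows shell-like quoting rules, so 'New York'
    # comes back as the single token "New York".
    tokens = shlex.split(text)
    return Counter(tokens)


freqs = phrase_frequencies(
    "Rome London 'New York' Biden 'Rishi Sunak' Mumbai 'New York'"
)
```

The resulting mapping of phrase to count could then be passed to `WordCloud.generate_from_frequencies`, which takes a dict of frequencies directly instead of tokenizing raw text. One limitation of this sketch: a stray apostrophe in the input (as in "O'Brien") makes `shlex.split` raise a `ValueError`, so real article text would need escaping or a different delimiter.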

Thanks for any ideas and thank you SO MUCH for this wonderful library !!!!