I want to build a wordcloud from news articles, and in those, people and places might need more than one word to describe them (e.g. "New York" or "Ursula Von Der Leyen").
With the current tokenization, "New" becomes independent from "York", and the same happens with the parts of a name.
Is there a way to represent these "groups" so they stay together?
For example, I could imagine a format like:
"Rome London 'New York' Biden 'Rishi Sunak' Mumbai"
where the single quote would mean "keep the string together".
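
To make the idea concrete, here is a rough sketch of the splitting I have in mind, written in plain Python with the standard library's `shlex` and `collections.Counter`. The function names `split_keeping_groups` and `phrase_frequencies` are hypothetical, made up for this illustration; I am not claiming the library works this way or exposes anything like it.

```python
import shlex
from collections import Counter


def split_keeping_groups(text):
    """Split on whitespace, but keep quoted groups together as one token.

    Hypothetical helper for illustration only; not part of any library.
    """
    # shlex.split treats 'New York' as a single token and drops the quotes.
    return shlex.split(text)


def phrase_frequencies(text):
    """Count how often each word or quoted group occurs."""
    return Counter(split_keeping_groups(text))


tokens = split_keeping_groups("Rome London 'New York' Biden 'Rishi Sunak' Mumbai")
print(tokens)
# ['Rome', 'London', 'New York', 'Biden', 'Rishi Sunak', 'Mumbai']
```

One caveat with this sketch: a stray apostrophe in the text (as in "O'Brien") counts as an unclosed quote and makes `shlex.split` raise a `ValueError`, so real article text would need cleaning or a different delimiter first.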
Thanks for any ideas and thank you SO MUCH for this wonderful library !!!!