Open zafercavdar opened 5 years ago
I was using Polyglot to tokenize documents containing hashtags in Swedish. I noticed that for any hashtag that follows an emoji, the tokenizer splits off the hashtag's first letter as a separate token. For example, for the given document:

```python
doc = "#hållbarhet #miljövänligt 😊 #ecofashion #levaobo #miljösmart #influencers"
words = Text(doc, hint_language_code='sv').words
print(words)
```
it prints
```
WordList(['#', 'hållbarhet', '#', 'miljövänligt', '😊', '#', 'e', 'cofashion', '#', 'l', 'evaobo', '#', 'm', 'iljösmart', '#', 'i', 'nfluencers'])
```
While `#hållbarhet` is tokenized as `["#", "hållbarhet"]`, `#ecofashion` turns into `["#", "e", "cofashion"]`.
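Until the tokenizer itself is fixed, the broken output can be repaired in post-processing. The sketch below is a workaround of my own, not part of Polyglot: it walks the token list and re-joins any stray single letter that sits between a `#` and the rest of a word. The heuristic assumes a lone alphabetic character right after `#` is a split artifact, so a genuine one-letter hashtag followed by a word would be merged incorrectly.

```python
def merge_split_hashtags(tokens):
    """Re-join hashtags split into '#', first letter, remainder.

    Operates on a plain list of token strings, e.g. the contents of
    a Polyglot WordList.
    """
    merged = []
    i = 0
    while i < len(tokens):
        if (
            tokens[i] == "#"
            and i + 2 < len(tokens)
            and len(tokens[i + 1]) == 1
            and tokens[i + 1].isalpha()
            and tokens[i + 2].isalpha()
        ):
            # e.g. '#', 'e', 'cofashion' -> '#', 'ecofashion'
            merged.append("#")
            merged.append(tokens[i + 1] + tokens[i + 2])
            i += 3
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ['#', 'hållbarhet', '#', 'miljövänligt', '😊',
          '#', 'e', 'cofashion', '#', 'l', 'evaobo',
          '#', 'm', 'iljösmart', '#', 'i', 'nfluencers']
print(merge_split_hashtags(tokens))
# → ['#', 'hållbarhet', '#', 'miljövänligt', '😊', '#', 'ecofashion',
#    '#', 'levaobo', '#', 'miljösmart', '#', 'influencers']
```

Intact hashtags pass through unchanged because their first token after `#` is longer than one character.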