diasks2 / pragmatic_tokenizer

A multilingual tokenizer to split a string into tokens
MIT License
90 stars, 11 forks

Properly detect emoticons #25

Open diasks2 opened 8 years ago

diasks2 commented 8 years ago
#4
it 'preserves emoticons' do
  text = "lol :-D"
  pt = PragmaticTokenizer::Tokenizer.new(text, downcase: false)
  expect(pt.tokenize).to eq(
    ["lol", ":-D"]
  )
end

maia commented 8 years ago

I just came across retext-emoji and wonder whether it might be smart to convert emoticons to emoji:

When encode, converts short-codes into their unicode equivalent (e.g., :heart: and <3 to ❤️)

While I think it might be too much to attempt to convert all possible emoticons, one could pragmatically do so for the 10-20 most common emoticons very early in the processing, and only later handle the remaining punctuation.
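
As a rough illustration of that early conversion step, here is a minimal Ruby sketch. It is not part of pragmatic_tokenizer: the mapping, the constant names, and the `emoticons_to_emoji` helper are all hypothetical, and the sketch assumes an emoticon only counts when it is delimited by whitespace or the string boundaries (so `x<30` is left alone, but an emoticon followed directly by a period would be missed).

```ruby
# Hypothetical mapping for illustration only; a real list would hold
# the 10-20 most common emoticons.
EMOTICON_TO_EMOJI = {
  ":-D" => "😃",
  ":-)" => "🙂",
  ":)"  => "🙂",
  ":-(" => "🙁",
  "<3"  => "❤️"
}.freeze

# Regexp.union escapes each key. The lookarounds require whitespace
# (or the start/end of the string) on both sides of the emoticon.
EMOTICON_PATTERN = /(?<!\S)#{Regexp.union(EMOTICON_TO_EMOJI.keys)}(?!\S)/

# Replace every known emoticon with its emoji before any punctuation
# handling takes place.
def emoticons_to_emoji(text)
  text.gsub(EMOTICON_PATTERN) { |match| EMOTICON_TO_EMOJI.fetch(match) }
end

puts emoticons_to_emoji("lol :-D")
```

Once the emoticons are single emoji characters, a later punctuation pass no longer sees the colons, hyphens, and brackets they were made of.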

diasks2 commented 8 years ago

Interesting idea. It might be worth doing internally so that we don't confuse emoticons with other punctuation. However, in the output returned to the user I am something of a purist: I think it should match the original input (i.e., <3 in the input text would not return ❤️ as a token).
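
One way to get both properties (internal protection, unchanged output) is to mask emoticons with placeholders and restore the original text afterwards. The following is only a sketch under stated assumptions, not the tokenizer's actual implementation: the emoticon list, the constant names, and `tokenize_preserving_emoticons` are hypothetical, the `scan` call is a crude stand-in for the real punctuation handling, and it assumes the input never contains the private-use character U+E000.

```ruby
# Hypothetical emoticon list for illustration only.
EMOTICONS = [":-D", ":-)", ":-(", ":)", ":(", "<3"].freeze
EMOTICON_RE = /(?<!\S)#{Regexp.union(EMOTICONS)}(?!\S)/

# Private-use character used to delimit placeholders
# (assumed not to occur in the input).
MARK = "\uE000"
PLACEHOLDER_RE = /\A#{MARK}(\d+)#{MARK}\z/

def tokenize_preserving_emoticons(text)
  saved = []

  # 1. Mask each emoticon with an indexed placeholder.
  masked = text.gsub(EMOTICON_RE) do |match|
    saved << match
    "#{MARK}#{saved.length - 1}#{MARK}"
  end

  # 2. Stand-in for the real punctuation handling: placeholders stay
  #    whole, words stay whole, every other symbol becomes its own token.
  tokens = masked.scan(/#{MARK}\d+#{MARK}|\w+|[^\w\s]/)

  # 3. Restore the original emoticon text so the output matches the input.
  tokens.map do |token|
    m = PLACEHOLDER_RE.match(token)
    m ? saved[m[1].to_i] : token
  end
end

p tokenize_preserving_emoticons("lol :-D")
```

With this approach `<3` in the input comes back as the token `<3`, never as an emoji, while the punctuation pass in step 2 cannot split it into `<` and `3`.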