ahmetaa / zemberek-nlp

NLP tools for Turkish.

Tokenize Emoji characters. #180

Open ahmetaa opened 5 years ago

ahmetaa commented 5 years ago

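Emoji are tricky because Java Strings are encoded as 16-bit UTF-16 code units (the char type), so most emoji do not fit in a single char and are stored as a surrogate pair of two chars. We will need an external library and some pre- and post-processing when dealing with them.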
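
To illustrate the problem, here is a minimal sketch using only the standard JDK. It is not zemberek code; the class name `EmojiCodePoints` and the method `splitByCodePoint` are hypothetical names chosen for this example. It shows that `String.length()` counts chars rather than characters, and that iterating by code point keeps a surrogate pair together.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only, not part of zemberek.
class EmojiCodePoints {

    // Splits the text into one String per Unicode code point,
    // so the two chars of a surrogate pair are never separated.
    static List<String> splitByCodePoint(String text) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            result.add(new String(Character.toChars(codePoint)));
            // Advance by 1 char for BMP characters, 2 for supplementary ones.
            i += Character.charCount(codePoint);
        }
        return result;
    }

    public static void main(String[] args) {
        // "\uD83D\uDE00" is U+1F600 (grinning face) written as a surrogate pair.
        String s = "ok \uD83D\uDE00";
        System.out.println(s.length());                      // 5 chars
        System.out.println(s.codePointCount(0, s.length())); // 4 code points
        System.out.println(splitByCodePoint(s).size());      // 4 pieces
    }
}
```

Code point iteration only handles the surrogate pair issue. Emoji built from several code points (skin tone modifiers, flags, sequences joined with U+200D) would still be split by this sketch, which is probably where an external library or extra pre- and post-processing comes in.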