ahmetaa / zemberek-nlp

NLP tools for Turkish.

Foreign letters with diacritic marks cannot be processed correctly #162

Closed ahmetaa closed 6 years ago

ahmetaa commented 6 years ago

Analysis of "José" throws an exception when converting characters to TurkishLetters. Only a log message is shown, and the system returns a WordResult without an analysis. Perhaps we should not try to add this word to the StemTransitions data structure, so that the exception message is not triggered.

ahmetaa commented 6 years ago

This only occurs if a letter with a diacritic mark is used for phonetic attribute calculations.

ahmetaa commented 6 years ago

I added normalization of such letters to the system (only the ones in the ASCII table). However, a more general solution is required, as it still throws for inputs that contain other symbols such as ™ and ©. I will open a separate issue for it.
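
For illustration, here is a minimal sketch of the kind of diacritic normalization described above. It is not the actual Zemberek code: the class name `DiacriticNormalizer`, the method `normalize`, and the list of letters to preserve are all assumptions made for this example. It relies only on the standard `java.text.Normalizer` class. Letters that are treated here as part of Turkish spelling are kept, other accented letters are reduced to their base letter, and symbols such as ™ pass through unchanged, so they would still need separate handling.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Zemberek.
final class DiacriticNormalizer {

    // Assumed set of marked letters to keep as-is:
    // c-cedilla, g-breve, dotless i, o-umlaut, s-cedilla, u-umlaut (both cases),
    // plus the circumflexed vowels a, i, u (both cases).
    private static final String KEEP =
            "\u00e7\u011f\u0131\u00f6\u015f\u00fc\u00c7\u011e\u0130\u00d6\u015e\u00dc"
          + "\u00e2\u00ee\u00fb\u00c2\u00ce\u00db";

    static String normalize(String input) {
        // Compose first so that "e" + combining acute is handled like a single e-acute.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder sb = new StringBuilder(composed.length());
        for (int i = 0; i < composed.length(); i++) {
            char c = composed.charAt(i);
            if (c < 128 || KEEP.indexOf(c) >= 0) {
                sb.append(c); // plain ASCII or a preserved letter
                continue;
            }
            // Split into base letter + combining marks, then drop the marks.
            String decomposed = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
            sb.append(decomposed.replaceAll("\\p{M}", ""));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Jos\u00e9")); // prints "Jose"
    }
}
```

Because canonical decomposition (NFD) does not break up symbols like ™ or ©, this approach leaves them in the output, which is consistent with the remaining failure mentioned above.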