duydo / elasticsearch-analysis-vietnamese

Vietnamese Analysis Plugin for Elasticsearch
Apache License 2.0

Suggestion: better handle combining diacritics #34

Closed Trey314159 closed 7 years ago

Trey314159 commented 7 years ago

Combining characters (including diacritics and other characters in non-Latin scripts) cause tokens to split. Some examples from various scripts:
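A minimal sketch of the effect, assuming a hypothetical letter-only tokenizer (`naive_tokenize` is illustrative and is not the plugin's actual code). It treats any character outside the Unicode letter categories as a delimiter, so a decomposed Vietnamese word breaks at its combining marks, while the same word normalized to NFC stays whole:

```python
import unicodedata


def naive_tokenize(text):
    """Hypothetical tokenizer: keep runs of Unicode letters (category L*).

    Combining marks have category Mn, so they act as delimiters here.
    """
    tokens, current = [], []
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


composed = "Vi\u1ec7t"  # "Việt" with the precomposed letter U+1EC7
decomposed = unicodedata.normalize("NFD", composed)  # e + U+0323 + U+0302

split_tokens = naive_tokenize(decomposed)
whole_tokens = naive_tokenize(unicodedata.normalize("NFC", decomposed))

print(split_tokens)  # the word is broken at the combining marks
print(whole_tokens)  # the word survives as a single token
```

NFC normalization only helps where a precomposed form exists. In scripts whose combining marks have no precomposed equivalents, the tokenizer itself would need to treat combining marks as part of the token.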

duydo commented 7 years ago

I'm closing this issue; it will be fixed in the new tokenizer (#37).