WorksApplications / SudachiPy

Python version of Sudachi, a Japanese tokenizer.
Apache License 2.0

Fix a bug where … is converted to "", "", "…" #121

Closed (sorami closed 4 years ago)

sorami commented 4 years ago

Apply the same fix as the PR for the Java implementation: https://github.com/WorksApplications/Sudachi/pull/118/

Related: #120

When normalization produces more output tokens than the original text contains, assign the original surface to the first output token, not the last.

For example, the current output is:

$ echo … | sudachipy
    補助記号,句点,*,*,*,* .
    補助記号,句点,*,*,*,* .
…   補助記号,句点,*,*,*,* .
EOS

With this fix, the output becomes:

$ echo … | sudachipy
…   補助記号,句点,*,*,*,* .
    補助記号,句点,*,*,*,* .
    補助記号,句点,*,*,*,* .
EOS