coccoc / coccoc-tokenizer

high performance tokenizer for Vietnamese language
GNU Lesser General Public License v3.0
393 stars 123 forks

misunderstanding about segment #3

Closed. txdat closed this issue 5 years ago

txdat commented 5 years ago

There are a few things I don't understand about segmentation.

Could you explain? Thanks!

anhducle98 commented 5 years ago

Thanks for the report. The problem with " " appearing inside a token was fixed in commit https://github.com/coccoc/coccoc-tokenizer/commit/18a9b9367b70909bc52c98d69023169ff787727d. The purpose of space_positions is to store the result of sticky-text segmentation (splitting text that was written without spaces); it has nothing to do with normal tokens.
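To illustrate the idea only, here is a minimal Python sketch of how a list of space positions could be applied to sticky text. It does not call the tokenizer: the helper name `apply_space_positions` is hypothetical, and the assumption that each position is a character offset before which a space is inserted is mine, not something confirmed by the library.

```python
def apply_space_positions(text, space_positions):
    """Split `text` at each offset in `space_positions` and join with spaces.

    Assumption (not taken from the library): every position is a character
    offset into `text`, and a space belongs immediately before that offset.
    """
    pieces = []
    prev = 0
    for pos in sorted(space_positions):
        pieces.append(text[prev:pos])
        prev = pos
    pieces.append(text[prev:])
    # Drop empty pieces so offsets at 0 or len(text) add no stray spaces.
    return " ".join(p for p in pieces if p)


# Hypothetical sticky input: "toi song o ha noi" typed without spaces.
sticky = "toisongohanoi"
print(apply_space_positions(sticky, [3, 7, 8, 10]))  # toi song o ha noi
```

Under this reading, the positions describe only where sticky text should be split, which would be why they are kept separate from the normal token list.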

Could you provide examples for other issues?