Closed txdat closed 5 years ago
Thanks for the report.
The problem with " " inside a token has been fixed with commit https://github.com/coccoc/coccoc-tokenizer/commit/18a9b9367b70909bc52c98d69023169ff787727d.
The idea of space_positions is to save the result of sticky-text segmentation; it has nothing to do with normal tokens.
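To make the idea concrete, here is a minimal sketch (not the library's actual API; the function name and data shape are assumptions): a sticky token such as "thegioididong" is split into sub-words, and space_positions can be thought of as the character offsets where spaces were inserted during segmentation.

```python
# Hypothetical illustration of space_positions: offsets into the sticky
# text at which the segmenter decided a space belongs.

def apply_space_positions(sticky_text, space_positions):
    """Rebuild the segmented form by inserting spaces at the recorded offsets."""
    parts = []
    prev = 0
    for pos in sorted(space_positions):
        parts.append(sticky_text[prev:pos])
        prev = pos
    parts.append(sticky_text[prev:])
    return " ".join(parts)

print(apply_space_positions("thegioididong", [3, 7, 9]))  # the gioi di dong
```

For a normal token that was never sticky, the offset list is simply empty and the text comes back unchanged.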
Could you provide examples for other issues?
" "
should not be considered as a token (except in for_transforming
mode, which is CocCoc specific). If that happens, it can be a bug.Both segment
and segment_original
shouldn't return any punctuation marks, though segment_original
is expected to keep original text format (case-sensitive). For instance, I have:
segment("Tôi Đăng ký trên theGioididong.vn")
=> {"tôi", "đăng ký", "trên", "the gioi", "di dong", "vn"}
segment_original("Tôi Đăng ký trên theGioididong.vn")
=> {"Tôi", "Đăng_ký", "trên", "the_Gioi", "di_dong", "vn"}
The example above shows what ends up in segment_original's return vector. It also shows that one can recover the token separators from the token endpoints.
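The recovery of separators from endpoints can be sketched as follows (a simplified ASCII version of the example above; the (start, end) span representation is an assumption, not the library's actual return type): the separator between two consecutive tokens is just the substring of the original text between one token's end and the next token's start.

```python
# Sketch: each token is represented by a (start, end) offset span into the
# original string; the separator between consecutive tokens is the substring
# between one span's end and the next span's start.

def separators(original, spans):
    """Return the text between consecutive (start, end) token spans."""
    seps = []
    for (_, end_prev), (start_next, _) in zip(spans, spans[1:]):
        seps.append(original[end_prev:start_next])
    return seps

text = "Toi Dang ky tren theGioididong.vn"
spans = [(0, 3), (4, 11), (12, 16), (17, 30), (31, 33)]
print(separators(text, spans))  # [' ', ' ', ' ', '.']
```

So even though the tokens themselves contain no punctuation, the original spacing and punctuation are never lost: they can always be read back out of the source string via the endpoints.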
I have some misunderstanding about segmentation:
- Should " " be considered a token? I think it is meaningless.
- The segment method keeps case sensitivity but removes punctuation.
- The segment_original method makes the text case-insensitive but keeps punctuation.
- A " " inside a token cannot be changed to "_" (space_positions is empty).
Can you explain? Thanks!