Closed txdat closed 5 years ago
Thanks for the report.
The problem with " " inside a token has been fixed with commit https://github.com/coccoc/coccoc-tokenizer/commit/18a9b9367b70909bc52c98d69023169ff787727d.
The idea of space_positions is to save the result of sticky-text segmentation; it has nothing to do with normal tokens.
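To make the idea concrete, here is a minimal sketch (not the library's actual API; the function name and data shape are assumptions): a sticky token such as "thegioididong" is split into sub-words, and space_positions can be thought of as the character offsets where spaces were inserted during segmentation.

```python
# Hypothetical illustration of space_positions: offsets into the sticky
# text at which the segmenter decided a space belongs.

def apply_space_positions(sticky_text, space_positions):
    """Rebuild the segmented form by inserting spaces at the recorded offsets."""
    parts = []
    prev = 0
    for pos in sorted(space_positions):
        parts.append(sticky_text[prev:pos])
        prev = pos
    parts.append(sticky_text[prev:])
    return " ".join(parts)

print(apply_space_positions("thegioididong", [3, 7, 9]))  # the gioi di dong
```

For a normal token that was never sticky, the offset list is simply empty and the text comes back unchanged.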
Could you provide examples for other issues?
" "
should not be considered as a token (except in for_transforming
mode, which is CocCoc specific). If that happens, it can be a bug.Both segment
and segment_original
shouldn't return any punctuation marks, though segment_original
is expected to keep original text format (case-sensitive). For instance, I have:
segment("Tôi Đăng ký trên theGioididong.vn")
=> {"tôi", "đăng ký", "trên", "the gioi", "di dong", "vn"}
segment_original("Tôi Đăng ký trên theGioididong.vn")
=> {"Tôi", "Đăng_ký", "trên", "the_Gioi", "di_dong", "vn"}
The example above shows what ends up in segment_original's return vector. It also shows that one can recover the token separators from the token endpoints.
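The recovery of separators from endpoints can be sketched as follows (a simplified ASCII version of the example above; the (start, end) span representation is an assumption, not the library's actual return type): the separator between two consecutive tokens is just the substring of the original text between one token's end and the next token's start.

```python
# Sketch: each token is represented by a (start, end) offset span into the
# original string; the separator between consecutive tokens is the substring
# between one span's end and the next span's start.

def separators(original, spans):
    """Return the text between consecutive (start, end) token spans."""
    seps = []
    for (_, end_prev), (start_next, _) in zip(spans, spans[1:]):
        seps.append(original[end_prev:start_next])
    return seps

text = "Toi Dang ky tren theGioididong.vn"
spans = [(0, 3), (4, 11), (12, 16), (17, 30), (31, 33)]
print(separators(text, spans))  # [' ', ' ', ' ', '.']
```

So even though the tokens themselves contain no punctuation, the original spacing and punctuation are never lost: they can always be read back out of the source string via the endpoints.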
I have some misunderstanding about segmentation:
- Should " " be considered a token? I think it is meaningless.
- The segment method keeps case sensitivity but removes punctuation.
- The segment_original method makes the text case-insensitive but keeps punctuation.
- A " " inside a token cannot be changed to "_" (space_positions is empty).
Can you explain? Thanks!