Closed · BrightXiaoHan closed this 2 years ago
Thanks for noticing the problem and implementing a fix! The same problem also applies to the MeCab tokenizer. However, the proposed solution is not very general and may mess things up for more complicated input (multiple spaces, or spaces as the first or last character). I made an alternative solution in https://github.com/Helsinki-NLP/OpusFilter/pull/50; I hope it looks useful.
The Chinese tokenizer jieba tokenizes Chinese text and returns a list object, in which any whitespace in the input is kept as tokens of its own. For example, the tokenized result looks as follows:
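A minimal sketch of the behaviour; the input string here is an assumed example for illustration, not the original one from this PR:

```python
import jieba

# Tokenize a mixed English/Chinese sentence (assumed example input).
tokens = jieba.lcut("Hello world 你好世界")
print(tokens)
# typically: ['Hello', ' ', 'world', ' ', '你好', '世界']
# note that the space between "Hello" and "world" becomes a token
```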
The space between "Hello" and "world" will be treated as a word of its own. When the tokens are joined back into a string, it looks like this:
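Continuing with the same assumed input, the space-joined string would come out roughly as:

```python
import jieba

tokens = jieba.lcut("Hello world 你好世界")  # assumed example input
joined = " ".join(tokens)
print(repr(joined))
# typically: 'Hello   world   你好 世界'
# the space tokens merge with the join separators into runs of spaces
```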
The old detokenize code just split the string by spaces and joined the pieces back together, so the original space between "Hello" and "world" disappears. This PR fixes that.
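A rough sketch of why the old approach loses the space; this is a reconstruction of the behaviour described above, not the actual OpusFilter implementation:

```python
def old_detokenize(text: str) -> str:
    # Reconstruction of the described behaviour (not the real code):
    # split on spaces and join the pieces with nothing in between.
    return "".join(text.split(" "))

print(old_detokenize("Hello   world   你好 世界"))
# 'Helloworld你好世界': the genuine space between "Hello" and "world" is gone
```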