Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License

fix jieba tokenize and detokenize funcs. #48

Closed: BrightXiaoHan closed this 2 years ago

BrightXiaoHan commented 2 years ago

The Chinese tokenizer jieba segments Chinese text into a list of tokens. For example:

我是一个中国人,我会说英语Hello world!

The tokenization result is as follows:

import jieba
result = list(jieba.cut("我是一个中国人,我会说英语Hello world!"))

Output:

["我是", "一个", "中国人", ",",  "我",  "会", "说", "英语", "Hello",  " ",  "world", "!"]

The space between "Hello" and "world" is treated as a token of its own. When the tokens are joined back into a string with spaces, it looks like this:

我是 一个 中国人 ,  我 会 说 英语 Hello   world !

The old detokenize code simply splits the string on spaces and concatenates the pieces, so the original space between "Hello" and "world" disappears:

我是一个中国人,我会说英语Helloworld!
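
Put together, the round trip can be reproduced in a few lines. This only mimics the split-and-join behaviour described above, not the actual OpusFilter code:

# Token list produced by jieba for the example sentence
tokens = ["我是", "一个", "中国人", ",", "我", "会", "说", "英语", "Hello", " ", "world", "!"]

# Tokenized output: tokens joined with single spaces
joined = " ".join(tokens)

# Old detokenization: split on spaces and concatenate, which also drops
# the whitespace token between "Hello" and "world"
detokenized = "".join(joined.split(" "))
print(detokenized)  # 我是一个中国人,我会说英语Helloworld!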

This PR fixes this.

svirpioj commented 2 years ago

Thanks for noticing the problem and implementing a fix! The same problem also applies to the MeCab tokenizer. However, the proposed solution is not very general and may mess things up for more complicated input (multiple spaces, or spaces as the first or last character). I made an alternative solution in https://github.com/Helsinki-NLP/OpusFilter/pull/50; I hope it looks useful.
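
For reference, one general way to make the round trip lossless is to escape any spaces inside tokens with a marker character before joining, and to restore them after splitting. The sketch below only illustrates that idea and is not the code from PR #50; the marker character is an arbitrary choice and is assumed not to occur in the input text.

import jieba

MARKER = "\u2581"  # "▁": stands in for literal spaces inside tokens (assumed absent from the input)

def tokenize(text):
    # Escape spaces inside tokens so the joining spaces stay unambiguous
    return " ".join(token.replace(" ", MARKER) for token in jieba.cut(text))

def detokenize(tokenized):
    # Remove the joining spaces, then restore the escaped spaces
    return tokenized.replace(" ", "").replace(MARKER, " ")

text = "我是一个中国人,我会说英语Hello world!"
assert detokenize(tokenize(text)) == text

This handles multiple spaces and leading or trailing spaces as well, since whitespace tokens are escaped like any other token.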

svirpioj commented 2 years ago

Replaced by https://github.com/Helsinki-NLP/OpusFilter/pull/50