add jieba tokenizer for Chinese corpus.

BrightXiaoHan commented 2 years ago

For a Chinese corpus, the tokenization result from fast-mosestokenizer is as follows:

origin: '同时，祖马革命的一代似乎对领导打破种族隔离制度15年后的南非，还不适应。 在一个尊敬老人的地区，祖马对其乡下传统的依恋，必须和对南非年青人的爱好所持的平等而开放态度相平衡。'

tokenized: '同时 ， 祖马革命的一代似乎对领导打破种族隔离制度 15 年后的南非 ， 还不适应 。 在一个尊敬老人的地区 ， 祖马对其乡下传统的依恋 ， 必须和对南非年青人的爱好所持的平等而开放态度相平衡 。'

Jieba is a popular Chinese tokenizer which will produce the following result:

'同时 ， 祖马 革命 的 一代 似乎 对 领导 打破 种族隔离 制度 15 年 后 的 南非 ， 还 不 适应 。   在 一个 尊敬 老人 的 地区 ， 祖马 对 其 乡下 传统 的 依恋 ， 必须 和 对 南非 年青人 的 爱好 所持 的 平等 而 开放 态度 相平衡 。'

Using jieba is a better choice for Chinese corpora.
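
For reference, a minimal sketch of how output like the above can be produced, assuming the segments returned by `jieba.cut` are simply joined with spaces. The helper name `segment_zh` is hypothetical and is not part of this pull request; jieba is assumed to be installed separately.

```python
def segment_zh(text):
    """Hypothetical helper: segment Chinese text with jieba, space-separated."""
    import jieba  # third-party; assumed installed, e.g. via `pip install jieba`

    # jieba.cut yields word segments in their original order. Whitespace in
    # the input comes back as its own segment, so joining with spaces can
    # leave runs of several spaces in the result.
    return " ".join(jieba.cut(text))


if __name__ == "__main__":
    print(segment_zh("同时，祖马革命的一代似乎对领导打破种族隔离制度15年后的南非，还不适应。"))
```

The import sits inside the function so that a module containing this helper can still be loaded when jieba is missing.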

svirpioj commented 2 years ago

Thanks, looks nice! Could you please add a simple unit test (e.g. moses for en and jieba for zh) here: https://github.com/Helsinki-NLP/OpusFilter/blob/develop/tests/test_preprocessors.py#L71

Use @unittest.skipIf(...) to ignore the test if the jieba library is not available.
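
A rough sketch of what such a guarded test could look like, assuming availability is detected with `importlib.util.find_spec`. The helper `jieba_tokenize` and the test class name are hypothetical stand-ins; the real test would call the preprocessor added in this pull request instead.

```python
import importlib.util
import unittest

# True when the optional jieba package can be imported in this environment.
JIEBA_AVAILABLE = importlib.util.find_spec("jieba") is not None


def jieba_tokenize(text):
    """Hypothetical helper: return jieba's non-whitespace word segments."""
    import jieba  # imported lazily so this module loads without jieba

    return [tok for tok in jieba.cut(text) if tok.strip()]


class TestJiebaTokenizer(unittest.TestCase):

    @unittest.skipIf(not JIEBA_AVAILABLE, "jieba is not installed")
    def test_zh_is_segmented(self):
        text = "祖马对其乡下传统的依恋"
        tokens = jieba_tokenize(text)
        # Segmentation must keep every character and split the sentence.
        self.assertEqual("".join(tokens), text)
        self.assertGreater(len(tokens), 1)


if __name__ == "__main__":
    unittest.main()
```

When jieba is missing, the test is reported as skipped rather than failed, so the suite still passes on installations without the optional dependency.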

BrightXiaoHan commented 2 years ago

Ok, test cases were added.

BrightXiaoHan commented 2 years ago

Should I add jieba to requirements.txt directly?

svirpioj commented 2 years ago

@BrightXiaoHan: If you would prefer that, I can also wrap this up.

svirpioj commented 2 years ago

Replaced by https://github.com/Helsinki-NLP/OpusFilter/pull/27

Helsinki-NLP / OpusFilter

add jieba tokenizer for Chinese corpus. #23