BrightXiaoHan closed this 2 years ago
Thanks, looks nice! Could you please add some simple unit tests (e.g. for moses for en and jieba for zh) here: https://github.com/Helsinki-NLP/OpusFilter/blob/develop/tests/test_preprocessors.py#L71
Use `@unittest.skipIf(...)` to skip the test if the `jieba` library is not available.
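A minimal sketch of that pattern, for illustration only: the class name, test name, and sample sentence are hypothetical and are not the tests added in this PR. It assumes only the standard `unittest` module and jieba's `lcut` function, and checks availability with `importlib.util.find_spec` so the module still imports when jieba is missing.

```python
import importlib.util
import unittest

# True when the optional jieba package can be imported in this environment.
JIEBA_AVAILABLE = importlib.util.find_spec("jieba") is not None


class TestJiebaTokenization(unittest.TestCase):
    """Hypothetical test case showing how to skip on a missing dependency."""

    @unittest.skipIf(not JIEBA_AVAILABLE, "jieba is not installed")
    def test_segments_chinese_text(self):
        # Import inside the test so collecting the module never fails.
        import jieba

        text = "我来到北京清华大学"
        tokens = jieba.lcut(text)
        # Segmentation must keep every character and yield several tokens.
        self.assertEqual("".join(tokens), text)
        self.assertGreater(len(tokens), 1)
```

With this guard the test is reported as skipped rather than failed on machines without jieba, so the dependency can stay optional.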
Ok, test cases were added.
Should I add `jieba` to `requirements.txt` directly?
@BrightXiaoHan: If you would prefer that, I can also wrap this up.
For a Chinese corpus, the tokenization result of `fast-mosestokenizer` is as follows:
Jieba is a popular Chinese tokenizer, which produces the following result:
Using jieba is a better choice for Chinese corpora.