THUNLP-MT / THUMT

An open-source neural machine translation toolkit developed by Tsinghua Natural Language Processing Group
BSD 3-Clause "New" or "Revised" License

On the WMT17 zh-en dataset, the result is not as good as on WMT14 en-de #108

Open QiyaoHuang opened 2 years ago

QiyaoHuang commented 2 years ago

When I train on the WMT14 en-de dataset, I get a BLEU score of 24.5, which matches the score reported in the paper. But when I train the model the same way on WMT17 zh-en, the BLEU score is only 7.0.

The WMT17 zh-en dataset: http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz (files: training/news-commentary-v12.zh-en.en and training/news-commentary-v12.zh-en.zh). Why does this happen, and what can I do?

aseaday commented 2 years ago

How do you tokenize the Chinese corpus?

QiyaoHuang commented 2 years ago

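(Re: "How do you tokenize the Chinese corpus?") I used the tokenization method provided in this project's example template, the same way I did for WMT14 en-de.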
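For context on the tokenization question: Chinese is written without spaces between words, so a tokenizer that relies on whitespace can leave an entire sentence as a single token. The sketch below only illustrates that effect and is not THUMT code. The function names are hypothetical, and character-level splitting is used only as a simple stand-in for a real Chinese word segmenter. Whether this explains the 7.0 BLEU score in this thread is an assumption, not something the thread confirms.

```python
import re

# Matches one character in the CJK Unified Ideographs block.
CJK_CHAR = re.compile(r"([\u4e00-\u9fff])")


def whitespace_tokenize(line):
    """Split on whitespace only, as is common for space-delimited languages."""
    return line.split()


def char_segment(line):
    """Illustrative fallback (hypothetical helper): isolate every CJK character."""
    return CJK_CHAR.sub(r" \1 ", line).split()


en = "I like machine translation"
zh = "我喜欢机器翻译"

print(whitespace_tokenize(en))  # four separate word tokens
print(whitespace_tokenize(zh))  # the whole sentence remains one token
print(char_segment(zh))         # one token per Chinese character
```

If the Chinese side of the training data looks like the second case, it may be worth segmenting it (with a word segmenter or at the character level) before learning subword units, and applying the same processing to the test set.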