Closed by Jmq14 8 years ago
Unfortunately, we cannot provide our training corpus. However, you can use the publicly available WMT14 corpus; the download link can be found in the paper "Neural Machine Translation by Jointly Learning to Align and Translate".
To run the code, you need to provide a bilingual training corpus in plain text format (one sentence per line). The script preprocess.py is copied from Groundhog. It is mainly used to build the vocabulary. Suppose the source side of your training corpus is named zh.txt and the target side is named en.txt. Given zh.txt, the script preprocess.py builds the vocabulary vocab.zh.pkl and the binarized corpus bintext.
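To make "vocabulary" and "binarized corpus" more concrete, here is a minimal sketch of the idea in Python. It is not the actual preprocess.py from Groundhog: the function names, the special tokens `<eos>` and `<unk>`, their ids, and the pickle layout are all assumptions made for illustration. The vocabulary maps each of the most frequent words to an integer id, and the binarized corpus is the same text with every word replaced by its id.

```python
import pickle
from collections import Counter


def build_vocab(lines, vocab_size):
    """Map the most frequent whitespace-separated tokens to integer ids."""
    counts = Counter(tok for line in lines for tok in line.split())
    # Assumption: ids 0 and 1 are reserved for end-of-sentence and unknown words.
    vocab = {"<eos>": 0, "<unk>": 1}
    for word, _ in counts.most_common(vocab_size - len(vocab)):
        vocab[word] = len(vocab)
    return vocab


def binarize(lines, vocab):
    """Replace every token with its id; out-of-vocabulary tokens become <unk>."""
    unk = vocab["<unk>"]
    return [[vocab.get(tok, unk) for tok in line.split()] for line in lines]


def preprocess(text_path, vocab_path, bin_path, vocab_size):
    """Hypothetical stand-in for the script: write a vocabulary and a binarized corpus."""
    with open(text_path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    vocab = build_vocab(lines, vocab_size)
    with open(vocab_path, "wb") as f:
        pickle.dump(vocab, f)
    with open(bin_path, "wb") as f:
        pickle.dump(binarize(lines, vocab), f)
```

With this sketch, `preprocess("zh.txt", "vocab.zh.pkl", "bintext.zh.pkl", 30000)` would play roughly the role of the command with `-v 30000`: zh.txt is the input text, vocab.zh.pkl the word-to-id dictionary limited to 30000 entries, and bintext.zh.pkl the corpus stored as lists of ids.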
Hi @XMU-NLPLAB
I am trying to run your code but I cannot find the dataset or any download link. I've tried the openmt15 dataset but it seems the official registration and data-download link is not available now. BTW, in the command
python preprocess.py -d vocab.zh.pkl -v 30000 -b bintext.zh.pkl -p zh.txt
what do vocab.zh.pkl, bintext.zh.pkl and zh.txt represent, respectively? I am a beginner in NMT. Can you offer any information or resources? That would be really helpful!