XMUNLP / XMUNMT

An implementation of RNNsearch using TensorFlow
BSD 3-Clause "New" or "Revised" License

Where can I download the dataset and how to prepare training corpus exactly? #1

Closed Jmq14 closed 8 years ago

Jmq14 commented 8 years ago

Hi @XMU-NLPLAB

I am trying to run your code, but I cannot find the dataset or any download link. I tried the openmt15 dataset, but the official registration and data-download link seems to be unavailable now. BTW, in the command

python preprocess.py -d vocab.zh.pkl -v 30000 -b bintext.zh.pkl -p zh.txt

what do vocab.zh.pkl, bintext.zh.pkl and zh.txt represent respectively?

I am a beginner in NMT. Can you offer any information or resources? That would be really helpful!

XMU-NLPLAB commented 8 years ago

Unfortunately, we cannot provide our training corpus. However, you can use the publicly available WMT14 corpus; the download link can be found in the paper "Neural Machine Translation by Jointly Learning to Align and Translate".

To run the code, you need to provide a bilingual training corpus in plain-text format (one sentence per line). The script preprocess.py is copied from Groundhog and is mainly used to build the vocabulary. Suppose the source side of your training corpus is named zh.txt and the target side is named en.txt. Given zh.txt, preprocess.py builds the vocabulary vocab.zh.pkl and the binarized corpus bintext.zh.pkl.
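To make the roles of the three files more concrete, here is a rough sketch of what "build a vocabulary and a binarized corpus from a plain-text file" can look like. This is not the actual preprocess.py from Groundhog. The function names (build_vocab, binarize, preprocess), the reserved <eos> and <unk> ids, and the whitespace tokenization are all assumptions made for illustration.

```python
import pickle
from collections import Counter


def build_vocab(lines, vocab_size):
    """Map the most frequent whitespace-separated tokens to integer ids."""
    counts = Counter(tok for line in lines for tok in line.split())
    # Ids 0 and 1 are reserved for end-of-sentence and unknown words.
    # This reservation is an assumption of the sketch, not a documented format.
    vocab = {"<eos>": 0, "<unk>": 1}
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for tok, _ in ranked:
        if len(vocab) >= vocab_size:
            break
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def binarize(lines, vocab):
    """Replace every token with its id; out-of-vocabulary tokens become <unk>."""
    unk = vocab["<unk>"]
    return [[vocab.get(tok, unk) for tok in line.split()] for line in lines]


def preprocess(text_path, vocab_path, bin_path, vocab_size):
    """Read one sentence per line, then pickle the vocabulary and the id lists."""
    with open(text_path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    vocab = build_vocab(lines, vocab_size)
    with open(vocab_path, "wb") as f:
        pickle.dump(vocab, f)
    with open(bin_path, "wb") as f:
        pickle.dump(binarize(lines, vocab), f)
    return vocab
```

With this sketch, a call such as preprocess("zh.txt", "vocab.zh.pkl", "bintext.zh.pkl", 30000) would play roughly the same role as the command in the question: zh.txt is the tokenized input text, vocab.zh.pkl holds the word-to-id mapping limited to 30000 entries, and bintext.zh.pkl holds each sentence as a list of ids. The target side (en.txt) would be processed the same way.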