Where can I download the dataset, and how exactly do I prepare the training corpus?

XMUNLP / XMUNMT

An implementation of RNNsearch using TensorFlow

BSD 3-Clause "New" or "Revised" License

Unfortunately, we cannot provide our training corpus. However, you can use the publicly available WMT14 corpus; the download link can be found in the paper "Neural Machine Translation by Jointly Learning to Align and Translate". To run the code, you need to provide a bilingual training corpus in plain text format (one sentence per line). The script preprocess.py is copied from Groundhog and is mainly used to build the vocabulary. Suppose the source side of your training corpus is named zh.txt and the target side is named en.txt. Given zh.txt, preprocess.py can build the vocabulary vocab.zh.pkl and the binarized corpus bintext.
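Before running preprocess.py, it can help to confirm that the two sides of the corpus are line-aligned and to get a feel for what a vocabulary file contains. The following is only a minimal sketch under stated assumptions: it is not the code of preprocess.py, and the helper names `check_parallel`, `build_vocab`, and `save_vocab` are hypothetical. It assumes whitespace-tokenized UTF-8 text and assigns ids by descending frequency; the real script may reserve ids for special tokens and may use a different pickle layout.

```python
import collections
import pickle


def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(src_path, tgt_path):
    """Return the sentence count, or raise ValueError if the sides differ."""
    n_src = count_lines(src_path)
    n_tgt = count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(
            "corpus is not parallel: %d source lines vs %d target lines"
            % (n_src, n_tgt)
        )
    return n_src


def build_vocab(path, max_size=None):
    """Map whitespace-separated tokens to integer ids, most frequent first.

    Illustrative only: no ids are reserved for special tokens here.
    """
    counts = collections.Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    # Sort by descending frequency, then by token, so ids are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if max_size is not None:
        ranked = ranked[:max_size]
    return {token: idx for idx, (token, _) in enumerate(ranked)}


def save_vocab(vocab, out_path):
    """Pickle the token-to-id dictionary to out_path."""
    with open(out_path, "wb") as f:
        pickle.dump(vocab, f)
```

With the file names used above, `check_parallel("zh.txt", "en.txt")` would raise an error if a sentence was dropped on one side, which is much easier to diagnose at this point than after training has started.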

Where can I download the dataset and how to prepare training corpus exactly? #1