THE UNIVERSITY OF EDINBURGH'S WMT17 SYSTEMS


This directory contains some of the University of Edinburgh's submissions to the WMT17 shared translation task, and a 'training' directory with scripts to preprocess and train your own model.

If you are accessing this through a git repository, it will contain all scripts and documentation, but no model files; the models are available at http://data.statmt.org/wmt17_systems

Use the git repository to keep track of changes to this directory: https://github.com/EdinburghNLP/wmt17-scripts

REQUIREMENTS

The models rely on external software; in particular, translation is performed with Nematus (https://github.com/EdinburghNLP/nematus).

Please set the appropriate paths in the 'vars' file.

DOWNLOAD INSTRUCTIONS

You can download all files in this directory with the following command:

wget -r -e robots=off -nH -np -R "index.html*" http://data.statmt.org/wmt17_systems/

To download just one language pair (such as en-de), execute:

wget -r -e robots=off -nH -np -R "index.html*" http://data.statmt.org/wmt17_systems/en-de/

To download just a single model (approx. 2 GB) and the corresponding translation scripts, ignoring ensembles and reranking, execute:

wget -r -e robots=off -nH -np -R "*ens2*" -R "*ens3*" -R "*ens4*" -R "*r2l*" \
     -R tf-translate-ensemble.sh -R tf-translate-reranked.sh -R "index.html*" \
     http://data.statmt.org/wmt17_systems/en-de/

If you only download selected language pairs or models, you should also download these shared files:

wget -r -e robots=off -nH -np -R "index.html*" http://data.statmt.org/wmt17_systems/scripts/ http://data.statmt.org/wmt17_systems/vars
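
If you need several language pairs at once, the per-pair download can be wrapped in a small loop together with the shared files; the language pairs listed below are placeholders, so substitute the ones you need:

# download the listed language pairs, then the shared scripts and the 'vars' file
for pair in en-de de-en; do
    wget -r -e robots=off -nH -np -R "index.html*" "http://data.statmt.org/wmt17_systems/$pair/"
done
wget -r -e robots=off -nH -np -R "index.html*" http://data.statmt.org/wmt17_systems/scripts/ http://data.statmt.org/wmt17_systems/vars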

USAGE INSTRUCTIONS: PRE-TRAINED MODELS

First, ensure that all requirements are present and that the paths in the 'vars' file are up to date. If you want to decode on a GPU, also update the 'device' variable in that file.
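
For reference, a minimal sketch of a 'vars' file is shown below; only $nematus_home and 'device' are mentioned in this README, so treat the other details as assumptions and use the downloaded file as the authoritative template.

# assumed sketch of 'vars' - check the downloaded file for the real variable names and values
nematus_home=/path/to/nematus   # local checkout of https://github.com/EdinburghNLP/nematus
device=cpu                      # change to a GPU device identifier to decode on a GPU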

Each language-pair subdirectory comes with several scripts named tf-translate-*.sh.

For translation with a single model, execute:

./tf-translate-single.sh < your_input_file > your_output_file

The input should be UTF-8 plain text in the source language, one sentence per line.
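
For example, a quick end-to-end check of the en-de single-model script could look like this (the file names are placeholders):

# translate one English sentence with the en-de single model
cd en-de
echo "This is a test sentence." > input.en
./tf-translate-single.sh < input.en > output.de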

We also provide ensembles of left-to-right models:

./tf-translate-ensemble.sh < your_input_file > your_output_file

For some language pairs, we built systems that use right-to-left models for reranking:

./tf-translate-reranked.sh < your_input_file > your_output_file

We used systems that include ensembles and right-to-left reranking for our official submissions; results may vary slightly from the official submissions due to post-submission improvements - see the shared task description for more details.

USAGE INSTRUCTIONS: TRAINING SCRIPTS

To train your own models, follow the instructions in training/README.md.

LEGACY MODELS: THEANO

All models for WMT17 were trained with a legacy version of Nematus, based on Theano. They have been converted to run with the current TensorFlow codebase of Nematus.

To run the original Theano model files, install the Theano version of Nematus and set the corresponding $nematus_home path in the 'vars' file:

https://github.com/EdinburghNLP/nematus/tree/theano
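
For example, the Theano branch can be obtained with a plain git clone (installing its dependencies is not covered here):

# clone the Theano branch of Nematus and point $nematus_home in 'vars' at it
git clone -b theano https://github.com/EdinburghNLP/nematus.git nematus-theano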

The translate scripts ('translate-*') without the 'tf-' prefix can be used to translate with the Theano models and codebase.
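
For example, the Theano counterpart of the single-model translation above is invoked in the same way:

./translate-single.sh < your_input_file > your_output_file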

LICENSE

All scripts in this directory are distributed under the MIT license.

The use of the models provided in this directory is permitted under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/

Attribution - You must give appropriate credit [please use the citation below], provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

NonCommercial - You may not use the material for commercial purposes.

ShareAlike - If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

REFERENCE

The models are described in the following publication:

Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams (2017). "The University of Edinburgh’s Neural MT Systems for WMT17". In: Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers. Copenhagen, Denmark.

@inproceedings{uedin-nmt:2017,
    address = "Copenhagen, Denmark",
    author = "Sennrich, Rico and Birch, Alexandra and Currey, Anna and 
              Germann, Ulrich and Haddow, Barry and Heafield, Kenneth and 
              {Miceli Barone}, Antonio Valerio and Williams, Philip",
    booktitle = "{Proceedings of the Second Conference on Machine Translation, 
                 Volume 2: Shared Task Papers}",
    title = "{The University of Edinburgh's Neural MT Systems for WMT17}",
    year = "2017"
}