This repository provides the scripts, including preprocessing and training, for our WMT21 paper Back-translation for Large-Scale Multilingual Machine Translation.
The installation instructions are borrowed from fairseq. In case of version problems, we also offer the fairseq version we trained with.
git clone https://github.com/BaohaoLiao/multiback.git
cd multiback
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
--global-option="--deprecated_fused_adam" --global-option="--xentropy" \
--global-option="--fast_multihead_attn" ./
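After installation, a quick sanity check confirms that fairseq and the compiled apex extensions import cleanly (a minimal sketch; the exact apex module paths shown are assumptions and may differ across apex versions):
# check that fairseq and the fused apex extensions are importable
python -c "import fairseq; print(fairseq.__version__)"
python -c "from apex.optimizers import FusedAdam; from apex.normalization import FusedLayerNorm; print('apex OK')"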
All data and pretrained models are available on the challenge page. We mainly show how to process the data for Small Task #2. For the data of Small Task #1, just modify the lines marked with "# TODO" in the Small Task #2 scripts.
mkdir pretrained_models
cd pretrained_models
wget https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz
tar -zxvf flores101_mm100_615M.tar.gz
rm flores101_mm100_615M.tar.gz
Pretrained Model (name in our paper) | Original Name | Download |
---|---|---|
Trans_small | flores101_mm100_175M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_175M.tar.gz |
Trans_base | flores101_mm100_615M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz |
Trans_big | m2m_100 | https://dl.fbaipublicfiles.com/m2m_100/1.2B_last_checkpoint.pt |
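The other checkpoints in the table can be fetched the same way. For example (note that Trans_big is a single checkpoint file, not a tarball):
# Trans_small
wget https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_175M.tar.gz
tar -zxvf flores101_mm100_175M.tar.gz
rm flores101_mm100_175M.tar.gz
# Trans_big (single .pt checkpoint, no extraction needed)
wget https://dl.fbaipublicfiles.com/m2m_100/1.2B_last_checkpoint.pt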
Download datasets.
mkdir data
cd data
# evaluation set
wget https://dl.fbaipublicfiles.com/flores101/dataset/flores101_dataset.tar.gz
tar -xzvf flores101_dataset.tar.gz
rm flores101_dataset.tar.gz
# training set
wget https://data.statmt.org/wmt21/multilingual-task/small_task2_filt_v2.tar.gz
tar -xzvf small_task2_filt_v2.tar.gz
rm small_task2_filt_v2.tar.gz
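A quick look at the extracted directories helps verify the downloads (a hedged check; the exact layout inside the archives is an assumption based on the usual FLORES-101 release):
ls                       # evaluation and training data should now sit side by side
ls flores101_dataset     # typically contains dev/ and devtest/ splits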
Process parallel data
cd data_scripts
# process evaluation set
bash processEvaluationSetForSmallTask2.sh
# process training set
python concatenate.py # Concatenate the files with the same translation directions
bash processTrainSetForSmallTask2.sh
Note: for Trans_big, you need to process the data as described in https://github.com/facebookresearch/fairseq/tree/main/examples/m2m_100.
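Roughly, that m2m_100 pipeline encodes the raw text with the model's SentencePiece model and then binarizes it with fairseq-preprocess. A minimal sketch under assumed paths and file names (spm.128k.model, data_dict.128k.txt, raw/, spm/ and data_bin/ are placeholders; follow the linked example for the exact commands):
SRC=xx   # replace with the actual source language code
TGT=yy   # replace with the actual target language code
# encode raw parallel text with the model's SentencePiece model
python fairseq/scripts/spm_encode.py \
    --model spm.128k.model --output_format=piece \
    --inputs raw/train.$SRC-$TGT.$SRC raw/train.$SRC-$TGT.$TGT \
    --outputs spm/train.$SRC-$TGT.$SRC spm/train.$SRC-$TGT.$TGT
# binarize with the model's fixed dictionary
fairseq-preprocess \
    --source-lang $SRC --target-lang $TGT \
    --trainpref spm/train.$SRC-$TGT \
    --thresholdsrc 0 --thresholdtgt 0 \
    --srcdict data_dict.128k.txt --tgtdict data_dict.128k.txt \
    --destdir data_bin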
All training scripts are in train_scripts.
cd train_scripts
bash transBaseForSmallTask2ParallelData.sh
The table below lists the number of GPUs used for each script. If you don't have enough GPUs, increase --update-freq accordingly so that the effective batch size matches our setting (see the sketch after the table). We did not really tune the hyper-parameters; they are mainly borrowed from the fairseq examples.
Task | Model | Script | #GPU | #epoch |
---|---|---|---|---|
Small Task #2 | Trans_small | transSmallForSmallTask2ParallelData.sh | 32 | 1 |
Small Task #2 | Trans_base | transBaseForSmallTask2ParallelData.sh | 32 | 2 |
Small Task #2 | Trans_big | transBigForSmallTask2ParallelData.sh | 128 | 2 |
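The effective batch size is proportional to #GPUs x --max-tokens x --update-freq, so a 32-GPU run can be matched on, say, 8 GPUs by multiplying --update-freq by 4. The sketch below only illustrates where the flag sits in a fairseq-train call; it is not our actual training script, which fine-tunes the pretrained multilingual checkpoints with its own hyper-parameters:
# 8 GPUs with --update-freq 4 gives the same effective batch size as 32 GPUs with --update-freq 1
fairseq-train data_bin \
    --arch transformer_wmt_en_de \
    --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --update-freq 4 \
    --save-dir checkpoints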
Evaluation uses SentencePiece-tokenized BLEU (spBLEU), which requires this sacrebleu branch:
git clone --single-branch --branch adding_spm_tokenized_bleu https://github.com/ngoyal2707/sacrebleu.git
cd sacrebleu
python setup.py install
cd generationAndEvaluation_scripts
bash generateAndEvaluateForSmallTask2.sh
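Under the hood, the evaluation scores detokenized hypotheses against the FLORES references with SentencePiece-tokenized BLEU. A minimal sketch (the file names are placeholders; the script above handles generation and scoring for all language pairs):
# score system output against a FLORES reference with the spm tokenizer
cat hypotheses.txt | sacrebleu reference.devtest --tokenize spm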
Please cite as:
@inproceedings{liao-etal-2021-back,
title = "Back-translation for Large-Scale Multilingual Machine Translation",
author = "Liao, Baohao and
Khadivi, Shahram and
Hewavitharana, Sanjika",
booktitle = "Proceedings of the Sixth Conference on Machine Translation",
month = nov,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.wmt-1.50",
pages = "418--424",
}