NJUNMT-pytorch-DocNMT is the implementation of the paper "Towards Making the Most of Context in Neural Machine Translation" (Zheng et al., IJCAI-PRICAI 2020) and is built on NJUNMT-pytorch, an NMT toolkit. If you use this code, please cite:
@inproceedings{zheng2020towards,
  title={Towards Making the Most of Context in Neural Machine Translation},
  author={Zheng, Zaixiang and Yue, Xiang and Huang, Shujian and Chen, Jiajun and Birch, Alexandra},
  booktitle={IJCAI-PRICAI},
  year={2020}
}
We provide push-button scripts to set up training and inference of our model. Just execute, under the root directory of this repo,
bash ./scripts/train.sh
for training and
bash ./scripts/translate.sh
for decoding. Detailed setups are as follows.
The training datasets used in the paper are TED15, TED17, News, and Europarl (see the example configs below).
Please refer to here to learn how Voita et al. configure and run models on the contrastive dataset.
We suggest using Jieba to tokenize the Chinese corpus and the scripts from mosesdecoder to tokenize non-Chinese corpora.
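As a rough illustration (not part of this repo), tokenizing a Chinese corpus with Jieba's Python API could look like the sketch below; the file names are hypothetical, and non-Chinese corpora would instead go through mosesdecoder's tokenizer.perl.

# Hypothetical sketch: tokenize a Chinese corpus with Jieba, one sentence per line.
# File names are placeholders; non-Chinese text is tokenized with mosesdecoder instead.
import jieba

with open("train.zh", encoding="utf-8") as fin, \
     open("train.tok.zh", "w", encoding="utf-8") as fout:
    for line in fin:
        tokens = jieba.cut(line.strip())  # jieba.cut returns a generator of tokens
        fout.write(" ".join(tokens) + "\n")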
For subword segmentation (BPE), see subword-nmt.
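Purely as an illustration, and assuming subword-nmt's Python API (the learn-bpe / apply-bpe command-line tools do the same job), learning and applying BPE codes could look like this; the file names and the number of merge operations are placeholders, not values prescribed by this repo.

# Hypothetical sketch using subword-nmt's Python API; the learn-bpe/apply-bpe
# command-line tools achieve the same thing. File names and the number of merge
# operations (32000) are placeholders chosen for illustration.
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn BPE codes from the tokenized training corpus.
with open("train.tok.en", encoding="utf-8") as fin, \
     open("bpe.codes", "w", encoding="utf-8") as fout:
    learn_bpe(fin, fout, 32000)

# Apply the learned codes to a tokenized file.
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
with open("train.tok.en", encoding="utf-8") as fin, \
     open("train.bpe.en", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(bpe.process_line(line))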
To generate vocabulary files for both source and target languages, we provide a script, ./data/build_dictionary.py, to build them in JSON format.
See how to use this script by running:
python ./scripts/build_dictionary.py --help
We highly recommend not limiting the number of words here; instead, control the vocabulary size via the config files during training.
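If you want to see roughly what a frequency-sorted JSON vocabulary looks like, the following is a purely illustrative sketch; the exact options and output format of build_dictionary.py are given by its --help, so prefer the provided script.

# Purely illustrative sketch of a frequency-sorted JSON vocabulary; the exact
# structure written by build_dictionary.py may differ, so use the provided script.
import json
from collections import Counter

counter = Counter()
with open("train.bpe.en", encoding="utf-8") as fin:  # hypothetical input file
    for line in fin:
        counter.update(line.split())

# Map each word to (index, frequency), most frequent words first (assumed layout).
vocab = {word: [idx, freq] for idx, (word, freq) in enumerate(counter.most_common())}
with open("vocab.en.json", "w", encoding="utf-8") as fout:
    json.dump(vocab, fout, ensure_ascii=False, indent=1)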
Our model needs to partition the data by document, so the original data must be processed into a legal format. The format of a file containing M documents, with N sentences in each document, is:
sent1_of_doc1 <EOS> <BOS> sent2_of_doc1 <EOS> <BOS> ... <EOS> <BOS> sentN_of_doc1
sent1_of_doc2 <EOS> <BOS> sent2_of_doc2 <EOS> <BOS> ... <EOS> <BOS> sentN_of_doc2
...
sent1_of_docM <EOS> <BOS> sent2_of_docM <EOS> <BOS> ... <EOS> <BOS> sentN_of_docM
Because of limited memory, we partition each original document into groups of up to 20 sentences; in fact, our model supports processing any number of sentences in a document. Please see data_format/dev.en.20.sample for a sample of the data format.
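For example, a corpus stored with one sentence per line and a blank line between documents could be converted into this format with a small helper along the following lines (the helper and file names are ours, not part of the repo):

# Hypothetical helper (not part of the repo): convert a corpus with one sentence
# per line and a blank line between documents into the format above, splitting
# every document into groups of at most 20 sentences.
SEP = " <EOS> <BOS> "

def write_documents(in_path, out_path, group_size=20):
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        doc = []
        for line in list(fin) + [""]:  # the trailing "" flushes the last document
            line = line.strip()
            if line:
                doc.append(line)
                continue
            # Blank line: emit the finished document in chunks of group_size sentences.
            for i in range(0, len(doc), group_size):
                fout.write(SEP.join(doc[i:i + group_size]) + "\n")
            doc = []

write_documents("dev.en", "dev.en.20")  # hypothetical file names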
See examples in the ./configs folder. We provide several examples:
ted15_ours.yaml: run our model on TED15
ted17_ours.yaml: run our model on TED17
news_ours.yaml: run our model on News
euro_ours.yaml: run our model on Europarl
To further learn how to configure an NMT training task, see this wiki page.
We can set up a training task by running:
export CUDA_VISIBLE_DEVICES=0
python -m src.bin.train \
--model_name <your-model-name> \
--reload \
--config_path <your-config-path> \
--log_path <your-log-path> \
--saveto <path-to-save-checkpoints> \
--valid_path <path-to-save-validation-translation> \
--use_gpu
See detailed options by running python -m src.bin.train --help.
During training, checkpoints and best models will be saved under the directory specified by the option --saveto. Suppose the model name is "MyModel"; there will be several files under that directory:
MyModel.ckpt: A text file recording names of all the kept checkpoints
MyModel.ckpt.xxxx: Checkpoint stored in step xxxx
MyModel.best: A text file recording names of all the kept best checkpoints
MyModel.best.xxxx: Best checkpoint stored in step xxxx.
MyModel.best.final: Final best model, i.e., the model achieved best performance on validation set. Only model parameters are kept in it.
When training is over, our code automatically saves the best model. Usually you can just use the final best model, named xxxx.best.final, for translation, since it achieves the best performance on the validation set.
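Since MyModel.best.final keeps only the model parameters, it should load as an ordinary PyTorch checkpoint; below is a minimal sketch under that assumption. In practice you simply pass its path to --model_path of the translation command shown next.

# Minimal sketch (assumption): MyModel.best.final holds only model parameters
# saved with torch.save, so it loads as a regular state dict.
import torch

state_dict = torch.load("save/MyModel.best.final", map_location="cpu")  # hypothetical path
print(len(state_dict), "parameter tensors")
# model.load_state_dict(state_dict)  # 'model' must be built from the same config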
We can translate any text by running:
export CUDA_VISIBLE_DEVICES=0
python -m src.bin.translate \
--model_name <your-model-name> \
--source_path <path-to-source-text> \
--model_path <path-to-model> \
--config_path <path-to-configuration> \
--batch_size <your-batch-size> \
--beam_size <your-beam-size> \
--alpha <your-length-penalty> \
--use_gpu
See detailed options by running python -m src.bin.translate --help.
Our code also supports ensemble decoding. See more options by running python -m src.bin.ensemble_translate --help.