
Towards Making the Most of Context in NMT (IJCAI 2020)



License: MIT

NJUNMT-pytorch-DocNMT is the implementation of the paper "Towards Making the Most of Context in Neural Machine Translation", and is based on NJUNMT-pytorch, an NMT toolkit.


Bibtex

@inproceedings{zheng2020towards,
  title={Towards Making the Most of Context in Neural Machine Translation},
  author={Zheng, Zaixiang and Yue, Xiang and Huang, Shujian and Chen, Jiajun and Birch, Alexandra},
  booktitle={IJCAI-PRICAI},
  year={2020}
}

Requirements

Usage

0. Quick Start

We provide push-button scripts to set up training and inference of our model. Just execute the following under the root directory of this repo:

bash ./scripts/train.sh

for training and

bash ./scripts/translate.sh

for decoding. Detailed setups are as follows.

1. Data Preprocessing

1.1 Download Data

The training datasets used in the paper are listed below:

ZH-EN

IWSLT2015 (TED15)

EN-DE

IWSLT2017 (TED17)

News Commentary v11 (News)

Europarl v7

EN-RU

Training data

Contrastive test sets

Please refer to here to learn how Voita et al. configure and run models on the contrastive test sets.

1.2 Tokenization

We suggest using Jieba to tokenize Chinese corpora and the scripts from mosesdecoder to tokenize non-Chinese corpora.
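To make the expected output concrete, here is a very rough sketch of what tokenization produces: punctuation split off from words, tokens joined by single spaces. The helper `rough_tokenize` is purely illustrative and is not a substitute for Jieba or mosesdecoder's `tokenizer.perl`:

```python
import re

def rough_tokenize(text):
    """Very rough tokenizer, only to illustrate the expected output
    (tokens separated by spaces, punctuation split off). Use Jieba
    for Chinese and mosesdecoder's tokenizer.perl otherwise."""
    # \w+ keeps word characters together; any other non-space
    # character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(" ".join(rough_tokenize("Hello, world!")))
# Hello , world !
```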

1.3 Byte-Pair Encoding (Optional)

See subword-nmt.
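The idea behind BPE is simple to illustrate: starting from character sequences, repeatedly merge the most frequent pair of adjacent symbols. The toy learner below (the helper `learn_bpe` is hypothetical, not part of subword-nmt) sketches only the merge-learning step; use subword-nmt itself for real preprocessing:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE learner: greedily merge the most frequent adjacent
    symbol pair, num_merges times. Illustration only."""
    # Represent each word as space-separated symbols, with counts.
    vocab = Counter(" ".join(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            syms = word.split()
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        merged = Counter()
        for word, freq in vocab.items():
            syms = word.split()
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            merged[" ".join(out)] += freq
        vocab = merged
    return merges

print(learn_bpe(["low", "low", "lower", "newest", "newest"], 2))
```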

1.4 Building Vocabulary

To generate vocabulary files for both the source and target languages, we provide the script ./data/build_dictionary.py, which builds them in JSON format.

See how to use this script by running:

python ./scripts/build_dictionary.py --help

We highly recommend not limiting the number of words here; control the vocabulary size through the config files during training instead.
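The general idea can be sketched as follows (the exact JSON layout produced by build_dictionary.py is an assumption; `build_vocab` is a hypothetical helper): count tokens in the tokenized corpus, rank them by frequency, and write the full, untruncated word-to-id mapping as JSON, leaving any size cap to the training config:

```python
import json
from collections import Counter

def build_vocab(corpus_path, vocab_path):
    """Count tokens in a tokenized corpus and write a {word: id} JSON,
    most frequent word first. Sketch only; the real build_dictionary.py
    may use a different file layout."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    # No size cap here: per the advice above, limit the vocabulary
    # size in the training config instead.
    vocab = {w: i for i, (w, _) in enumerate(counts.most_common())}
    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False, indent=2)
    return vocab
```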

1.5 Documental Data Format for Model Processing

Our model needs to partition the data, so the original data must be converted into the following format. For a file containing M documents with N sentences per document:

sent1_of_doc1 <EOS> <BOS> sent2_of_doc1 <EOS> <BOS> ... <EOS> <BOS> sentN_of_doc1
sent1_of_doc2 <EOS> <BOS> sent2_of_doc2 <EOS> <BOS> ... <EOS> <BOS> sentN_of_doc2
...
sent1_of_docM <EOS> <BOS> sent2_of_docM <EOS> <BOS> ... <EOS> <BOS> sentN_of_docM

Due to limited memory, we partition each original document into groups of up to 20 sentences; in fact, our model supports any number of sentences per document. See data_format/dev.en.20.sample for a sample of the data format.
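The conversion described above can be sketched in a few lines of Python (a minimal illustration, not the repo's own preprocessing script; `to_doc_format` and the list-of-documents input layout are hypothetical):

```python
def to_doc_format(documents, max_sents=20):
    """Join each document's sentences with ' <EOS> <BOS> ' and split long
    documents into groups of at most max_sents sentences, emitting one
    group per output line, as the documental data format requires."""
    lines = []
    for sents in documents:
        for i in range(0, len(sents), max_sents):
            lines.append(" <EOS> <BOS> ".join(sents[i:i + max_sents]))
    return lines

docs = [["sent1_of_doc1", "sent2_of_doc1"], ["sent1_of_doc2"]]
print("\n".join(to_doc_format(docs)))
# sent1_of_doc1 <EOS> <BOS> sent2_of_doc1
# sent1_of_doc2
```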

2. Write Configuration File

See examples in ./configs folder. We provide several examples:

To further learn how to configure an NMT training task, see this wiki page.

3. Training

We can set up a training task by running:

export CUDA_VISIBLE_DEVICES=0
python -m src.bin.train \
    --model_name <your-model-name> \
    --reload \
    --config_path <your-config-path> \
    --log_path <your-log-path> \
    --saveto <path-to-save-checkpoints> \
    --valid_path <path-to-save-validation-translation> \
    --use_gpu

See detailed options by running python -m src.bin.train --help.

During training, checkpoints and the best models will be saved under the directory specified by the option --saveto. Suppose the model name is "MyModel"; there will be several files under that directory:

4. Translation

When training is over, our code will automatically save the best model. Usually you can simply use the final best model, named xxxx.best.final, for translation; it achieves the best performance on the validation set.

We can translate any text by running:

export CUDA_VISIBLE_DEVICES=0
python -m src.bin.translate \
    --model_name <your-model-name> \
    --source_path <path-to-source-text> \
    --model_path <path-to-model> \
    --config_path <path-to-configuration> \
    --batch_size <your-batch-size> \
    --beam_size <your-beam-size> \
    --alpha <your-length-penalty> \
    --use_gpu

See detailed options by running python -m src.bin.translate --help.
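The --alpha option sets the length penalty used to rescore beam-search candidates. A common formulation is the GNMT-style penalty; whether this repo uses exactly this formula is an assumption, but the sketch below shows how alpha trades off translation length (alpha = 0 disables normalization, larger alpha favors longer outputs):

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty: ((5 + |Y|) / 6) ** alpha.
    Assumed formulation; the repo's exact definition may differ."""
    return ((5.0 + length) / 6.0) ** alpha

def normalized_score(log_prob, length, alpha):
    # Beam candidates are ranked by log-probability divided by the
    # penalty, so a larger alpha boosts longer translations.
    return log_prob / length_penalty(length, alpha)
```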

Also, our code supports ensemble decoding. See more options by running python -m src.bin.ensemble_translate --help.