This repo contains the code for the paper Fast Nearest Neighbor Machine Translation.
Model | WMT19 De-En | WMT14 En-Fr |
---|---|---|
base MT | 37.6 | 41.1 |
+kNN-MT | 39.1(+1.5) | 41.8(+0.7) |
+fast kNN-MT | 39.3(+1.7) | 41.7(+0.6) |

Model | Medical | IT | Koran | Subtitles | Avg. |
---|---|---|---|---|---|
base MT | 39.9 | 38.0 | 16.3 | 29.2 | 33.8 |
+kNN-MT | 54.4(+14.5) | 45.8(+7.8) | 19.4(+3.1) | 31.7(+2.5) | 42.6(+8.8) |
+fast kNN-MT | 53.6(+13.7) | 45.5(+7.5) | 21.2(+4.9) | 32.1(+2.9) | 41.4(+7.6) |
pip install -r requirements.txt
We use a modified copy of fairseq==0.10.2 to extract the features used in our paper. For details on installation and on how we modified the code, see the corresponding README file.

Each sentence-pair dataset must be preprocessed before use.
The example scripts for preprocessing domain-adaptation/WMT data are listed below:
thirdparty/fairseq/extract_feature_scripts/prepare-domain-adapt_with_pretrained_wmt19.sh
thirdparty/fairseq/extract_feature_scripts/prepare-wmt14en2fr_with_pretrained_wmt14.sh
thirdparty/fairseq/extract_feature_scripts/prepare-wmt19en2de_with_pretrained_wmt19.sh
To find token neighbors on the source side, we do the following steps:

1. Build a Datastore for each token, whose keys are the token representations and whose values are their offsets (sent_id, token_id). Note that, for engineering reasons, the value stored in the Datastore here is the offset rather than the pair of values described in the paper. We use the source-to-target alignments at the decoding stage.
2. Index each Datastore for approximate nearest neighbor (ANN) search.

The example scripts for finding kNN neighbors for domain-adaptation/WMT data are listed below:
fast_knn_nmt/scripts/domain-adapt/find_knn_neighbors.sh
fast_knn_nmt/scripts/wmt-en-fr/find_knn_neighbors.sh
fast_knn_nmt/scripts/wmt19-de-en/find_knn_neighbors.sh
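
The per-token Datastore described above can be illustrated with a small, dependency-free sketch. This is not the repo's implementation: the function names (`build_datastores`, `nearest_offsets`, `aligned_target_tokens`), the toy two-dimensional features, and the dict-based alignment format are all assumptions made for illustration, and an exact brute-force search stands in for the ANN index.

```python
from collections import defaultdict


def build_datastores(source_sentences, source_features):
    """Build one datastore per source token (hypothetical helper).

    Each datastore is a pair (keys, offsets): keys are the token
    representations, offsets are (sent_id, token_id) tuples.
    """
    keys = defaultdict(list)
    offsets = defaultdict(list)
    for sent_id, (tokens, feats) in enumerate(zip(source_sentences, source_features)):
        for token_id, token in enumerate(tokens):
            keys[token].append(feats[token_id])
            offsets[token].append((sent_id, token_id))
    return {token: (keys[token], offsets[token]) for token in keys}


def nearest_offsets(datastore, query, k):
    """Exact squared-L2 search; a brute-force stand-in for ANN search."""
    keys, offsets = datastore
    dists = [sum((a - b) ** 2 for a, b in zip(key, query)) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: dists[i])[:k]
    return [offsets[i] for i in order]


def aligned_target_tokens(hits, alignments, target_sentences):
    """Follow source-to-target alignments from offsets to target tokens.

    alignments[sent_id] is assumed to map a source token_id to a list of
    target token ids (an illustrative format, not the repo's file format).
    """
    result = []
    for sent_id, token_id in hits:
        for tgt_id in alignments[sent_id].get(token_id, []):
            result.append(target_sentences[sent_id][tgt_id])
    return result


# Toy parallel data with made-up 2-d features.
source_sentences = [["das", "haus"], ["das", "auto"]]
target_sentences = [["the", "house"], ["the", "car"]]
source_features = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.1, 0.0], [5.0, 5.0]],
]
alignments = [{0: [0], 1: [1]}, {0: [0], 1: [1]}]

datastores = build_datastores(source_sentences, source_features)
hits = nearest_offsets(datastores["das"], [0.09, 0.0], k=1)
targets = aligned_target_tokens(hits, alignments, target_sentences)
```

Storing offsets rather than target-side values keeps each datastore small; the target tokens are only recovered through the alignments when they are needed at decoding time.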
To convert a pretrained fairseq Seq2Seq checkpoint for inference, use fast_knn_nmt/custom_fairseq/train/transform_ckpt.py
This script changes the task/model name stored in the pretrained fairseq checkpoint and adds a quantizer to the
model.
Note that you should change TRANSFORMER_CKPT and QUANTIZER_PATH to your own paths.
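
As a rough illustration of what such a conversion involves, the sketch below rewrites the task/model name of a checkpoint and merges quantizer parameters into its model state. It is not the code in transform_ckpt.py: the checkpoint layout (a dict with an `args` namespace and a `model` state dict), the function name `transform_checkpoint`, the `quantizer.` key prefix, and the placeholder task/arch names are all assumptions, and plain Python lists stand in for tensors.

```python
import copy
from argparse import Namespace


def transform_checkpoint(ckpt, new_task, new_arch, quantizer_state, prefix="quantizer."):
    """Return a converted copy of `ckpt` (hypothetical helper).

    Assumes `ckpt` is a dict with an `args` Namespace holding `task` and
    `arch`, and a `model` dict of parameters. The input is left unchanged.
    """
    out = copy.deepcopy(ckpt)
    # Point the checkpoint at the custom task and model names.
    out["args"].task = new_task
    out["args"].arch = new_arch
    # Add the quantizer parameters under a dedicated key prefix.
    for name, value in quantizer_state.items():
        out["model"][prefix + name] = value
    return out


# Toy stand-ins for a pretrained checkpoint and a trained quantizer.
pretrained = {
    "args": Namespace(task="translation", arch="transformer"),
    "model": {"encoder.weight": [1.0, 2.0]},
}
quantizer = {"centroids": [[0.0, 0.0], [1.0, 1.0]]}

converted = transform_checkpoint(
    pretrained,
    new_task="hypothetical_knn_task",   # placeholder, not the repo's task name
    new_arch="hypothetical_knn_arch",   # placeholder, not the repo's model name
    quantizer_state=quantizer,
)
```

With a real checkpoint, the dict would typically be read with `torch.load` and written back with `torch.save`; those calls are left out here so that the sketch runs without PyTorch.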
The example scripts of inference for domain-adaptation/WMT data are listed below:
fast_knn_nmt/scripts/domain-adapt/reproduce_${domain}.sh, where domain could be it, medical, koran, law or subtitles.
fast_knn_nmt/scripts/wmt-en-fr/inference.sh
fast_knn_nmt/scripts/wmt19-de-en/inference.sh
Note that you should change USER_DIR, DATA_DIR, OUT_DIR, and DETOKENIZER to your own paths.
@article{meng2021fast,
title={Fast Nearest Neighbor Machine Translation},
author={Meng, Yuxian and Li, Xiaoya and Zheng, Xiayu and Wu, Fei and Sun, Xiaofei and Zhang, Tianwei and Li, Jiwei},
journal={arXiv preprint arXiv:2105.14528},
year={2021}
}
If you have any issues or questions about this repo, feel free to contact yuxian_meng@shannonai.com.