Jingjing-NLP / VOLT

Code for paper "Vocabulary Learning via Optimal Transport for Neural Machine Translation"

The codebase and data are still being uploaded.

VOLT(-py) is a vocabulary learning codebase that allows researchers and developers to automatically generate a vocabulary with suitable granularity for machine translation.
To help readers understand our work better, we have written a blog post in this repo.
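At a high level, VOLT frames vocabulary construction as an optimal transport problem that maximizes the marginal utility of vocabularization (MUV), i.e. the negative slope of length-normalized corpus entropy with respect to vocabulary size. A minimal pure-Python sketch of that score follows; the function names and toy segmentations are illustrative only, not this repository's actual API:

```python
# Illustrative sketch of the MUV score VOLT optimizes.
# `corpus_entropy` and `muv` are hypothetical names, not part of the VOLT codebase.
import math
from collections import Counter

def corpus_entropy(tokens):
    """Shannon entropy of a tokenized corpus, normalized by average token
    length so that vocabularies with longer tokens are comparable."""
    counts = Counter(tokens)
    total = sum(counts.values())
    avg_len = sum(len(t) * c for t, c in counts.items()) / total
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / avg_len

def muv(tokens_small, tokens_large, size_gap):
    """MUV: negative change in entropy per unit of added vocabulary size.
    `tokens_small` / `tokens_large` are the same corpus segmented with the
    smaller and the larger vocabulary; `size_gap` is the size difference."""
    return -(corpus_entropy(tokens_large) - corpus_entropy(tokens_small)) / size_gap

# Toy example: one string segmented under two vocabularies.
seg_small = list("lowerlowest")          # character-level vocabulary
seg_large = ["low", "er", "low", "est"]  # merged subword vocabulary
print(muv(seg_small, seg_large, size_gap=3))
```

The length normalization in the entropy keeps larger vocabularies (which produce fewer, longer tokens) from trivially winning; MUV then trades entropy reduction against vocabulary growth.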

What's New:

What's On-going:

Features:

Requirements and Installation

Required environment:

To use VOLT and develop locally:

git clone https://github.com/Jingjing-NLP/VOLT/
cd VOLT
git clone https://github.com/moses-smt/mosesdecoder.git
git clone https://github.com/rsennrich/subword-nmt.git
pip3 install sentencepiece
pip3 install tqdm 
cd POT
pip3 install --editable ./ -i https://pypi.doubanio.com/simple --user
cd ../

Usage

Examples

We provide several examples under "examples/", including En-De translation, En-Fr translation, multilingual translation, and En-De translation without joint vocabularies.

Datasets

The WMT-14 En-De translation data can be downloaded via the provided running scripts.

For TED X-EN data, you can download it at X-EN. For TED EN-X data, you can download it at EN-X.

Citation

Please cite as:

@inproceedings{volt,
  title = {Vocabulary Learning via Optimal Transport for Neural Machine Translation},
  author = {Jingjing Xu and Hao Zhou and Chun Gan and Zaixiang Zheng and Lei Li},
  booktitle = {Proceedings of ACL 2021},
  year = {2021},
}