The codebase and data are still being uploaded.
VOLT(-py) is a vocabulary learning codebase that allows researchers and developers to automatically generate a vocabulary with suitable granularity for machine translation.
To help readers better understand our work, I have written a blog post in this repo.
The required environment is Python 3; the commands below install the remaining dependencies.
To use VOLT and develop locally:
git clone https://github.com/Jingjing-NLP/VOLT/
cd VOLT
git clone https://github.com/moses-smt/mosesdecoder.git
git clone https://github.com/rsennrich/subword-nmt.git
pip3 install sentencepiece
pip3 install tqdm
cd POT
# the -i flag selects a PyPI mirror; drop it to use the default index
pip3 install --editable ./ -i https://pypi.doubanio.com/simple --user
cd ../
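After installation, you can optionally confirm that the pip packages import cleanly (a hypothetical sanity check, not part of the original instructions):
python3 -c "import sentencepiece, tqdm; print('ok')"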
The first step is to generate vocabulary candidates from tokenized texts. Notice: the tokenized texts should be at the character level; please do not pre-segment your texts with segmentation tools. The sub-word candidates can be generated by subword-nmt or sentencepiece. Here are two examples.
#Assume source_file is the file storing the source-side texts
#Assume target_file is the file storing the target-side texts
size=30000 # the size of the BPE candidate vocabulary
cat source_file > training_data
cat target_file >> training_data
#subword-nmt style
mkdir bpeoutput
BPE_CODE=bpeoutput/code # the path to save the vocabulary
python3 subword-nmt/learn_bpe.py -s $size < training_data > $BPE_CODE
python3 subword-nmt/apply_bpe.py -c $BPE_CODE < source_file > bpeoutput/source.file
python3 subword-nmt/apply_bpe.py -c $BPE_CODE < target_file > bpeoutput/target.file
#sentencepiece style
cd examples
mkdir spmout
python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$size --character_coverage=1.0 --model_type=bpe
sed -i 's/\t/ /g' spm.vocab # replace tabs so the vocabulary can be read as token candidates
python3 spm/spm_encoder.py --model spm.model --inputs source_file --outputs spmout/source.file --output_format piece
python3 spm/spm_encoder.py --model spm.model --inputs target_file --outputs spmout/target.file --output_format piece
* This example shows how to get vocabulary candidates from a single file for non-seq2seq tasks.
size=30000 # the size of BPE
#subword-nmt style
mkdir bpeoutput
BPE_CODE=bpeoutput/code # the path to save the vocabulary
python3 subword-nmt/learn_bpe.py -s $size < source_file > $BPE_CODE
python3 subword-nmt/apply_bpe.py -c $BPE_CODE < source_file > bpeoutput/source.file
#sentencepiece style
cd examples
mkdir spmout
python3 spm/spm_train.py --input=source_file --model_prefix=spm --vocab_size=$size --character_coverage=1.0 --model_type=bpe
python3 spm/spm_encoder.py --model spm.model --inputs source_file --outputs spmout/source.file --output_format piece
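Before running VOLT, you can optionally sanity-check the candidate sets (a hypothetical check, not part of the original instructions):
wc -l $BPE_CODE # the number of learned BPE merge operations (subword-nmt candidates)
head -n 5 spm.vocab # a sample of the sentencepiece token candidates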
The second step is to run the VOLT scripts. ot_run.py accepts the following parameters (as used in the commands below):
* source_file: the segmented source texts from the first step.
* target_file: the segmented target texts (omit for non-seq2seq tasks with only a source file).
* token_candidate_file: the file storing token candidates ($BPE_CODE for subword-nmt, spm.vocab for sentencepiece).
* vocab_file: the path where the generated vocabulary is written.
* max_number: the maximum size of the generated vocabulary.
* interval: the search granularity over vocabulary sizes.
* loop_in_ot: the number of iterations in the optimal-transport solver.
* tokenizer: the toolkit used to get candidates; only subword-nmt and sentencepiece are supported.
* size_file: the path where the optimal vocabulary size is written.
#For seq2seq tasks with a source file and a target file, you can use the following commands:
#subword-nmt style
python3 ../ot_run.py --source_file bpeoutput/source.file --target_file bpeoutput/target.file \
--token_candidate_file $BPE_CODE \
--vocab_file bpeoutput/vocab --max_number 10000 --interval 1000 --loop_in_ot 500 --tokenizer subword-nmt --size_file bpeoutput/size
#sentencepiece style
python3 ../ot_run.py --source_file spmout/source.file --target_file spmout/target.file \
  --token_candidate_file spm.vocab \
  --vocab_file spmout/vocab --max_number 10000 --interval 1000 --loop_in_ot 500 --tokenizer sentencepiece --size_file spmout/size
#For non-seq2seq tasks with only a source file, you can use the following commands:
#subword-nmt style
python3 ../ot_run.py --source_file bpeoutput/source.file \
  --token_candidate_file $BPE_CODE \
  --vocab_file bpeoutput/vocab --max_number 10000 --interval 1000 --loop_in_ot 500 --tokenizer subword-nmt --size_file bpeoutput/size
#sentencepiece style
python3 ../ot_run.py --source_file spmout/source.file \
  --token_candidate_file spm.vocab \
  --vocab_file spmout/vocab --max_number 10000 --interval 1000 --loop_in_ot 500 --tokenizer sentencepiece --size_file spmout/size
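After a run finishes, you can inspect the outputs directly (an optional check, not part of the original pipeline):
cat bpeoutput/size # the optimal vocabulary size found by VOLT (spmout/size for the sentencepiece style)
head bpeoutput/vocab # the vocabulary generated by VOLT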
The third step is to use the generated vocabulary to segment your texts:
#subword-nmt style
echo "#version: 0.2" > bpeoutput/vocab.seg # add version info
echo bpeoutput/vocab >> bpeoutput/vocab.seg
BPEROOT=subword-nmt # the path to the subword-nmt scripts cloned above
python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab.seg < source_file > bpeoutput/source.file
python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab.seg < target_file > bpeoutput/target.file #optional if your task does not contain target texts
#sentencepiece style
#for the sentencepiece toolkit, we only keep the optimal size and retrain the model at that size
best_size=$(cat spmout/size)
#training_data contains the source data and, for seq2seq tasks, the target data
python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$best_size --character_coverage=1.0 --model_type=bpe
python3 spm/spm_encoder.py --model spm.model --inputs source_file --outputs spmout/source.file --output_format piece
python3 spm/spm_encoder.py --model spm.model --inputs target_file --outputs spmout/target.file --output_format piece #optional if your task does not contain target texts
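To verify the segmentation before training (an optional, hypothetical check), confirm the segmented files still have one sentence per line and inspect a sample:
wc -l source_file spmout/source.file # line counts should match
head -n 1 spmout/source.file # the first sentence, segmented into sub-word pieces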
The last step is to use the segmented texts for downstream tasks. You can use the Fairseq repo for training and evaluation. We also provide the training and evaluation code under "examples/". Notice: for a fair BLEU comparison, you need to remove the BPE markers ("remove-bpe") from the generated texts.
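For reference, a minimal fairseq workflow might look like the following. This is a sketch, not our exact setup: the paths, language pair, and architecture are placeholders, and the scripts under "examples/" should be preferred for reproducing our results.
# binarize the segmented texts (assumes files named train.en/train.de etc. in bpeoutput/)
fairseq-preprocess --source-lang en --target-lang de \
  --trainpref bpeoutput/train --validpref bpeoutput/valid --testpref bpeoutput/test \
  --destdir data-bin --joined-dictionary
# train a standard transformer
fairseq-train data-bin --arch transformer --optimizer adam --lr 0.0005 \
  --lr-scheduler inverse_sqrt --warmup-updates 4000 --max-tokens 4096 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --save-dir checkpoints
# generate; --remove-bpe strips the BPE markers before BLEU is computed
fairseq-generate data-bin --path checkpoints/checkpoint_best.pt --beam 5 --remove-bpe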
We provide several examples under "examples/", including En-De translation, En-Fr translation, multilingual translation, and En-De translation without joint vocabularies.
The WMT-14 En-De translation data can be downloaded via the provided running scripts.
For TED X-EN data, you can download it at X-EN; for TED EN-X data, you can download it at EN-X.
Please cite as:
@inproceedings{volt,
  title = {Vocabulary Learning via Optimal Transport for Neural Machine Translation},
  author = {Jingjing Xu and
            Hao Zhou and
            Chun Gan and
            Zaixiang Zheng and
            Lei Li},
  booktitle = {Proceedings of ACL 2021},
  year = {2021},
}