facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

sentencepiece in prepare-iwslt17-multilingual.sh example #3513

Closed ever4244 closed 2 years ago

ever4244 commented 3 years ago

I am trying to train a multilingual translation model following the fairseq iwslt17-multilingual example: https://github.com/pytorch/fairseq/tree/master/examples/translation

In my previous script, I was using fastBPE instead of sentencepiece. After providing the vocab file (obtained from fastBPE), the dictionaries dict.en.txt, dict.de.txt, and dict.fr.txt all come out identical.

# using the vocab and processed files from fastBPE
for lang_pair in de-en de-es de-fr en-es en-fr es-fr; do
    src=`echo $lang_pair | cut -d'-' -f1`
    tgt=`echo $lang_pair | cut -d'-' -f2`
    rm $data_bin/dict.$src.txt $data_bin/dict.$tgt.txt
    fairseq-preprocess --source-lang $src --target-lang $tgt \
        --trainpref $bpe/train.$src-$tgt \
        --joined-dictionary --tgtdict $bpe/vocab \
        --destdir $data_bin \
        --workers 20
done

I want to switch from fastBPE to sentencepiece because some downstream tasks require it. When I tried to use sentencepiece.bpe.vocab the same way as the vocab file from fastBPE, it reported an error due to a format issue.

In prepare-iwslt17-multilingual.sh, a sentencepiece model is learned and applied to the raw training data. I obtained a sentencepiece.bpe.model and a sentencepiece.bpe.vocab.

However, in the bash script from the fairseq multilingual translation README (https://github.com/pytorch/fairseq/tree/master/examples/translation), there is no further reference to either file. In fairseq-preprocess, the dictionaries dict.en.txt and dict.de.txt are generated from train.bpe.de-en. They differ from each other, and they also differ from sentencepiece.bpe.vocab. That is not what I expected.

I am puzzled because I want a unified joint dictionary for every language, and it should be based on sentencepiece.bpe.vocab or sentencepiece.bpe.model. The way dict.en.txt is produced seems fairly arbitrary to me, and sentencepiece.bpe.vocab goes unused in both fairseq-preprocess and fairseq-train.

# using sentencepiece and the processed files from prepare-iwslt17-multilingual.sh

# First install sacrebleu and sentencepiece
pip install sacrebleu sentencepiece

# Then download and preprocess the data
cd examples/translation/
bash prepare-iwslt17-multilingual.sh
cd ../..

# Binarize the de-en dataset
TEXT=examples/translation/iwslt17.de_fr.en.bpe16k
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train.bpe.de-en \
    --validpref $TEXT/valid0.bpe.de-en,$TEXT/valid1.bpe.de-en,$TEXT/valid2.bpe.de-en,$TEXT/valid3.bpe.de-en,$TEXT/valid4.bpe.de-en,$TEXT/valid5.bpe.de-en \
    --destdir data-bin/iwslt17.de_fr.en.bpe16k \
    --workers 10

# Binarize the fr-en dataset
# NOTE: it's important to reuse the en dictionary from the previous step
fairseq-preprocess --source-lang fr --target-lang en \
    --trainpref $TEXT/train.bpe.fr-en \
    --validpref $TEXT/valid0.bpe.fr-en,$TEXT/valid1.bpe.fr-en,$TEXT/valid2.bpe.fr-en,$TEXT/valid3.bpe.fr-en,$TEXT/valid4.bpe.fr-en,$TEXT/valid5.bpe.fr-en \
    --tgtdict data-bin/iwslt17.de_fr.en.bpe16k/dict.en.txt \
    --destdir data-bin/iwslt17.de_fr.en.bpe16k \
    --workers 10

mkdir -p checkpoints/multilingual_transformer
CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt17.de_fr.en.bpe16k/ \
    --max-epoch 50 \
    --ddp-backend=legacy_ddp \
    --task multilingual_translation --lang-pairs de-en,fr-en \
    --arch multilingual_transformer_iwslt_de_en \
    --share-decoders --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --warmup-updates 4000 --warmup-init-lr '1e-07' \
    --label-smoothing 0.1 --criterion label_smoothed_cross_entropy \
    --dropout 0.3 --weight-decay 0.0001 \
    --save-dir checkpoints/multilingual_transformer \
    --max-tokens 4000 \
    --update-freq 8
lematt1991 commented 3 years ago

However, in the bash script from the fairseq multilingual translation README (https://github.com/pytorch/fairseq/tree/master/examples/translation), there is no further reference to either file.

There shouldn't be. As you mention, we train the sentencepiece model on all of the raw text files and then apply it to each text file individually. The spm_encode script (which applies the BPE) writes new text files as output: $DATA/train.bpe.${SRC}-${TGT}.${SRC} and $DATA/train.bpe.${SRC}-${TGT}.${TGT}. These are what get passed to fairseq-preprocess. Does this clarify? Basically you just need to train sentencepiece on all of your input text files, apply the BPE to each text file, and then preprocess using fairseq-preprocess.
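Roughly, the whole flow looks like the sketch below. This is not the exact set of commands from prepare-iwslt17-multilingual.sh: the file names and vocab size are placeholders, and the flags are the ones I'd expect the spm wrapper scripts to forward to sentencepiece, so double-check them against your checkout.

# sketch: train ONE sentencepiece model on all raw training files
python scripts/spm_train.py \
    --input=$DATA/train.de-en.de,$DATA/train.de-en.en,$DATA/train.fr-en.fr,$DATA/train.fr-en.en \
    --model_prefix=$DATA/sentencepiece.bpe \
    --vocab_size=16384 \
    --model_type=bpe

# apply the shared model to each raw text file individually
for f in train.de-en.de train.de-en.en train.fr-en.fr train.fr-en.en; do
    python scripts/spm_encode.py \
        --model $DATA/sentencepiece.bpe.model \
        --output_format=piece \
        --inputs $DATA/$f \
        --outputs $DATA/${f/train/train.bpe}   # e.g. train.de-en.de -> train.bpe.de-en.de
done

# then binarize the resulting train.bpe.* files with fairseq-preprocess, as in the README snippet above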

ever4244 commented 3 years ago

However, in the bash script from the fairseq multilingual translation README (https://github.com/pytorch/fairseq/tree/master/examples/translation), there is no further reference to either file.

There shouldn't be. As you mention, we train the sentencepiece model on all of the raw text files and then apply it to each text file individually. The spm_encode script (which applies the BPE) writes new text files as output: $DATA/train.bpe.${SRC}-${TGT}.${SRC} and $DATA/train.bpe.${SRC}-${TGT}.${TGT}. These are what get passed to fairseq-preprocess. Does this clarify? Basically you just need to train sentencepiece on all of your input text files, apply the BPE to each text file, and then preprocess using fairseq-preprocess.

Thank you. I have no problem learning and applying the sentencepiece model. However, I still don't understand how to use sentencepiece with fairseq-preprocess.

In my previous experiment, I used fastBPE. With fastBPE you obtain a vocab file when you learn the BPE codes, and you can pass that vocab.txt to fairseq-preprocess because it is already in the token / token-frequency format. This way you always get a unified joint dictionary for all languages by setting --tgtdict to the vocab file:

fairseq-preprocess --source-lang $src --target-lang $tgt \
    --trainpref $bpe/train.$src-$tgt \
    --joined-dictionary --tgtdict $bpe/vocab \
    --destdir $data_bin \
    --workers 40

This way, the dictionaries for the different languages (e.g. dict.en.txt, dict.de.txt, dict.fr.txt) are all identical, since they are all copied from vocab.txt. That is important for my multilingual translation model.
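For reference, the vocab file from fastBPE already looks like a fairseq dictionary, one token and its count per line, roughly like this (the counts below are made up):

the 1234567
und 987654
ver@@ 54321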

However, the sentencepiece.bpe.vocab produced by sentencepiece differs from the vocab file produced by fastBPE. sentencepiece.bpe.vocab looks like this:

<unk>   0
<s> 0
</s>    0
en  -0
▁d  -1
er  -2
es  -3
on  -4
▁a  -5
in  -6
▁p  -7
▁l  -8
▁s  -9
▁c  -10
ti  -11
▁t  -12
re  -13
▁de -14
is  -15

where the second column is actually the negative id index rather than a frequency, so the file cannot be passed to fairseq-preprocess as --tgtdict. So I am wondering:

Where can I get such a unified dictionary (like the vocab.txt from fastBPE) out of sentencepiece and provide it to fairseq-preprocess, so that all language pairs (e.g. en-de, en-fr, en-es, fr-es, de-es) share the same dictionary during fairseq-preprocess?

The vocabulary size changes if I don't provide the sentencepiece.bpe.vocab or the vocab.txt from fastBPE to fairseq-preprocess, and the dictionaries for the different languages also differ across the fairseq-preprocess runs for the different language pairs. With both fastBPE and sentencepiece I already obtain an exact 50K joint vocabulary. The difference is that I can pass the vocab.txt from fastBPE to fairseq-preprocess, but I cannot pass sentencepiece.bpe.vocab because of the format issue.

There is a similar issue here; I wonder whether anything has changed after two years. Also, why do I have to remove <unk>, <s> and </s> from the sentencepiece dictionary? Would there be any negative effect from setting the frequency to some dummy number (e.g. 100)?

You should probably regenerate the dictionary to get the exact number of units, otherwise you'll have embeddings in your model that won't be trained.

If you want to reuse the sentencepiece dictionary, you can easily convert it to the fairseq format. The main differences are that fairseq uses the format <token> <frequency> (with a space) whereas sentencepiece uses <token>\t<negative_id> (with a tab). Fairseq uses the frequency column to do filtering, so you can simply create a new dictionary with a dummy count of 100 or something. You also need to remove <unk>, <s> and </s> from the sentencepiece dictionary:

cut -f1 sentencepiece.vocab | tail -n +4 | sed "s/$/ 100/g" > fairseq.vocab

I'll also be merging a commit shortly that adds a --remove-bpe=sentencepiece option to generate.py so that you can detokenize the sentencepiece output during generation.

Originally posted by @myleott in https://github.com/pytorch/fairseq/issues/459#issuecomment-458593467
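If that conversion still works, I suppose my preprocessing would become something like the sketch below. This is just what I am aiming for, reusing my fastBPE loop from above and assuming fairseq-preprocess accepts the converted file as --tgtdict:

# convert sentencepiece.bpe.vocab into fairseq's <token> <count> format with a
# dummy count of 100, dropping <unk>, <s> and </s> (fairseq adds those itself)
cut -f1 $bpe/sentencepiece.bpe.vocab | tail -n +4 | sed "s/$/ 100/g" > $bpe/fairseq.vocab

# reuse the same converted dictionary for every language pair
for lang_pair in de-en de-es de-fr en-es en-fr es-fr; do
    src=`echo $lang_pair | cut -d'-' -f1`
    tgt=`echo $lang_pair | cut -d'-' -f2`
    fairseq-preprocess --source-lang $src --target-lang $tgt \
        --trainpref $bpe/train.bpe.$src-$tgt \
        --joined-dictionary --tgtdict $bpe/fairseq.vocab \
        --destdir $data_bin \
        --workers 20
done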

lematt1991 commented 3 years ago

Assuming you learn the BPE tokens on all of your input files, then after you apply the BPE with spm_encode, the encoded text files will already share a common vocabulary. This way you don't need to pass a dictionary to fairseq-preprocess; it will create the dictionary for you.

ever4244 commented 3 years ago

Assuming you learn the BPE tokens on all of your input files, then after you apply the BPE with spm_encode, the encoded text files will already share a common vocabulary. This way you don't need to pass a dictionary to fairseq-preprocess; it will create the dictionary for you.

Thank you very much. I have edited my last reply a little bit:

The vocabulary size changes if I don't provide the sentencepiece.bpe.vocab or the vocab.txt from fastBPE to fairseq-preprocess, and the dictionaries for the different languages also differ across the fairseq-preprocess runs for the different language pairs. With both fastBPE and sentencepiece I already obtain an exact 50K joint vocabulary. The difference is that I can pass the vocab.txt from fastBPE to fairseq-preprocess, but I cannot pass sentencepiece.bpe.vocab because of the format issue.

What do you mean by "the encoded text files should already be in a common vocab"? If I first run fairseq-preprocess on train.bpe.de-en, then on train.bpe.fr-en, then train.bpe.en-es, then train.bpe.fr-es, then train.bpe.de-es, and so on, how can fairseq-preprocess learn a common vocab for all language pairs? In my previous experiments, the dictionaries (dict.en.txt, dict.de.txt, dict.es.txt, dict.fr.txt) for the different language pairs were all different from each other, and their size and content differed from the 50K dictionary I get from fastBPE (vocab) or sentencepiece (sentencepiece.bpe.vocab).

On the other hand, if I set --tgtdict to the vocab file from fastBPE, as in my previous experiments, then dict.en.txt, dict.de.txt, dict.es.txt, and dict.fr.txt all end up identical to vocab.txt.

I understand that fairseq-preprocess automatically creates the dictionaries dict.en.txt, dict.de.txt, dict.es.txt, and dict.fr.txt for each language pair, but my problem is that they differ from each other and their size is not 50K unless I provide the 50K vocab.txt.
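To make my goal concrete, what I want is something like the sketch below, where the joint dictionary produced by the first pair is forced onto all the other pairs. This assumes the jointly trained sentencepiece model means that first dictionary already covers essentially all pieces; otherwise I would fall back to the converted sentencepiece.bpe.vocab above.

# sketch: build a joint dictionary once from the first pair ...
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $bpe/train.bpe.de-en \
    --joined-dictionary \
    --destdir $data_bin --workers 20

# ... then force every remaining pair to reuse it
for lang_pair in de-es de-fr en-es en-fr es-fr; do
    src=`echo $lang_pair | cut -d'-' -f1`
    tgt=`echo $lang_pair | cut -d'-' -f2`
    fairseq-preprocess --source-lang $src --target-lang $tgt \
        --trainpref $bpe/train.bpe.$src-$tgt \
        --joined-dictionary --tgtdict $data_bin/dict.en.txt \
        --destdir $data_bin --workers 20
done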

sarthmit commented 3 years ago

Hi, were you able to train the multilingual model on the IWSLT-17 pairs?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!