facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

How to Finetune fairseq M2M 100 Model for a Language Pair? #3233

Open ShihabYasin opened 3 years ago

ShihabYasin commented 3 years ago

I want to finetune existing M2M 100 models (418M, 1.2B, 12B) for a particular language pair translation say (bn-en).

Which script(s) or training sequence/tutorial/doc should I follow? I am currently trying this script, but getting the error below.

[screenshot of the error message]

What is the right choice for --arch?

ShihabYasin commented 3 years ago

Is there any update on this, @shruti-bh?

jaspock commented 3 years ago

Same problem here. I don't know which --arch and --task to use. Using Fairseq 0.10.2, the closest I seem to get after trying different combinations of --arch (multilingual_transformer, mbart_large, transformer...) and --task (translation_multi_simple_epoch, multilingual_translation) is:

fairseq-train ./data_bin --finetune-from-model ./418M_last_checkpoint.pt --save-dir ./checkpoint --arch mbart_large --task translation_multi_simple_epoch --layernorm-embedding --encoder-normalize-before --langs $(cat langs.txt) --lang-pairs "en-es,es-en" --decoder-normalize-before --sampling-method "temperature" --sampling-temperature 1.5 --encoder-langtok "src" --decoder-langtok --max-tokens 768 ...

But still I get errors:

Missing key(s) in state_dict: "encoder.embed_positions.weight", "encoder.layernorm_embedding.weight", "encoder.layernorm_embedding.bias", "decoder.embed_positions.weight", "decoder.layernorm_embedding.weight", "decoder.layernorm_embedding.bias". 

Unexpected key(s) in state_dict: "encoder.embed_positions._float_tensor", "decoder.embed_positions._float_tensor". 

Cannot load model parameters from checkpoint... please ensure that the architectures match.
jaspock commented 3 years ago

@vitaka kindly showed me how to inspect the model and obtain the training parameters. TL;DR: M2M 418M uses the architecture transformer_wmt_en_de_big, and M2M 12B uses transformer_wmt_en_de_big_pipeline_parallel.

To find out this parameter, or any other, for a given checkpoint, run:

import torch
import json
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model", help="Model path")

if __name__ == '__main__':
    args = parser.parse_args()
    checkpoint = torch.load(args.model)
    par_dict = vars(checkpoint['args'])
    print(json.dumps(par_dict, indent=2, sort_keys=True))
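
If your checkpoint was saved by a more recent fairseq version, the training configuration may be stored under a 'cfg' key (an OmegaConf object) rather than under 'args'. Here is a small variant of the script above that tries both; this is just a sketch and has not been tested on every checkpoint:

import argparse
import json

import torch
from omegaconf import OmegaConf  # installed together with recent fairseq versions

parser = argparse.ArgumentParser()
parser.add_argument("--model", help="Model path")

if __name__ == '__main__':
    args = parser.parse_args()
    checkpoint = torch.load(args.model, map_location="cpu")
    if checkpoint.get("args") is not None:
        par_dict = vars(checkpoint["args"])  # older checkpoints
    else:
        # newer, hydra-based checkpoints store an OmegaConf config under 'cfg'
        par_dict = OmegaConf.to_container(checkpoint["cfg"])
    print(json.dumps(par_dict, indent=2, sort_keys=True, default=str))
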
nikhiljaiswal commented 3 years ago

Were you able to find the right set of parameters? I am still trying to figure it out. I tried the following, but the architecture still does not match:

fairseq-train $path_2_data \
    --finetune-from-model $pretrained_model \
    --max-epoch 500 \
    --ddp-backend=legacy_ddp \
    --task translation_multi_simple_epoch \
    --lang-pairs de-en,en-de \
    --arch transformer_wmt_en_de_big \
    --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --warmup-updates 4000 --warmup-init-lr '1e-07' \
    --label-smoothing 0.1 --criterion label_smoothed_cross_entropy \
    --dropout 0.3 --weight-decay 0.0001 \
    --max-tokens 4000 \
    --update-freq 8

nikhiljaiswal commented 3 years ago

Hi @jaspock @ShihabYasin @shruti-bh, please help me select the correct parameters. I am getting the following error:

Missing key(s) in state_dict: "decoder.output_projection.weight".

My parameters are: arch='transformer_wmt_en_de_big' task='translation_multi_simple_epoch'

jaspock commented 3 years ago

There are three M2M versions with different numbers of parameters. fairseq-train currently requires that you specify all the model parameters (embedding sizes, number of layers, etc.) whose default values do not match those actually used in the architecture of the model you are loading. The parameters are already stored in the downloaded model (see my script above to list them), but Fairseq first creates an initial model with the architecture indicated by --arch and only then copies the parameters from the loaded model into that initial model; if there is any architectural difference, the model won't be loaded. So you have to carefully review all the parameters in the downloaded model (again, you can use my script) and copy the relevant keys and values as command-line parameters of fairseq-train so that they are used in the initial model. After some trial and error, the following commands (see below) have worked for me.

Notice that my examples include some parameters which do not define the architecture, e.g. optimizer stuff, and may be adapted to your particular task. Most of the parameters related to the model architecture are indicated after --arch in my examples. Also, I renamed all the models to model.pt; change the file name in --finetune-from-model if necessary.

Sadly, however, although there are no errors, I haven't managed to fine-tune the largest 12B model despite having 2x40GB GPUs. I have some memory issues when training is about to start. I'm still working on that (let me know if you have some progress on this), but the model is successfully loaded, which answers your question :-)

Fine-tuning of M2M 418M:

fairseq-train data_bin --finetune-from-model /models/m2m-418M/model.pt --save-dir /checkpoint --task translation_multi_simple_epoch --encoder-normalize-before --langs 'af,am,ar,ast,az,ba,be,bg,bn,br,bs,ca,ceb,cs,cy,da,de,el,en,es,et,fa,ff,fi,fr,fy,ga,gd,gl,gu,ha,he,hi,hr,ht,hu,hy,id,ig,ilo,is,it,ja,jv,ka,kk,km,kn,ko,lb,lg,ln,lo,lt,lv,mg,mk,ml,mn,mr,ms,my,ne,nl,no,ns,oc,or,pa,pl,ps,pt,ro,ru,sd,si,sk,sl,so,sq,sr,ss,su,sv,sw,ta,th,tl,tn,tr,uk,ur,uz,vi,wo,xh,yi,yo,zh,zu' --lang-pairs 'en-es,es-en' --max-tokens 1200 --decoder-normalize-before --sampling-method temperature --sampling-temperature 1.5 --encoder-langtok src --decoder-langtok --criterion label_smoothed_cross_entropy --label-smoothing 0.2 --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' --lr-scheduler inverse_sqrt --lr 3e-05 --warmup-updates 2500 --max-update 40000 --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 --update-freq 2 --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints --seed 222 --log-format simple --log-interval 2 --patience 10 --arch transformer_wmt_en_de_big --encoder-layers 12 --decoder-layers 12 --encoder-layerdrop 0.05 --decoder-layerdrop 0.05 --share-decoder-input-output-embed --share-all-embeddings --ddp-backend no_c10d

Fine-tuning of M2M 1.2B:

fairseq-train data_bin --finetune-from-model /models/m2m-1.2B/model.pt --save-dir /checkpoint --task translation_multi_simple_epoch --encoder-normalize-before --langs 'af,am,ar,ast,az,ba,be,bg,bn,br,bs,ca,ceb,cs,cy,da,de,el,en,es,et,fa,ff,fi,fr,fy,ga,gd,gl,gu,ha,he,hi,hr,ht,hu,hy,id,ig,ilo,is,it,ja,jv,ka,kk,km,kn,ko,lb,lg,ln,lo,lt,lv,mg,mk,ml,mn,mr,ms,my,ne,nl,no,ns,oc,or,pa,pl,ps,pt,ro,ru,sd,si,sk,sl,so,sq,sr,ss,su,sv,sw,ta,th,tl,tn,tr,uk,ur,uz,vi,wo,xh,yi,yo,zh,zu' --lang-pairs 'en-es,es-en' --max-tokens 1200 --decoder-normalize-before --sampling-method temperature --sampling-temperature 1.5 --encoder-langtok src --decoder-langtok --criterion label_smoothed_cross_entropy --label-smoothing 0.2 --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' --lr-scheduler inverse_sqrt --lr 3e-05 --warmup-updates 2500 --max-update 40000 --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 --update-freq 2 --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints --seed 222 --log-format simple --log-interval 2 --patience 10 --arch transformer_wmt_en_de_big --encoder-layers 24 --decoder-layers 24 --encoder-ffn-embed-dim 8192 --decoder-ffn-embed-dim 8192 --encoder-layerdrop 0.05 --decoder-layerdrop 0.05 --share-decoder-input-output-embed --share-all-embeddings --ddp-backend no_c10d

Fine-tuning of M2M 12B:

fairseq-train data_bin --finetune-from-model /models/m2m-12B-2GPU/model.pt --save-dir /checkpoint --task translation_multi_simple_epoch --encoder-normalize-before --langs 'af,am,ar,ast,az,ba,be,bg,bn,br,bs,ca,ceb,cs,cy,da,de,el,en,es,et,fa,ff,fi,fr,fy,ga,gd,gl,gu,ha,he,hi,hr,ht,hu,hy,id,ig,ilo,is,it,ja,jv,ka,kk,km,kn,ko,lb,lg,ln,lo,lt,lv,mg,mk,ml,mn,mr,ms,my,ne,nl,no,ns,oc,or,pa,pl,ps,pt,ro,ru,sd,si,sk,sl,so,sq,sr,ss,su,sv,sw,ta,th,tl,tn,tr,uk,ur,uz,vi,wo,xh,yi,yo,zh,zu' --lang-pairs 'en-es,es-en' --max-tokens 1200 --decoder-normalize-before --sampling-method temperature --sampling-temperature 1.5 --encoder-langtok src --decoder-langtok --criterion label_smoothed_cross_entropy --label-smoothing 0.2 --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' --lr-scheduler inverse_sqrt --lr 3e-05 --warmup-updates 2500 --max-update 40000 --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 --update-freq 2 --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints --seed 222 --log-format simple --log-interval 2 --patience 10 --arch transformer_wmt_en_de_big_pipeline_parallel --encoder-layers 24 --decoder-layers 24 --encoder-attention-heads 16 --decoder-attention-heads 16 --encoder-ffn-embed-dim 16384 --decoder-ffn-embed-dim 16384 --decoder-embed-dim 4096 --encoder-embed-dim 4096 --num-embedding-chunks 2 --pipeline-balance '[29,22,1]' --pipeline-devices '[0,1,0]' --fp16 --dataset-impl mmap --pipeline-chunks 1 --share-decoder-input-output-embed --share-all-embeddings --ddp-backend no_c10d --clip-norm 1.0
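
One more trick that may save some trial and error (a heuristic of mine, not an official list of architecture flags): filter the checkpoint arguments for architecture-looking keys and print them in command-line form, then compare them with what you pass to fairseq-train.

import torch

ckpt = torch.load("model.pt", map_location="cpu")  # path is a placeholder
args = vars(ckpt["args"])
arch_keywords = ("arch", "encoder", "decoder", "embed", "layer", "share", "activation")
for key in sorted(args):
    if any(k in key for k in arch_keywords):
        # printed for inspection only; booleans and None values need manual handling
        print(f"--{key.replace('_', '-')} {args[key]}")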

I hope this helps!

nikhiljaiswal commented 3 years ago

Hi @jaspock, thanks for the detailed answer. I tried to run with the suggestions you provided; the earlier issue was resolved, but now I am getting an embedding size mismatch error:

RuntimeError: Error(s) in loading state_dict for TransformerModel: size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([128112, 1024]) from checkpoint, the shape in current model is torch.Size([128104, 1024]).

Can you suggest what could be wrong here?

I am trying to finetune the m2m100 418M model on lang_pairs="de-en,en-de"

for creating the spm file, I have followed the script from https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-iwslt17-multilingual.sh

and for creating the binarized file, I have followed the steps from here: https://github.com/pytorch/fairseq/tree/master/examples/translation

TEXT=/home/nikhil/workspace//5_document_mt/eng_jap_nmt/evaluation_tools/multilingual_N/IBT/data

fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/spm.train.de-en \
    --validpref $TEXT/spm.valid.de-en \
    --destdir $TEXT/bin_data/med.de_en \
    --thresholdsrc 0 --thresholdtgt 0 \
    --srcdict /home/nikhil/workspace/5_document_mt/eng_jap_nmt/evaluation_tools/multilingual_N/data_dict.128k.txt \
    --tgtdict /home/nikhil/workspace/5_document_mt/eng_jap_nmt/evaluation_tools/multilingual_N/data_dict.128k.txt

fairseq-preprocess --source-lang en --target-lang de \
    --trainpref $TEXT/spm.train.en-de \
    --validpref $TEXT/spm.valid.en-de \
    --destdir $TEXT/bin_data/med.de_en \
    --thresholdsrc 0 --thresholdtgt 0 \
    --srcdict /home/nikhil/workspace/5_document_mt/eng_jap_nmt/evaluation_tools/multilingual_N/data_dict.128k.txt \
    --tgtdict /home/nikhil/workspace/5_document_mt/eng_jap_nmt/evaluation_tools/multilingual_N/data_dict.128k.txt

except that I have not trained the BPE model; instead, I have set the srcdict & tgtdict parameters to data_dict.128k.txt, which comes from the trained m2m model.

jaspock commented 3 years ago

The error message you get indicates that the number of rows of the embedding matrix/table in the initial model (128104) does not match the corresponding value in the loaded model (128112). As I said before, you have to carefully ensure that all parameters match. In this case, however, it is not the number of columns that is wrong (that is, the embedding size, for which a parameter exists), but the number of rows (that is, the vocabulary size). Maybe there is some missing command-line parameter that automatically extends the vocabulary with one extra token per language (these language tokens are necessary to tell the encoder and decoder which language to process, and I assume this is why you need to provide them in the --langs command-line argument), but I couldn't find it. What I did was manually extend the data_dict.128k.txt file in the downloaded model by appending the lines in my file dict_extension_langs_m2m.txt.

The original 128000 lines now become 128108. It turns out that fairseq automatically adds 4 extra tokens (end of sentence, unknown word...) which results in the expected 128112 tokens in the vocabulary. Note that things could go wrong if, for example, the languages on my list (in lexicographical order) are not in the same order as that considered during pre-training, because the index corresponding to the languages you indicate with --lang-pairs would not match those used during pre-training. In my case, I am using M2M to generate both English and Pashto, and the system is generating English and Pashto, so for me it is working ;-) I still have some memory issues when trying to fine-tune the 12B model, though :-(

One last point: how did I know that language codes are of the form __en__? Well, if you append different codes (say, of the form +en+) and then ask M2M to generate English, it will complain that no index is found for the token __en__, which gives you the hint.
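
If it helps, this is roughly how such an extension can be generated. This is only a sketch of the idea, not the exact contents of my dict_extension_langs_m2m.txt: the "<token> <count>" line layout, the langs.txt file (the comma-separated list passed to --langs) and the madeupwordNNNN filler names follow fairseq conventions as far as I understand them, so double-check the final count before training.

with open("langs.txt") as f:  # comma-separated language list, as passed to --langs
    langs = f.read().strip().split(",")

with open("data_dict.128k.txt", "a") as d:  # append to the downloaded dictionary
    for lang in langs:
        d.write(f"__{lang}__ 1\n")  # dummy frequency of 1
    for i in range(8):
        # filler entries so that 128000 + 100 languages + 8 fillers = the 128108 lines mentioned above
        d.write(f"madeupword{i:04d} 1\n")

with open("data_dict.128k.txt") as f:
    n_lines = sum(1 for _ in f)
# fairseq adds 4 special symbols (<s>, <pad>, </s>, <unk>) on top of the file entries
print(f"{n_lines} entries + 4 specials = {n_lines + 4} (should match the 128112 embedding rows)")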

@shruti-bh, is the downloadable M2M dictionary incomplete or am I missing something?

@nikhiljaiswal, tell me about your progress.

maroxtn commented 3 years ago

@jaspock Thanks for your input, it helped make the training script work. However, the problem I am facing now is that the loss is barely converging, despite the small size of the dataset I am using (10k). The language pair is not a common one (English-Yoruba), but even a tiny transformer would perform better.

This leads me to think that the tokens are not being mapped to the right embeddings. What do you think?

maroxtn commented 3 years ago

An update, I have a working training script here: https://github.com/maroxtn/mt5-M2M-comparison

nikhiljaiswal commented 2 years ago

@jaspock, I was able to finetune using your help. I have another query: I want to finetune the m2m100 model on a set of language pairs instead of a single pair. For example, I have datasets of the following form: ja-en, fr-en, ru-en. Can I combine these datasets and finetune m2m100 directly on them? Please help.

jaspock commented 2 years ago

@nikhiljaiswal , glad to see that my comments were useful to you. You can train under a multilingual setting by adding these arguments that you can also find in my command-line examples above: --task translation_multi_simple_epoch --encoder-langtok src --decoder-langtok --lang-pairs "ja-en,fr-en,ru-en". This will activate some features such as appending a language token representing the source before the input sentence and another token representing the target language before the decoder auto-regressive/teacher-forcing input. The pre-trained model was trained with these arguments so you must use them during inference/fine-tuning as well. By looking at Fairseq's code for this task you will see that it basically performs the following loop as stated in the comments:

for i in range(len(epoch)):
    for lang_pair in args.lang_pairs:
        batch = next_batch_for_lang_pair(lang_pair)
        loss = criterion(model_for_lang_pair(lang_pair), batch)
        loss.backward()
nikhiljaiswal commented 2 years ago

Thanks @jaspock for the answer, it really helped me. I had one doubt: when I prepare the dataset, suppose I need to finetune on ja-en & de-en. After creating the preprocessed files, I store the data in a ja-en folder and a de-en folder. I think I then need to combine the contents of both folders into a single folder, say bin, and pass that folder as the data path to the training command, right? My doubt is: since an en dict would have been created in both the ja-en and the de-en folders, after combining there will be a single en dictionary, right? Is this the correct way or am I missing something?

juncaofish commented 2 years ago

import torch
import json
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model", help="Model path")

if __name__ == '__main__':
    args = parser.parse_args()
    checkpoint = torch.load(args.model)
    par_dict = vars(checkpoint['args'])
    print(json.dumps(par_dict, indent=2, sort_keys=True))

Hi @jaspock, I checked the 1.2B model with your script and found that the model doesn't have the language-specific sparse layers described in the m2m-100 paper. Could you kindly verify this, please?

jaspock commented 2 years ago

@nikhiljaiswal, notice that you must stick to the dictionary that was used during pretraining, since the neural model's vocabulary consists of exactly these words. This is why you use --srcdict and --tgtdict in fairseq-preprocess and make them both point to the dictionary model_dict.128k.txt (a single file, as expected in a multilingual setting) that you downloaded along with the model; these options basically mean: "simply create the binary representation of the corpora; don't create new dictionaries but use the provided ones". After that, you can use --fixed-dictionary for generation and (not completely sure about this one) training, indicating again the dictionary file model_dict.128k.txt.

A vocabulary of 128k words implies a large embedding table. If you fine-tune/run your system with a reduced number of languages, a significant part of those embeddings will never be used. Read issue #2120 to find a couple of scripts that will let you trim the unused embeddings from the table. The effect on the checkpoint size will depend on the size of the model: for smaller models, it will be more noticeable.
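
For reference, the rough idea behind such trimming looks like the sketch below. It is only an illustration (issue #2120 has complete, tested scripts); the file names are placeholders, the 4-symbol offset assumes fairseq's default special tokens, and the reduced dictionary must then be the one you pass to fairseq-preprocess / --fixed-dictionary so that indices line up.

import torch

def load_tokens(path):
    # dictionary lines are "<token> <count>"
    with open(path) as f:
        return [line.split(" ")[0] for line in f if line.strip()]

full_tokens = load_tokens("data_dict.128k.txt")  # the full (extended) dictionary
kept_tokens = load_tokens("reduced_dict.txt")    # hypothetical trimmed dictionary (a subset)

# fairseq prepends <s>, <pad>, </s>, <unk>, so token i in the file maps to embedding row i + 4
full_index = {tok: i + 4 for i, tok in enumerate(full_tokens)}
keep_rows = list(range(4)) + [full_index[t] for t in kept_tokens]

ckpt = torch.load("model.pt", map_location="cpu")
for key in ("encoder.embed_tokens.weight",
            "decoder.embed_tokens.weight",
            "decoder.output_projection.weight"):
    if key in ckpt["model"]:
        ckpt["model"][key] = ckpt["model"][key][keep_rows]

torch.save(ckpt, "model_trimmed.pt")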

jaspock commented 2 years ago

Hi, @juncaofish. I took no part in the development or release of the M2M-100 model; I am just a humble researcher who tried to fine-tune it and who occasionally contributes to this discussion. That said, I see that in the paper language-specific parameters are introduced in section 5.2, where it is stated that "language-specific layer adds 3.4B [...] The total size of this model is 15.4B parameters". As the experiments in section 5.1 (before the introduction of language-specific parameters) involve models with sizes 418M, 1.2B and 12B, which match those of the models you can download, I deduce that the released models are the dense ones, before sparsity is considered.

nikhiljaiswal commented 2 years ago

@jaspock were you able to finetune the 12B model?

jaspock commented 2 years ago

No, @nikhiljaiswal. I found that the initial mBART50 model (with no fine-tuning) was giving similar or better BLEU scores than the 12B M2M for my languages of interest, and decided to fine-tune mBART50. If you have English as source or target language, it may be worth trying.

nikhiljaiswal commented 2 years ago

@jaspock thanks for the suggestion. I will use the initial mBART50 model, but I also need to finetune it on my data. Do you have the script to finetune the mBART50 model?

ajesujoba commented 2 years ago

Hi @jaspock , I keep on getting AssertionError: cannot find language token __en__ in the dictionary during inference. Do you know why?

jaspock commented 2 years ago

@nikhiljaiswal, the mBART50 models have the same issue with the provided dictionaries as the one indicated here for M2M. Therefore, you need to extend the downloaded dictionary with language and padding tokens. Note that language tokens are slightly different in this case (en_XX for mBART50 instead of __en__ as in M2M, for example). Find here the lines that you have to append to the downloaded dictionary: dict_extension_langs_mbart50.txt. You will also need a file with the list of languages: langs_mbart50.txt.

Find below the commands to fine-tune mBART50 as well. Note that as already explained here for M2M, you have to carefully check that all the mBART50 parameters that do not have the default Fairseq values are explicitly added to fairseq-train.

As usual, first tokenize and binarize train, dev and test sets:

for dataset in train valid test
do
  python spm_encode.py \
    --model sentence.bpe.model \
    --output_format=piece \
    --inputs=${SRC}.${dataset} \
    --outputs=spm.${dataset}.${SRC}-${TRG}.${SRC}

  python spm_encode.py \
    --model sentence.bpe.model \
    --output_format=piece \
    --inputs=${TRG}.${dataset} \
    --outputs=spm.${dataset}.${SRC}-${TRG}.${TRG}
done

fairseq-preprocess \
    --source-lang ${SRC} --target-lang ${TRG} \
    --testpref spm.test.${SRC}-${TRG} \
    --validpref spm.valid.${SRC}-${TRG} \
    --trainpref spm.train.${SRC}-${TRG} \
    --thresholdsrc 0 --thresholdtgt 0 \
    --destdir data_bin_${SRC}_${TRG} \
    --srcdict extended_dict.txt \
    --tgtdict extended_dict.txt

The test set can be translated with the plain non-fine-tuned model with:

fairseq-generate \
    data_bin_${SRC}_${TRG} \
    --max-tokens 2000 \
    --path mbart50.ft.nn \
    --fixed-dictionary extended_dict.txt \
    -s ${SRC} -t ${TRG} \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs "${SRC}-${TRG},${TRG}-${SRC}" \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --dataset-impl mmap \
    --distributed-world-size 1 \
    --distributed-no-spawn \
    --results-path generation_${SRC}_${TRG}

Now, fine-tune to adapt the model to your language pair of interest (adapt those parameters that are not architecture-dependent to your needs):

fairseq-train \
    data_bin_${SRC}_${TRG} \
    --max-tokens 2000 \
    --finetune-from-model mbart50.ft.nn \
    --task translation_multi_simple_epoch \
    --langs $(cat langs_mbart50.txt) \
    --lang-pairs "${SRC}-${TRG},${TRG}-${SRC}" \
    --patience 10 \
    --save-dir checkpoint \
    --save-interval-updates 300 \
    --validate-interval-updates 300 \
    --keep-interval-updates 1 \
    --best-checkpoint-metric "loss" \
    --keep-best-checkpoints 1 \
    --keep-last-epochs 1 \
    --save-interval 9999999 \
    --validate-interval 9999999 \
    --encoder-langtok src \
    --decoder-langtok \
    --decoder-attention-heads 16 \
    --decoder-layerdrop 0 \
    --decoder-normalize-before \
    --encoder-attention-heads 16 \
    --encoder-layerdrop 0 \
    --encoder-normalize-before \
    --activation-dropout 0.0 \
    --activation-fn relu \
    --optimizer "adam" \
    --adam-betas "(0.9, 0.98)" \
    --adaptive-softmax-dropout 0 \
    --clip-norm 0.0 \
    --dataset-impl "mmap" \
    --ddp-backend "c10d" \
    --fp16 \
    --sampling-method temperature \
    --sampling-temperature 1.5 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.2 \
    --lang-tok-style "multilingual" \
    --required-batch-size-multiple 8 \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --weight-decay 0.0 \
    --no-epoch-checkpoints \
    --lr-scheduler inverse_sqrt \
    --lr 0.0006 \
    --seed 222 \
    --no-progress-bar \
    --log-format "simple" \
    --log-interval 1 \
    --arch mbart_large \
    --layernorm-embedding \
    --adam-eps 1e-06 \
    --warmup-updates 2500 \
    --max-update 40000 \
    --dropout 0.3 \
    --attention-dropout 0.1 2>&1 | tee train.log

Finally, note that the many-to-one mBART50 model may require a small fix.

jaspock commented 2 years ago

@ajesujoba, not sure what is happening. Have you double-checked that you appended the lines of the M2M dictionary extension at the end of the downloaded dictionary file?

ajesujoba commented 2 years ago

Thank you @jaspock. Do I add those lines to the dict for both training and inference? I am not sure about the training part, though.

jaspock commented 2 years ago

Use --srcdict extended_dict.txt --tgtdict extended_dict.txt in fairseq-preprocess and --fixed-dictionary extended_dict.txt in fairseq-generate or fairseq-interactive. The file extended_dict.txt is the original dictionary with the special tokens appended. As training works on the data already binarized by fairseq-preprocess (where tokens have been replaced with indexes), you don't have to add any dictionary-related parameter to fairseq-train.

nikhiljaiswal commented 2 years ago

Hi @jaspock, I want to finetune a pre-trained model on a custom dataset that contains some new tags. These tags were not present when I trained the spm model while training the base model; hence, such tags are not present in the spm vocabulary list or in the dictionary. How can I add these new tokens to the spm vocab and the dictionary, and initialize the tokenizer embeddings?

MathieuGrosso commented 2 years ago

Hello @maroxtn! I have tried to fine-tune M2M too. I had the same embedding-size issue as you when using @jaspock's command for m2m 418M, so I applied the extension to the data dict to make it work.

The problem is that my loss is not decreasing or converging. I am wondering if this could be caused by the data dict file. Have you extended the data dict, as proposed by @jaspock, to make your training work?

Thanks a lot Mathieu

abdulrafae commented 2 years ago

@jaspock Your script helped me finetune the M2M models, thank you. However, like in your case, I am also not able to finetune the 12B model. I am using 2x32GB GPUs and the 2-GPU pretrained checkpoint, with the pipeline parameters set to --pipeline-balance '[29,22,1]' --pipeline-devices '[0,1,0]'. I see that the model tries to load on the 2 GPUs, but then it gives an error: hydra.errors.ConfigCompositionException: Error merging override distributed_training.ddp_backend='simple'. For me the 1.2B M2M is getting better performance than MBART-50, so I am trying to do the same with the 12B model. Were you able to finetune the M2M 12B model? Thanks for the help.

jaspock commented 2 years ago

@abdulrafae, I gave up trying to make the larger models run. I could have tried different CUDA versions, but did not find time to play with that. Is English your source or target language? If not, that could explain why M2M works better for you, as M2M is not English-centric like mBART50. BTW, note that there is at least one new model in town that could be worth exploring: NLLB-200.

abdulrafae commented 2 years ago

Thanks for the help. I will have a look at this new model. By the way, I have English as the target, but I think perhaps I am finetuning the pretrained mBART-50 and not the many-to-one model, which is why M2M is performing better.

FadedCosine commented 1 year ago

Same problem here. I don't know which --arch and --task to use. Using Fairseq 0.10.2, the closest I seem to get after trying different combinations of --arch (multilingual_transformer, mbart_large, transformer...) and --task (translation_multi_simple_epoch, multilingual_translation) is:

fairseq-train ./data_bin --finetune-from-model ./418M_last_checkpoint.pt --save-dir ./checkpoint --arch mbart_large --task translation_multi_simple_epoch --layernorm-embedding --encoder-normalize-before --langs $(cat langs.txt) --lang-pairs "en-es,es-en" --decoder-normalize-before --sampling-method "temperature" --sampling-temperature 1.5 --encoder-langtok "src" --decoder-langtok --max-tokens 768 ...

But still I get errors:

Missing key(s) in state_dict: "encoder.embed_positions.weight", "encoder.layernorm_embedding.weight", "encoder.layernorm_embedding.bias", "decoder.embed_positions.weight", "decoder.layernorm_embedding.weight", "decoder.layernorm_embedding.bias". 

Unexpected key(s) in state_dict: "encoder.embed_positions._float_tensor", "decoder.embed_positions._float_tensor". 

Cannot load model parameters from checkpoint... please ensure that the architectures match.

@jaspock I am getting the same error. Can you kindly tell me how you solved it?