facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Error in loading state_dict M2M100 during finetuning #4533

Open MathieuGrosso opened 2 years ago

MathieuGrosso commented 2 years ago

🐛 Bug

Hello, I want to finetune M2M100 (and other multilingual models) for research purposes, but I have noticed a bug (or maybe just a problem in the README). When trying to train the model again, I encounter a size mismatch between the embedding shape and the state_dict:

The error is: size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([128112, 1024]) from checkpoint, the shape in current model is torch.Size([128104, 1024]).

I am trying to understand how to fix this without manually modifying the state dict, which I know is possible but does not feel right. If I modify the state dict, the results will not really be comparable (the model would be learning a new preprocessing).

Thanks

To Reproduce

Preprocess: fairseq-preprocess --source-lang en --target-lang fr --trainpref {path}/train.en-fr --validpref {path}/valid.en-fr --thresholdsrc 0 --thresholdtgt 0 --destdir {path}/data_bin --srcdict {path}/data_dict.128k.txt --tgtdict {path}/data_dict.128k.txt

Training:

fairseq-train data_bin --finetune-from-model /models/m2m-418M/model.pt --save-dir /checkpoint --task translation_multi_simple_epoch --encoder-normalize-before --langs 'af,am,ar,ast,az,ba,be,bg,bn,br,bs,ca,ceb,cs,cy,da,de,el,en,es,et,fa,ff,fi,fr,fy,ga,gd,gl,gu,ha,he,hi,hr,ht,hu,hy,id,ig,ilo,is,it,ja,jv,ka,kk,km,kn,ko,lb,lg,ln,lo,lt,lv,mg,mk,ml,mn,mr,ms,my,ne,nl,no,ns,oc,or,pa,pl,ps,pt,ro,ru,sd,si,sk,sl,so,sq,sr,ss,su,sv,sw,ta,th,tl,tn,tr,uk,ur,uz,vi,wo,xh,yi,yo,zh,zu' --lang-pairs 'en-es,es-en' --max-tokens 1200 --decoder-normalize-before --sampling-method temperature --sampling-temperature 1.5 --encoder-langtok src --decoder-langtok --criterion label_smoothed_cross_entropy --label-smoothing 0.2 --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' --lr-scheduler inverse_sqrt --lr 3e-05 --warmup-updates 2500 --max-update 40000 --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 --update-freq 2 --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints --seed 222 --log-format simple --log-interval 2 --patience 10 --arch transformer_wmt_en_de_big --encoder-layers 12 --decoder-layers 12 --encoder-layerdrop 0.05 --decoder-layerdrop 0.05 --share-decoder-input-output-embed --share-all-embeddings --ddp-backend no_c10d

Environment

gmryu commented 2 years ago

The shape of encoder.embed_tokens.weight is (your srcdict line count) x --encoder-embed-dim (which is 1024 for your arch).

So you had better verify how many lines are inside the dict. The error log tells you the model was trained with a 128112-line srcdict, while it is now going to be trained with a 128104-line dict. Be careful: it is not --srcdict {path}/data_dict.128k.txt that matters, but the dicts inside your --destdir {path}/data_bin.

If you checked and the two src dicts are exactly the same, then look carefully at your fairseq-preprocess and fairseq-train logs. There is information about vocab sizes written inside. Verify them as well.

If you find anything weird that cannot be solved by re-running fairseq-preprocess, please give more details here.
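
If it helps, here is a minimal sketch (the paths are placeholders for your own setup, assuming the usual dict.{lang}.txt naming inside data_bin) that prints both numbers side by side: the line counts of the dicts fairseq-train will actually read, and the embedding size recorded in the pretrained checkpoint.

```python
import torch

def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

# The dictionaries fairseq-train actually uses live inside --destdir.
print("data_bin dict (en):", count_lines("data_bin/dict.en.txt"))
print("data_bin dict (fr):", count_lines("data_bin/dict.fr.txt"))
print("srcdict passed to preprocess:", count_lines("data_dict.128k.txt"))

# The pretrained checkpoint records the embedding it was trained with.
state = torch.load("models/m2m-418M/model.pt", map_location="cpu")
print("checkpoint embedding:", state["model"]["encoder.embed_tokens.weight"].shape)
```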

MathieuGrosso commented 2 years ago

Hello, first of all, thanks a lot for your quick answer.

So I have 128108 lines in the dict in data_bin, while there are 128000 lines in data_dict.128k.txt.

I know the model was trained with 128112 lines; what I don't understand is why using the data dict given by fairseq to preprocess the data does not work when I want to do the training.

The src and tgt dicts have the same number of lines, and when looking at the log I see: fairseq_cli.preprocess | [en] dictionary: 128004 types, which does not seem right.

I have seen this problem in other issues and the workaround was always to modify the data_dict given by fairseq, but I would like to use the exact dict fairseq used during training to reproduce some results.

I find it weird. I have been running experiments on this model for 3 months, so I am starting to understand how it works, but I can't figure out why the dict is not working.

For more details: I am using the exact command I mentioned, but the dict created does not match the expected size. If I extend the data dict then it works, but that feels weird too.

One last precision: I know that they have added a "model_dict", and I am wondering if that is what needs to be used for training instead of the data dict.

Thanks a lot, let me know if you need more details or precision, Mathieu

gmryu commented 2 years ago

Sorry, I am too busy to help you directly, and I am a little confused here as well.

So I have 128108 lines in the dict in data_bin, while there are 128000 lines in data_dict.128k.txt.

Why does --srcdict {path}/data_dict.128k.txt have 128000 lines? Shouldn't it be 128112, the same as the downloaded one, i.e. the pretrained one? Is this the first weird thing that happened?

The src and tgt dicts have the same number of lines, and when looking at the log I see: fairseq_cli.preprocess | [en] dictionary: 128004 types, which does not seem right.

What is this "the same number of lines"? You mean the srcdict has the same line count as the tgtdict, right? But preprocess could only find 128004 tokens? For both src and tgt?

For more details: I am using the exact command I mentioned, but the dict created does not match the expected size.

What do you mean by "the dict created"? You are not building a vocabulary from your data, right? You are using --srcdict {downloaded dict.txt from fairseq}, right? And the dict.txt inside data_bin is strange?

Do I understand correctly?

--

Assuming all my assumptions above are right, in my opinion you had better:

  1. A direct approach: debug {fairseq repository}/fairseq_cli/preprocess.py and {fairseq repository}/fairseq/data/dictionary.py

Those are where vocabs are instantiated at runtime.

  2. Maybe a clue? Write a Python script that compares the two dictionaries. First open one src vocab and use a Python dict to record all tokens (keys), one per line; add an `if key in dict` check so you notice duplicates before recording dict[key] = line index. Then open the other src vocab and, for each line: if the key is in the dict, delete dict[key], else print the key. That print will tell you the extra keys in the 2nd dict, and printing the remaining Python dict will tell you the extra keys found only inside the 1st dict (see the rough sketch below).
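
A rough sketch of that comparison (the file names dict_a.txt / dict_b.txt are placeholders for your two vocab files):

```python
def load_tokens(path):
    tokens = {}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f):
            if not line.strip():
                continue
            token = line.split()[0]  # fairseq dict lines look like "<token> <count>"
            if token in tokens:
                print("duplicate inside first dict:", token)
            tokens[token] = line_no
    return tokens

remaining = load_tokens("dict_a.txt")

with open("dict_b.txt", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        token = line.split()[0]
        if token in remaining:
            del remaining[token]          # present in both dicts
        else:
            print("only in second dict:", token)

print("only in first dict:", sorted(remaining))
```
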
MathieuGrosso commented 2 years ago

Hello, no worries, thanks a lot for your help, and don't worry if you don't have time to answer.

So first, I don't really know why the data dict given by fairseq has only 128000 lines. There is also a model dict of 128112 lines, so I have assumed for now that this is what needs to be used to train the model, unlike what is written in the README. What I have seen by comparing both dicts is that they are the same up to line 128000, but the last 112 lines in the model dict look like this: en 1, ro 1, ...

What is this "the same number of lines"? You mean the srcdict has the same line count as the tgtdict, right?

But preprocess could only find 128004 tokens? For both src and tgt?

Yes, that's exactly what I mean: both the tgt and src dicts have the same number of lines, but then there are just 128004 tokens (what preprocess found in the data dict).

What do you mean by "the dict created"? You are not building a vocabulary from your data, right?

You are using --srcdict {downloaded dict.txt from fairseq}, right? And the dict.txt inside data_bin is strange?

Yes, you have understood correctly: I am using the downloaded dict and creating dict.txt.$src and dict.txt.$tgt with the fairseq-preprocess command.


  1. I have looked at the Python files you mentioned, but I think the problem comes from the length of the dictionary given by fairseq.

  2. I will try what you suggest, since I believe it could really help, thanks.

In any case, thanks a lot. I will let you know if I manage to make it work properly.


Last comment: there are 2 dictionaries available to download. For generation, the preprocessing works well with data_dict.128k.txt and you have to add --fixed-dictionary model_dict.128k.txt when using fairseq-generate. On the same data this does not work with fairseq-train. I have tried two things:

  1. Using the --fixed-dictionary option with fairseq-train, but it does not work.
  2. Preprocessing the data with model_dict.128k.txt instead of data_dict, which seems to work, but I am not sure if that is the right way to do it.

Thanks, have a nice day.

gmryu commented 2 years ago

@MathieuGrosso Hi, I know what is wrong now. There are 2 ways of solving it.

  1. The easy way: use model_dict.128k.txt in all of your steps (preprocess, train, generate, interactive...).

If you use it in preprocess with --srcdict model_dict.128k.txt, then --fixed-dictionary is not necessary, and no special treatment is needed in the other steps either. It is absolutely not intended, but it works flawlessly with fairseq as it is now (see the command sketch below).
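
For example, this simply mirrors the preprocess command you shared, only swapping data_dict for model_dict (the {path} placeholders are still yours):

```
fairseq-preprocess --source-lang en --target-lang fr \
    --trainpref {path}/train.en-fr --validpref {path}/valid.en-fr \
    --thresholdsrc 0 --thresholdtgt 0 --destdir {path}/data_bin \
    --srcdict {path}/model_dict.128k.txt --tgtdict {path}/model_dict.128k.txt
```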

--

  2. The detailed story:

When you use the translation_multi_simple_epoch task, it calls MultilingualDatasetManager.prepare to load dict.txt. If you then look at that prepare, you will notice augment_dictionary(dictionary=d, ..., language_list=language_list, ...).

So the dict is altered at runtime: it receives special lang tokens like __en__ for English, __fr__ for French, etc., which makes the input dim = line count in the dict + <s> <pad> </s> <unk> + all lang tokens.

The confusing part is that m2m_100 is actually pretrained with 100 languages + 8 madeup words, regardless of that "100". (Why would they do that?) So the m2m_100 input dim = 128000 + 4 + 108 = 128112, as stated in your error log. Your --langs 'af,am,...,zh,zu' has 100 languages, which explains the 128104 written in your error, too. You have to add those 8 madeup words. Well, you do not have to use the exact madeup words found inside model_dict.128k.txt (madeupwordforbt and madeupword0000~0006, 8 in total); some random symbols that would never appear in any of your use cases will do fine.
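
A sketch of where the two numbers in your error log come from (same arithmetic as above, just spelled out):

```python
base_vocab  = 128000  # lines in data_dict.128k.txt
specials    = 4       # <s>, <pad>, </s>, <unk>
lang_tokens = 100     # one __xx__ token per language in --langs
madeupwords = 8       # madeupwordforbt + madeupword0000..madeupword0006

print(base_vocab + specials + lang_tokens + madeupwords)  # 128112: the checkpoint's embedding rows
print(base_vocab + specials + lang_tokens)                # 128104: your current model
```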

--

Actually, I found this by creating a tiny model (around 10M parameters): no error popped up except some tiny issues due to me using Windows and CPU. Then I read your error log again and confirmed it. (I suspected this last time, which is why I encouraged you to compare the dicts. I should have written down my assumption then, my fault.)

MathieuGrosso commented 2 years ago

Hello, thanks a lot for your answer, it really makes sense now.

I don't know why the model dict works flawlessly, but it does indeed coincide with what you said. I don't really get why fairseq decided to train m2m_100 with these madeup word tokens, but this is definitely what is missing: the difference is exactly 8 tokens. So it makes sense to add them to the data dict and to specify the langs.

I also had no issue training MBART50 with the same translation_multi_simple_epoch task, so yes, it comes from m2m_100. It would have been interesting to know why they added these tokens, but I don't think it would change anything.

Your answer really helped, and no worries about the assumption. Checking the dict lengths allowed me to understand the difference between model_dict and data_dict.

Have a nice day! I will close the issue :)