facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Finetuning NLLB models with error "ValueError: --share-all-embeddings requires a joined dictionary", need help! #4697

Closed cokuehuang closed 1 year ago

cokuehuang commented 2 years ago

❓ Questions and Help

I want to test finetuning NLLB models (3.3B). I followed the doc in Finetuning NLLB models and ran this command:

```bash
python fairseq/examples/nllb/modeling/train/train_script.py \
    cfg=nllb200_dense3.3B_finetune_on_fbseed \
    cfg/dataset=fairseq/examples/nllb/modeling/train/conf/cfg/dataset/fbseed_chat.yaml \
    cfg.dataset.lang_pairs="deu_Latn-eng_Latn" \
    cfg.fairseq_root=fairseq \
    cfg.output_dir=nllb_fine_tuned \
    cfg.dropout=0.1 \
    cfg.warmup=10 \
    cfg.finetune_from_model=nllb_models/model_3B/checkpoint.pt
```

`fbseed_chat.yaml` is as follows:

```yaml
defaults:
  - default

dataset_name: "fbseed_chat"
num_shards: 1
langs_file: "examples/nllb/modeling/scripts/flores200/langs.txt"
lang_pairs: "deu_Latn-eng_Latn"
data_prefix:
  localcluster: fairseq/data-bin/iwslt14.tokenized.de-en
```

The files in the data folder are as follows:

```
dict.deu_Latn.txt   dict.eng_Latn.txt
test.de-en.de.bin   test.de-en.de.idx   test.de-en.en.bin   test.de-en.en.idx
train.de-en.de.bin  train.de-en.de.idx  train.de-en.en.bin  train.de-en.en.idx
valid.de-en.de.bin  valid.de-en.de.idx  valid.de-en.en.bin  valid.de-en.en.idx
```

The data files were produced by `fairseq/examples/translation/prepare-iwslt14.sh`.

After executing the finetuning command, the error is:

```
Traceback (most recent call last):
  File "./slurm_snapshot_code/2022-09-05T09_08_29.058828/train.py", line 14, in <module>
    cli_main()
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq_cli/train.py", line 634, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/distributed/utils.py", line 371, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/distributed/utils.py", line 345, in distributed_main
    main(cfg, **kwargs)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq_cli/train.py", line 113, in main
    model = fsdp_wrap(task.build_model(cfg.model))
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/tasks/translation_multi_simple_epoch.py", line 246, in build_model
    return super().build_model(args, from_checkpoint)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/tasks/fairseq_task.py", line 694, in build_model
    model = models.build_model(args, self, from_checkpoint)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/models/__init__.py", line 107, in build_model
    return model.build_model(cfg, task)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/models/transformer/transformer_legacy.py", line 112, in build_model
    raise ValueError("--share-all-embeddings requires a joined dictionary")
ValueError: --share-all-embeddings requires a joined dictionary
```

Need Help! Thanks very much!


gmryu commented 2 years ago

You cannot use the iwslt14 data with an NLLB model. fairseq models are bound to one or two vocabulary files (`dict.txt`): the line count of `dict.txt` plus the special tokens determines the model's input feature size and output feature size.
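You can check that size yourself. A minimal sketch (the path is a placeholder; point it at whichever `dict.txt` you are using):

```python
# Minimal sketch: inspect how fairseq derives vocabulary size from a dict.txt.
# The path below is a placeholder for your own binarized data directory.
from fairseq.data import Dictionary

d = Dictionary.load("data-bin/iwslt14.tokenized.de-en/dict.de.txt")

# len(d) = number of lines in dict.txt + the special symbols (<s>, <pad>, </s>, <unk>).
# This is exactly the vocabulary dimension of the model's embedding matrices.
print(len(d), d.pad(), d.eos(), d.unk())
```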

So `prepare-iwslt14.sh` prepares the data with a `dict.txt` built for an iwslt14-sized model. You need to prepare your data with NLLB's own `dict.txt` (its full ~256k vocabulary) instead. The command may look like `fairseq-preprocess {iwslt data} --srcdict {nllb vocab} --joined-dictionary ...`. I suggest you read NLLB's page again.
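A rough sketch of that preprocessing step (all paths here are placeholders I made up, and the input text is assumed to be already encoded with NLLB's SentencePiece model):

```python
# Sketch only: binarize parallel text against NLLB's dictionary so that source
# and target share one joined vocabulary. All paths are placeholders.
import subprocess

nllb_dict = "nllb_models/dictionary.txt"  # assumed location of NLLB's dict.txt

subprocess.run(
    [
        "fairseq-preprocess",
        "--source-lang", "deu_Latn",
        "--target-lang", "eng_Latn",
        "--trainpref", "data/spm/train",   # SPM-encoded: train.deu_Latn / train.eng_Latn
        "--validpref", "data/spm/valid",
        "--srcdict", nllb_dict,
        "--joined-dictionary",             # reuse srcdict for both sides
        "--destdir", "data-bin/deu_Latn-eng_Latn",
    ],
    check=True,
)
```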

cokuehuang commented 2 years ago

@gmryu Thanks for your reply! I skipped the data preparation to quickly test finetuning, as I don't have my data at the moment. I have another question: if NLLB's vocabulary `dict.txt` doesn't contain some words in my data, can I add new words to NLLB's `dict.txt` and finetune based on this new `dict.txt`?

gmryu commented 2 years ago

@cokuehuang The total vocabulary size must stay the same because, as I wrote before, a model's input and output feature sizes are determined by the given vocabulary size. If the vocabulary size is different, you get an error about a mismatch between the loaded weights and the initialized weights.

So, in other words, you can

  1. alter words already in the `dict.txt` into other words. Words near the end of `dict.txt` are probably less frequent (no guarantee); I would alter those first.
  2. edit the checkpoint (check out "prune" or "distill"). A recent issue about editing a model is here: https://github.com/facebookresearch/fairseq/issues/4664. If you are interested in this, read it to the end. A rough sketch of this kind of checkpoint surgery follows below.
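For option 2, something like the following could work. This is only an illustration, not the exact procedure from #4664: it assumes you appended `n_new` words to the end of `dict.txt`, and the state-dict keys follow fairseq's usual transformer naming, so verify them against your own checkpoint first:

```python
# Hedged sketch: append rows to a checkpoint's (shared) embedding matrices so
# they match a dict.txt that had n_new words appended at the end.
import torch

ckpt = torch.load("nllb_models/model_3B/checkpoint.pt", map_location="cpu")
state = ckpt["model"]
n_new = 100  # number of words you appended to dict.txt

for key in (
    "encoder.embed_tokens.weight",
    "decoder.embed_tokens.weight",
    "decoder.output_projection.weight",  # present when embeddings are shared/tied
):
    if key in state:
        old = state[key]
        # Initialize the new rows like fairseq's transformer embeddings
        # (normal with std = embed_dim ** -0.5), then append them.
        extra = old.new_empty(n_new, old.size(1)).normal_(mean=0.0, std=old.size(1) ** -0.5)
        state[key] = torch.cat([old, extra], dim=0)

torch.save(ckpt, "nllb_models/model_3B/checkpoint_extended.pt")
```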
nithinreddyy commented 1 year ago

Hello,

Can anyone please share the code/notebook for finetuning NLLB, if possible?

kchatzi commented 1 year ago

Hi, I would appreciate it if I could have the code/notebook for finetuning NLLB too.

KarmaCST commented 1 year ago

> Hi, I would appreciate it if I could have the code/notebook for finetuning NLLB too.

Did you get the code for finetuning the NLLB model?