cokuehuang closed this issue 1 year ago.
You cannot use iwslt14 data prepared that way with an NLLB model. Fairseq models are bound to one or two vocabulary files (`dict.txt`): the line count of `dict.txt` plus the special tokens determines a model's input and output feature sizes. `prepare-iwslt14.sh` builds its data with a `dict.txt` made for an iwslt14 model, so you need to prepare your data with NLLB's own `dict.txt` (covering its full vocabulary) instead.
The command may look like `fairseq-preprocess {iwslt data} --srcdict {nllb vocab} --joined-dictionary ...`. I suggest you read NLLB's page again.
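Spelled out, such a preprocessing call might look like the sketch below. The paths and language codes are placeholders (this assumes your iwslt14 text has already been encoded with NLLB's SentencePiece model, so every token exists in the NLLB `dict.txt`):

```shell
# Hypothetical paths -- adjust to your setup.
# Input text must already be SentencePiece-encoded with NLLB's model,
# so that every token is present in the NLLB dictionary.
fairseq-preprocess \
    --source-lang deu_Latn --target-lang eng_Latn \
    --trainpref data/spm.train --validpref data/spm.valid --testpref data/spm.test \
    --srcdict nllb_models/dictionary.txt \
    --joined-dictionary \
    --destdir data-bin/iwslt14.nllb
```

Passing `--srcdict` together with `--joined-dictionary` reuses the NLLB vocabulary for both source and target instead of building a new one from the corpus.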
@gmryu Thanks for your reply! I skipped the data preparation for a quick finetuning test, since I don't have my data at the moment. Another question: if NLLB's `dict.txt` doesn't contain some words in my data, can I add new words to NLLB's `dict.txt` and finetune based on that new `dict.txt`?
@cokuehuang The total vocabulary size must stay the same because, as I wrote before, a model's input and output feature sizes are determined by the given vocabulary size. If the vocabulary size differs, you get an error: a mismatch between the loaded weights and the initialized weights.
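As a toy illustration of that relationship (assuming the 4 default special symbols `<s>`, `<pad>`, `</s>`, `<unk>` that fairseq's `Dictionary` prepends), the embedding-table row count follows directly from the `dict.txt` line count:

```shell
# Build a toy 3-line dict.txt (one "token count" pair per line, as fairseq expects)
printf 'hello 10\nworld 7\nfoo 3\n' > dict.toy.txt

# fairseq's Dictionary prepends 4 special symbols (<s>, <pad>, </s>, <unk>),
# so the model's embedding table gets (line count + 4) rows.
lines=$(wc -l < dict.toy.txt)
echo "embedding rows: $((lines + 4))"   # prints "embedding rows: 7"
```

So adding even one new word to `dict.txt` changes this row count, and a checkpoint saved with the old vocabulary no longer matches the newly initialized embedding shapes.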
So, in other words, you can
Hello,
Can anyone please share the code/notebook of NLLB finetuned if possible?
Hi, I would appreciate it if I can have the code/notebook of NLLB finetuned too.
Did you get the code for finetuning the NLLB model?
❓ Questions and Help
I want to test finetuning the NLLB 3.3B model. I followed the "Finetuning NLLB models" doc with this command:
```shell
python fairseq/examples/nllb/modeling/train/train_script.py \
    cfg=nllb200_dense3.3B_finetune_on_fbseed \
    cfg/dataset=fairseq/examples/nllb/modeling/train/conf/cfg/dataset/fbseed_chat.yaml \
    cfg.dataset.lang_pairs="deu_Latn-eng_Latn" \
    cfg.fairseq_root=fairseq \
    cfg.output_dir=nllb_fine_tuned \
    cfg.dropout=0.1 \
    cfg.warmup=10 \
    cfg.finetune_from_model=nllb_models/model_3B/checkpoint.pt
```
fbseed_chat.yaml is as follows:

```yaml
defaults:
  - default
dataset_name: "fbseed_chat"
num_shards: 1
langs_file: "examples/nllb/modeling/scripts/flores200/langs.txt"
lang_pairs: "deu_Latn-eng_Latn"
data_prefix:
  localcluster: fairseq/data-bin/iwslt14.tokenized.de-en
```

The files in the data folder are:

```
dict.deu_Latn.txt   dict.eng_Latn.txt
test.de-en.de.bin   test.de-en.de.idx   test.de-en.en.bin   test.de-en.en.idx
train.de-en.de.bin  train.de-en.de.idx  train.de-en.en.bin  train.de-en.en.idx
valid.de-en.de.bin  valid.de-en.de.idx  valid.de-en.en.bin  valid.de-en.en.idx
```

The data files were produced by fairseq/examples/translation/prepare-iwslt14.sh.
After executing the finetuning command, I get this error:
```
Traceback (most recent call last):
  File "./slurm_snapshot_code/2022-09-05T09_08_29.058828/train.py", line 14, in <module>
    cli_main()
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq_cli/train.py", line 634, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/distributed/utils.py", line 371, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/distributed/utils.py", line 345, in distributed_main
    main(cfg, **kwargs)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq_cli/train.py", line 113, in main
    model = fsdp_wrap(task.build_model(cfg.model))
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/tasks/translation_multi_simple_epoch.py", line 246, in build_model
    return super().build_model(args, from_checkpoint)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/tasks/fairseq_task.py", line 694, in build_model
    model = models.build_model(args, self, from_checkpoint)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/models/__init__.py", line 107, in build_model
    return model.build_model(cfg, task)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/models/transformer/transformer_legacy.py", line 112, in build_model
    raise ValueError("--share-all-embeddings requires a joined dictionary")
ValueError: --share-all-embeddings requires a joined dictionary
```
Need Help! Thanks very much!