Adding an unseen language to NLLB

sete-nay commented 1 year ago

Hi. I'm trying to fine-tune NLLB on a new, unseen language, following the steps from here and the README. My source language is part of NLLB-200, but the target language is not. No other language from the same family is included either, so there are no related languages I can refer to. What should I set as the target language? Can you point me to example code for adding an unseen language to NLLB?

Thank you!

DROP=0.1 python examples/nllb/modeling/train/train_script.py \
    cfg=nllb200_dense3.3B_finetune_on_fbseed \
    cfg/dataset=$DATA_CONFIG \
    cfg.dataset.lang_pairs="$SRC-$TGT" \
    cfg.fairseq_root=$(pwd) \
    cfg.output_dir=$OUTPUT_DIR \
    cfg.dropout=$DROP \
    cfg.warmup=10 \
    cfg.finetune_from_model=$MODEL_FOLDER/checkpoint.pt

bt2901 commented 1 year ago

I second this. I am also interested in the steps required to add a new language to the NLLB model.

avidale commented 4 months ago

Hi! As the Fairseq code for NLLB is not very actively supported, my recipe for adding a new language to the Huggingface implementation of NLLB might be relevant: https://cointegrated.medium.com/a37fc706b865.
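
For readers wondering what "adding a language" involves conceptually, here is a minimal toy sketch in plain Python. It is not taken from this thread or from the linked recipe. It assumes the general approach of registering a new language tag in the vocabulary and giving it an embedding row, initialized either from a donor language or, when no related language exists (as in the question above), from the mean of the existing rows. The function `add_language_tag` and the tag `xxx_Latn` are hypothetical names used only for illustration.

```python
def add_language_tag(vocab, embeddings, new_tag, donor_tag=None):
    """Register new_tag in a toy vocabulary and give it an embedding row.

    vocab:      dict mapping token -> row index in `embeddings`
    embeddings: list of rows, each a list of floats of equal length
    donor_tag:  optional existing tag whose row is copied as the starting
                point; if None, the mean of all existing rows is used.
    Returns the row index assigned to new_tag.
    """
    if new_tag in vocab:
        # Tag already registered: nothing to do.
        return vocab[new_tag]
    if donor_tag is not None:
        # Copy (not alias) the donor row so later updates stay independent.
        row = list(embeddings[vocab[donor_tag]])
    else:
        # No related language available: start from the average embedding.
        dim = len(embeddings[0])
        count = len(embeddings)
        row = [sum(r[i] for r in embeddings) / count for i in range(dim)]
    vocab[new_tag] = len(embeddings)
    embeddings.append(row)
    return vocab[new_tag]


# Toy data: two ordinary tokens and two existing language tags.
vocab = {"<pad>": 0, "hello": 1, "eng_Latn": 2, "fra_Latn": 3}
embeddings = [[0.0, 0.0], [1.0, 2.0], [0.5, -0.5], [0.25, 0.75]]

# "xxx_Latn" is a hypothetical code for the unseen target language.
new_id = add_language_tag(vocab, embeddings, "xxx_Latn", donor_tag="fra_Latn")
```

In the Hugging Face `transformers` library, the roughly corresponding steps would be `tokenizer.add_special_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, after which the new row can be initialized and the model fine-tuned on the parallel data. Whether that matches the linked recipe in detail is an assumption.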

facebookresearch / fairseq

Adding an unseen language to NLLB #4841