fe1ixxu / ALMA

State-of-the-art LLM-based translation models.
MIT License

What do I need to add a new language? #8

Closed by MohamedAliRashad 12 months ago

MohamedAliRashad commented 12 months ago

First of all, thank you for this great project. My question is simple: what do I need to do to make ALMA learn a new language (Arabic)?

fe1ixxu commented 12 months ago

Hi, thanks for the interest! The answer is quite straightforward. Simply follow the instructions linked here. In the mono_ft.sh bash file, add ar to the list after --oscar_data_lang and put the sampling probabilities you intend to use after --interleave_probs. For instance, if your goal is to fine-tune the model on an equal mix of 50% English (to prevent the model from forgetting English) and 50% Arabic, you would proceed as follows:

....
--oscar_data_lang en,ar \
--interleave_probs 0.5,0.5 \
....
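
To make the meaning of the sampling probabilities concrete, here is a minimal Python sketch of probability-based interleaving in general. It is not ALMA's data loader: the function name interleave, the toy en/ar streams, and the seed are all hypothetical and exist only for illustration. The sketch assumes one probability per language, in the same order as the language list, and that the probabilities sum to 1.

```python
import itertools
import random
from collections import Counter


def interleave(streams, probs, n, seed=0):
    """Draw n examples, choosing the source stream of each draw according to probs.

    streams: dict mapping a language code to an iterable of examples.
    probs:   one sampling probability per language, in the dict's order.
    (Hypothetical helper for illustration; not part of ALMA.)
    """
    if len(streams) != len(probs):
        raise ValueError("need exactly one probability per language")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("sampling probabilities must sum to 1")

    rng = random.Random(seed)
    langs = list(streams)
    iterators = {lang: iter(streams[lang]) for lang in langs}

    mixed = []
    for _ in range(n):
        # Pick which language the next training example comes from.
        lang = rng.choices(langs, weights=probs, k=1)[0]
        mixed.append((lang, next(iterators[lang])))
    return mixed


def toy_stream(lang):
    """Endless stand-in for a monolingual corpus (made-up data)."""
    return (f"{lang}-sentence-{i}" for i in itertools.count())


if __name__ == "__main__":
    streams = {"en": toy_stream("en"), "ar": toy_stream("ar")}
    mixed = interleave(streams, [0.5, 0.5], n=1000)
    # Roughly half of the drawn examples should come from each language.
    print(Counter(lang for lang, _ in mixed))
```

With 0.5,0.5 the mix is even only in expectation, so the exact counts vary with the seed. Shifting the values (for example toward Arabic) would change how often each corpus is sampled, at the cost of seeing less English.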

You can replace meta-llama/Llama-2-7b-hf with haoranxu/ALMA-7B-Pretrain or haoranxu/ALMA-13B-Pretrain to start from our models. Note that after this monolingual fine-tuning, the model is not yet a translation model. You may still need to fine-tune it on parallel sentences: https://github.com/fe1ixxu/ALMA#parallel-data-fine-tuning-full-weight.