fe1ixxu / ALMA

State-of-the-art LLM-based translation models.
MIT License
440 stars 35 forks

[Question] Replicating ALMA by training from scratch #34

Closed by alvations 8 months ago

alvations commented 8 months ago

Thank you for sharing the ALMA / ALMA-R models and fine-tuning scripts!

Hypothetically, if we want to replicate the ALMA experiments by training it from scratch, how can we do it?

We have some questions on how we can train an ALMA model from scratch:

  1. Which monolingual datasets were used to train the original ALMA model?
  2. Which bitext datasets were used for the LoRA fine-tuning?
  3. Were any of the previous WMT datasets (especially dev/test sets) used in the released ALMA / ALMA-R models?
  4. In mono_ft.sh, the model defaults to AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf'). If we were to train an ALMA from scratch, we wouldn't really have to change anything in mono_ft.sh other than the data from question (1), is that correct? (See the sketch below for what we mean.)
  5. And if we want to train even the base Llama model itself from scratch, we would have to do something like https://discuss.huggingface.co/t/how-does-one-reinitialize-the-weights-of-a-hugging-face-llama-v2-model-the-official-way-as-the-original-model/62547, is that right?
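For context on (4), the loading we are referring to looks roughly like the snippet below. This is only our sketch of the default behaviour using the Hugging Face transformers API, not the exact contents of mono_ft.sh:

```python
# Our understanding of the default code path (a sketch, not mono_ft.sh itself):
# continued pretraining starts from the released Llama-2 checkpoint,
# i.e. the weights are loaded rather than randomly initialized.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
```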

Thank you in advance for the answers!

fe1ixxu commented 8 months ago

Hi, thanks for your interest, and sorry about the delayed response.

  1. The monolingual data used is the OSCAR data, as shown here.
  2. The bitext datasets used for LoRA fine-tuning are located here.
  3. The bitext data used by ALMA for training is the WMT'17-20 test data plus the Flores200 dev and test data.
  4. If you want to train a new ALMA from scratch, you do not need to change anything in mono_ft.sh.
  5. I am not sure, but the simplest way is to run AutoModelForCausalLM.from_config(config): https://github.com/fe1ixxu/ALMA/blob/b92304c6548b7d0af0cdadca9d63c07c70d19cd5/utils/utils.py#L364 (a quick sketch is below).
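For (5), a minimal sketch of that from_config route, assuming the goal is a randomly initialized model with the Llama-2-7B architecture (this is only an illustration, not the exact code at the link above):

```python
# Minimal sketch: build a Llama-2-7B-shaped model from its config so the
# weights are randomly initialized instead of loaded from the checkpoint.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_config(config)  # no pretrained weights loaded
```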
alvations commented 8 months ago

Thank you for the clarification!!