fe1ixxu / ALMA

State-of-the-art LLM-based translation models.
MIT License

[Question] Suggested machine and GPUs to run the training #6

Closed by alvations 9 months ago

alvations commented 9 months ago

In the https://github.com/fe1ixxu/ALMA#training- section, there are instructions for running the training.

Is there some guidance on which machines and GPUs the various fine-tuning steps should run on? E.g. with 8 * A100, 8 * V100 (32G), or A10g?

Do the training/fine-tuning scripts also work on multi-node instances?

fe1ixxu commented 9 months ago

Hi, thanks for the interest!

> How many steps should the training run for?

Our 7B and 13B models are trained on 20B and 12B tokens, respectively. However, as indicated in the paper, fine-tuning on 1B tokens should already boost performance substantially. The number of steps required to fine-tune on 1 billion tokens depends on your batch size. In our case, the effective batch size is 16 GPUs × 4 (batch size per GPU) × 4 (gradient accumulation steps) = 256. With a sequence length of 512, we need approximately 10^9 / (256 × 512) ≈ 8,000 steps to train on 1 billion tokens. However, you may choose to fine-tune for more steps to get better performance.
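For a quick sanity check, the same arithmetic can be reproduced in a few lines of shell; the variable names below are just placeholders for your own GPU count, per-GPU batch size, gradient accumulation steps, and sequence length:

```bash
# Effective batch: 16 GPUs x 4 per-GPU batch x 4 grad-accum steps = 256 sequences/step.
GPUS=16; PER_GPU_BATCH=4; GRAD_ACCUM=4; SEQ_LEN=512
EFFECTIVE_BATCH=$(( GPUS * PER_GPU_BATCH * GRAD_ACCUM ))   # 256 sequences per optimizer step
TOKENS_PER_STEP=$(( EFFECTIVE_BATCH * SEQ_LEN ))           # 131072 tokens per step
echo $(( 10**9 / TOKENS_PER_STEP ))                        # ~7629, i.e. roughly 8,000 steps for 1B tokens
```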

> What machines and GPUs should the fine-tuning run on?

- 8 * A100
- 8 * V100 (32G), but only for 7B models (not recommended for full-weight fine-tuning)

> A10g

We haven't tested ALMA on the A10g yet, but we'll update the information once we do!

> Do the training/fine-tuning scripts also work on multi-node instances?

Yes, our code is compatible with cross-node training! Just ensure you've configured the DeepSpeed settings correctly for cross-node functionality and have installed the required launcher, such as pdsh, openmpi, or mvapich.
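For illustration, a minimal cross-node launch might look like the sketch below. The hostnames, slot counts, and the `ds_config.json` path are placeholders rather than this repo's actual file names, and the `run_llmmt.py` arguments are elided; the real flags come from the training scripts in the repository.

```bash
# Hypothetical two-node hostfile: hostnames and slot counts are placeholders.
cat > hostfile <<'EOF'
worker-1 slots=8
worker-2 slots=8
EOF

# pdsh is DeepSpeed's default multi-node launcher; openmpi or mvapich work similarly.
# ds_config.json is a placeholder path and the training arguments are elided.
deepspeed --hostfile hostfile --launcher pdsh \
  run_llmmt.py --deepspeed ds_config.json ...
```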

vhientran commented 3 months ago

Hi @fe1ixxu, sorry for disturbing you. My local machine has 8 * A100 GPUs with 40GB each. I pretrained LLaMA-2 7B using the same settings as yours, but it hit a loss-scale overflow error as below:


```
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
Traceback (most recent call last):
  File "run_llmmt.py", line 227, in <module>
    main()
  File "run_llmmt.py", line 176, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
```

How can I solve this problem? Thank you very much!