fe1ixxu / ALMA

State-of-the-art LLM-based translation models.
MIT License

[Question] Suggested machine and GPUs to run the training #6

Closed by alvations 9 months ago

alvations commented 9 months ago

In the https://github.com/fe1ixxu/ALMA#training- section, there are instructions for running the training.

Is there some guidance on which machines and GPUs the various fine-tuning steps should run on? E.g. with 8 * A100, 8 * V100 (32G), or A10g?

Do the training/fine-tuning scripts also work on multi-node instances?

fe1ixxu commented 9 months ago

Hi, thanks for the interest!

> How many steps should the training run for?

Our 7B and 13B models are trained on 20B and 12B tokens, respectively. However, as indicated in the paper, fine-tuning on 1B tokens should already boost performance substantially. The number of steps required to fine-tune on 1 billion tokens depends on your batch size. In our case, the effective batch size is 16 GPUs × 4 (batch size per GPU) × 4 (gradient accumulation steps) = 256. With a sequence length of 512, we need approximately 10^9 / (256 × 512) ≈ 8,000 steps to train on 1 billion tokens. However, you may choose to fine-tune for more steps to get better performance.
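For a quick sanity check, the same arithmetic can be reproduced in a few lines of shell; the variable names below are just placeholders for your own GPU count, per-GPU batch size, gradient accumulation steps, and sequence length:

```bash
# Effective batch: 16 GPUs x 4 per-GPU batch x 4 grad-accum steps = 256 sequences/step.
GPUS=16; PER_GPU_BATCH=4; GRAD_ACCUM=4; SEQ_LEN=512
EFFECTIVE_BATCH=$(( GPUS * PER_GPU_BATCH * GRAD_ACCUM ))   # 256 sequences per optimizer step
TOKENS_PER_STEP=$(( EFFECTIVE_BATCH * SEQ_LEN ))           # 131072 tokens per step
echo $(( 10**9 / TOKENS_PER_STEP ))                        # ~7629, i.e. roughly 8,000 steps for 1B tokens
```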

> What machines and GPUs should the fine-tuning run on?

- 8 * A100
- 8 * V100 (32G), but only for 7B models (not recommended for full-weight fine-tuning)

> A10g

We haven't tested ALMA on the A10g yet, but we'll update the information once we do!

> Do the training/fine-tuning scripts also work on multi-node instances?

Yes, our code is compatible with cross-node training! Just ensure you've configured the DeepSpeed settings correctly for cross-node functionality and have installed the required launcher, such as pdsh, openmpi, or mvapich.
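For illustration, a minimal cross-node launch might look like the sketch below. The hostnames, slot counts, and the `ds_config.json` path are placeholders rather than this repo's actual file names, and the `run_llmmt.py` arguments are elided; the real flags come from the training scripts in the repository.

```bash
# Hypothetical two-node hostfile: hostnames and slot counts are placeholders.
cat > hostfile <<'EOF'
worker-1 slots=8
worker-2 slots=8
EOF

# pdsh is DeepSpeed's default multi-node launcher; openmpi or mvapich work similarly.
# ds_config.json is a placeholder path and the training arguments are elided.
deepspeed --hostfile hostfile --launcher pdsh \
  run_llmmt.py --deepspeed ds_config.json ...
```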

vhientran commented 3 months ago

Hi @fe1ixxu, sorry for disturbing you. My local machine has 8 * A100 GPUs with 40GB each. I pretrained LLaMA-2 7B using the same settings as yours, but it hit a loss-scale overflow error as below:


```
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
Traceback (most recent call last):
  File "run_llmmt.py", line 227, in <module>
    main()
  File "run_llmmt.py", line 176, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
```

How can I solve this problem? Thank you very much!