Hi, thanks for the interest!
Our 7B and 13B models are trained on 20B and 12B tokens, respectively. However, as indicated in the paper, fine-tuning on 1B tokens should already boost performance substantially. The number of steps required to fine-tune on 1 billion tokens depends on your batch size. In our case, the effective batch size is 16 GPUs × 4 (batch size per GPU) × 4 (gradient accumulation steps) = 256. With a sequence length of 512, we need approximately 10^9 / (256 × 512) ≈ 8,000 steps to train on 1 billion tokens. However, you may choose to fine-tune for more steps to get better performance.
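To make the arithmetic concrete, here is a small sketch of that calculation (a hypothetical helper written for this thread, not code from the repo):

```python
def steps_for_tokens(total_tokens: int, n_gpus: int, per_gpu_batch: int,
                     grad_accum: int, seq_len: int) -> int:
    """Approximate optimizer steps needed to consume `total_tokens`."""
    effective_batch = n_gpus * per_gpu_batch * grad_accum  # sequences per optimizer step
    tokens_per_step = effective_batch * seq_len            # assumes full 512-token sequences
    return round(total_tokens / tokens_per_step)

# The setting above: 16 GPUs x 4 (per-GPU batch) x 4 (grad accumulation) = 256
print(steps_for_tokens(10**9, n_gpus=16, per_gpu_batch=4, grad_accum=4, seq_len=512))
# -> 7629, i.e. roughly 8,000 steps
```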
Is there some guidance on what machines and GPUs the various fine-tuning steps should run on? E.g. with an A10g? Do the training/fine-tuning scripts also work on multi-node instances?

In the https://github.com/fe1ixxu/ALMA#training- section, there are instructions for running the scripts in the `runs` folder, where the settings are the same. Training will save the best model checkpoint based on the lowest loss on the dev set. We haven't tested ALMA on the A10g yet, but we'll update the information once we do!
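For reference, best-checkpoint selection of this kind is usually expressed through the standard Hugging Face TrainingArguments flags below; this is a minimal sketch with illustrative values, not the repository's exact configuration:

```python
from transformers import TrainingArguments

# Minimal sketch of the standard flags that implement "keep the checkpoint
# with the lowest dev loss"; intervals and paths are illustrative.
args = TrainingArguments(
    output_dir="./out",              # illustrative path
    evaluation_strategy="steps",     # evaluate on the dev set periodically
    eval_steps=500,                  # illustrative interval
    save_strategy="steps",           # saving must align with evaluation
    save_steps=500,
    load_best_model_at_end=True,     # reload the best checkpoint after training
    metric_for_best_model="loss",    # "loss" = eval loss
    greater_is_better=False,         # lower loss is better
)
```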
Yes, our code is compatible with cross-node training! Just ensure you've configured the DeepSpeed settings correctly for cross-node functionality and have installed the required launcher, such as pdsh, openmpi, or mvapich.
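As an illustration, a generic two-node launch with the stock deepspeed runner could look like the sketch below. The hostnames are hypothetical, and the entry point and training arguments should be taken from the scripts in the `runs` folder:

```
# hostfile: one line per node; "slots" = number of GPUs on that node
node-0 slots=8
node-1 slots=8
```

```bash
# pdsh (the default DeepSpeed launcher) must be installed, with passwordless
# SSH between nodes; <training args> stands in for the arguments used by
# the single-node scripts in runs/.
deepspeed --hostfile hostfile --launcher pdsh run_llmmt.py <training args>
```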
Hi @fe1ixxu , sorry for disturbing you. My local machine has 8 A100 GPUs with 40GB each. I pretrained LLaMA-2 7B using the same settings as yours, but the run failed with the loss-scale overflow error below:

Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
Traceback (most recent call last):
  File "run_llmmt.py", line 227, in <module>
    main()
  File "run_llmmt.py", line 176, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
How can I solve this problem? Thank you very much!
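For context, this exception comes from DeepSpeed's dynamic loss scaler: every fp16 overflow halves the loss scale, and the run aborts once the scale reaches its configured floor. The sketch below annotates the relevant fp16 section of a DeepSpeed config, written as a Python dict so the fields can be commented; the field names are from DeepSpeed's documented fp16 schema, and the values are illustrative, not ALMA's settings:

```python
# Relevant fp16 section of a DeepSpeed config; values are illustrative.
fp16_section = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 = dynamic loss scaling
        "initial_scale_power": 16,  # dynamic scale starts at 2**16
        "loss_scale_window": 1000,  # overflow-free steps before raising the scale
        "hysteresis": 2,            # overflows tolerated before lowering the scale
        "min_loss_scale": 1,        # the floor the error message refers to
    }
}
```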