Thanks for sharing your work! Can you please share more details about the training time for each stage and the resources you used?
Thank you for your interest!
We use 32 A100 GPUs for training. You can reduce the GPU requirements by using gradient accumulation and DeepSpeed ZeRO-3, though this will increase the training time.
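If it helps, here is a minimal sketch of what a DeepSpeed ZeRO-3 config with gradient accumulation could look like. The batch size and accumulation values below are illustrative only, not our exact settings, and should be tuned to your GPU count and memory:

```python
# Sketch: write a DeepSpeed ZeRO-3 config that trades per-GPU batch size
# for gradient accumulation steps to fit on fewer / smaller GPUs.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # illustrative; lower this to fit memory
    "gradient_accumulation_steps": 8,      # raise this to keep the effective global batch size
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                        # ZeRO-3: shard parameters, gradients, optimizer states
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True
    }
}

with open("zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Then point the training script's --deepspeed flag at zero3.json
# (assuming the script exposes one, as HuggingFace Trainer-based scripts typically do).
```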
In our experience, pretraining takes approximately 2 hours for 0.6 million samples. SFT requires about 24 hours for a 7B model and around 32 hours for a 13B model on 1.8 million samples.
Thanks for sharing the information. The A100 GPUs you used are the 80 GB variant, right?
Yes, that is correct.