Closed — ghost closed this issue 11 months ago
Hi,
For the 70b model, training on 8 GPUs finishes in one day, so I would expect it to finish in about 4 days on 2 GPUs, provided you increase the GRAD_ACCUMULATION hyper-parameter here from 4 to 16 to maintain the same global batch size.
For the 7b model, it would be much faster, since you can significantly increase the per-GPU batch size.
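The batch-size bookkeeping above can be sketched as follows. This is just the arithmetic, not code from the repo; the per-GPU micro-batch size is a hypothetical value for illustration.

```python
def global_batch_size(num_gpus: int, per_gpu_batch: int, grad_accumulation: int) -> int:
    """Effective global batch size = GPUs x micro-batch x accumulation steps."""
    return num_gpus * per_gpu_batch * grad_accumulation

per_gpu = 4  # hypothetical per-GPU micro-batch size, not from the repo

# 8 GPUs with GRAD_ACCUMULATION=4 vs. 2 GPUs with GRAD_ACCUMULATION=16:
eight_gpu_setup = global_batch_size(8, per_gpu, 4)
two_gpu_setup = global_batch_size(2, per_gpu, 16)
assert eight_gpu_setup == two_gpu_setup  # same global batch size
print(eight_gpu_setup, two_gpu_setup)
```

Going from 8 GPUs to 2 cuts the GPU count by 4x, so multiplying GRAD_ACCUMULATION by 4x keeps the effective batch size (and hence the training dynamics) the same, at roughly 4x the wall-clock time.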
Could you please tell me how long it would take to train LLaMA-2 with Dromedary on 2 A100 GPUs?