Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Gradient Accumulation Step under Multi-node Pretraining #1474

Open SHUMKASHUN opened 3 weeks ago

SHUMKASHUN commented 3 weeks ago

@awaelchli I found that in pretrain.py, the gradient accumulation steps are calculated from the global batch size, the device count, and the micro batch size. This works fine in a single-node setting, e.g. global batch size = 1024, devices = 8, micro batch size = 16, which gives a gradient accumulation step of 1024 / 8 / 16 = 8. However, it seems the script does not consider the multi-node setting? If I use two nodes to train, the gradient accumulation step is still 8 (it still treats devices = 8). I am wondering whether I should manually change this accumulation step in the code? Thank you for any suggestions.
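
(The computation in question is roughly the following; this is a minimal sketch with illustrative names, not the exact litgpt code. Note that only the per-node device count enters, so adding nodes does not change the result.)

```python
# Sketch of how the accumulation steps are derived (names are illustrative,
# not the exact litgpt API).
def gradient_accumulation_iters(global_batch_size: int, devices: int, micro_batch_size: int) -> int:
    batch_size_per_device = global_batch_size // devices             # 1024 // 8  = 128
    accumulation_iters = batch_size_per_device // micro_batch_size   # 128 // 16  = 8
    assert accumulation_iters > 0, "global_batch_size too small for this device count"
    return accumulation_iters

# Single node, 8 GPUs: 1024 / 8 / 16 = 8
print(gradient_accumulation_iters(1024, 8, 16))  # 8
# Two nodes, 8 GPUs each: the function still only sees devices=8 per node, so it still returns 8.
```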

rasbt commented 3 weeks ago

Good question. Intuitively, I'd say that's a good point. @awaelchli, what are your thoughts here? I think you have some experience running pretraining on multi-node setups.

awaelchli commented 2 weeks ago

The global_batch_size is global across all devices in a machine, i.e. it is per machine. We did this out of convenience so that you can first optimize your training for a single node and then scale out to multiple nodes without having to change much else. The alternative would be to make global_batch_size global across all devices on all nodes and recompute the other values based on that.

In my view, the second approach has more practical disadvantages than the first. For example, I would find it very annoying to choose a value for global batch size that is evenly divisible by the number of devices and micro batch size.
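
A quick worked example of the two conventions for 2 nodes x 8 GPUs (illustrative arithmetic only, not litgpt code):

```python
# Comparing the two conventions for global_batch_size with 2 nodes x 8 GPUs
# and micro_batch_size = 16.
micro_batch_size = 16
devices_per_node = 8
num_nodes = 2

# Convention used here: global_batch_size is per machine.
global_batch_size = 1024
accumulation_iters = global_batch_size // devices_per_node // micro_batch_size   # 8
effective_batch = global_batch_size * num_nodes                                  # 2048 samples per optimizer step

# Alternative convention: global_batch_size spans all devices on all nodes.
# The same value then has to be divisible by num_nodes * devices_per_node * micro_batch_size.
accumulation_iters_alt = global_batch_size // (num_nodes * devices_per_node) // micro_batch_size  # 4
effective_batch_alt = global_batch_size                                          # 1024 samples per optimizer step

print(accumulation_iters, effective_batch)          # 8 2048
print(accumulation_iters_alt, effective_batch_alt)  # 4 1024
```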

If the current name is a problem, we can also rename the variable.

SHUMKASHUN commented 2 weeks ago

Thank you so much for the explanation. It would be good to add a note in the readme that this global_batch_size is per machine, because people may easily keep global_batch_size unchanged when extending to multiple machines.

awaelchli commented 2 weeks ago

Yes, I agree. We could mention it here at least: https://github.com/Lightning-AI/litgpt/blob/76c88950f8bdb59f87ad6a870409f655956e725b/litgpt/args.py#L16-L17 Would you like to do it?

SHUMKASHUN commented 2 weeks ago

Maybe add an extra line in the config YAML files?

yuzc19 commented 2 days ago

Hi @SHUMKASHUN, thanks for this great question. If I understand correctly, if I am using 8 nodes, the global_batch_size needs to be divided by 8 relative to the one-node setting to achieve similar performance at the same step, right?

awaelchli commented 1 day ago

> if I am using 8 nodes, the global_batch_size needs to be divided by 8 relative to the one-node setting to achieve similar performance at the same step, right?

Right, yes: if you just want to get exactly the same results on 8 nodes vs. 1 node, then you would do that. But of course there is no practical benefit to that, because you would use 8x more resources and not get any speedup in training compared to 1 node.

yuzc19 commented 1 day ago

> > if I am using 8 nodes, the global_batch_size needs to be divided by 8 relative to the one-node setting to achieve similar performance at the same step, right?
>
> Right, yes: if you just want to get exactly the same results on 8 nodes vs. 1 node, then you would do that. But of course there is no practical benefit to that, because you would use 8x more resources and not get any speedup in training compared to 1 node.

Thank you! I think it will speed up, since the gradient accumulation iters will also be divided by 8 in this case, so for one optimization step each node goes through fewer micro-batch iterations.
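
A quick back-of-the-envelope check of that point (illustrative arithmetic only, not the exact litgpt code; it assumes 8 GPUs per node and micro_batch_size = 16 as in the example above):

```python
# Scaling out to 8 nodes while dividing global_batch_size by 8 keeps the samples
# per optimizer step constant but cuts the accumulation iterations (and hence the
# micro-batch iterations per device per step) by 8x.
micro_batch_size = 16
devices_per_node = 8

def per_step_stats(global_batch_size: int, num_nodes: int):
    accumulation_iters = global_batch_size // devices_per_node // micro_batch_size
    samples_per_optimizer_step = accumulation_iters * micro_batch_size * devices_per_node * num_nodes
    return accumulation_iters, samples_per_optimizer_step

print(per_step_stats(1024, num_nodes=1))  # (8, 1024): baseline on one node
print(per_step_stats(128, num_nodes=8))   # (1, 1024): same effective batch, 1/8 the iterations per device
```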