LamOne1 opened 1 year ago
Is #devices here the number of GPUs per node or the total number (world_size)?
Per node
I also checked the finetuning script and found that it's calculated differently: num_epochs * epoch_size // micro_batch_size // devices
That was a minor inconsistency. Fixed in #332
Finally, can you please elaborate more on the value of micro_batch_size?
Please see this explanation https://github.com/Lightning-AI/lit-llama/issues/286#issuecomment-1552608587 and the DeepSpeed docs
Shouldn't the value be divided by the world_size instead of the number of GPUs per node?
The variable name is batch_size, but it is used as a process_batch_size, since it only serves to define gradient_accumulation_iters. I guess it could be renamed to make this clearer.
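A minimal sketch of how those quantities relate; the concrete numbers below are illustrative placeholders, not values from the script:

```python
# Illustrative values only; the real script reads these from its hyperparameters.
batch_size = 125        # configured "global" batch size
devices = 5             # what the script currently divides by (GPUs per node)
micro_batch_size = 5    # samples fed through the model per forward/backward pass

# batch_size is divided down immediately, so it effectively behaves as a
# per-process batch size from here on.
process_batch_size = batch_size // devices                            # 25
gradient_accumulation_iters = process_batch_size // micro_batch_size  # 5
print(process_batch_size, gradient_accumulation_iters)
```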
https://www.deepspeed.ai/docs/config-json/#batch-size-related-parameters describes that the global batch size "can be omitted if both train_micro_batch_size_per_gpu and gradient_accumulation_steps are provided".
Thoughts on this @awaelchli?
The variable name is batch_size but it is used as process_batch_size as it is used only to define the gradient_accumulation_iters.
The batch size is the global batch size here, and the micro-batch size is the size of the batch of samples that actually gets assembled and fed through the model. Afaik this corresponds with the terminology others use, too, and with how it is described in the paper, so I highly recommend keeping it this way. The user is free to redefine it. For example, sometimes it is easier to define the per-device batch size and compute the global one as a function of it. It depends on what is more practical.
If an example helps: in the case of LLaMA, whose paper reports a batch size of 4M, we would set the value of batch_size in redpajama.py to 4M. Then, depending on the GPU memory, the user would tune micro_batch_size to a value that maximizes throughput.
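To make that concrete as a back-of-the-envelope sketch (only the 4M batch size comes from the thread; the GPU count and micro-batch size below are made-up values for illustration):

```python
# Hypothetical multi-node setup for the 4M example; world_size and
# micro_batch_size are assumptions, not numbers from the paper.
batch_size = 4_000_000   # global batch size from the LLaMA example above
world_size = 2048        # total number of GPUs across all nodes (assumed)
micro_batch_size = 4     # tuned to GPU memory to maximize throughput (assumed)

process_batch_size = batch_size // world_size                         # 1953
gradient_accumulation_iters = process_batch_size // micro_batch_size  # 488
print(process_batch_size, gradient_accumulation_iters)
```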
@LamMoh1 The max_iters value is an arbitrary number that we inherited from nanoGPT. Realistically, we would set it to infinity and keep training until convergence (or, for multi-billion-parameter models, until we run out of compute budget). But nobody has done that, so these parameters need to be tuned a bit.
Shouldn't the value be divided by the world_size instead of the number of GPUs per node?
Yes.
Dear @carmocca and @awaelchli,
Thank you so much! Your answer was incredibly helpful and I really appreciate you taking the time to explain things in such a clear and concise way.
Shouldn't the value be divided by the world_size instead of the number of GPUs per node?
Yes.
I think the training code can be updated to change the number of devices to the world size:
process_batch_size = batch_size // fabric.world_size
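A minimal, self-contained sketch of that change; a plain stand-in class replaces the real Lightning Fabric object so the snippet runs on its own (in the actual script, fabric.world_size comes from the launched Fabric instance):

```python
# Stand-in for the real Fabric object: e.g. 2 nodes x 4 GPUs = 8 processes.
class FakeFabric:
    world_size = 8  # total number of processes across all nodes

fabric = FakeFabric()
batch_size = 64  # illustrative global batch size

# Before (per node):  process_batch_size = batch_size // devices
# After (global), as proposed above:
process_batch_size = batch_size // fabric.world_size  # 8
print(process_batch_size)
```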
I asked about the number of iterations because I want to know when the training will complete one epoch, i.e., whether the model has seen all the data.
Is #devices here the number of GPUs per node or the total number (world_size)?
Per node
@carmocca, can you please explain why it is per node and not the world size?
Hi,
I'm using multi-node training and I need to know how to calculate the hyperparameter values in the train_redpajama script. Can you please elaborate more on how to set these values?
Here are the specific values I'm confused about:
max_iters = 600000 # num_epochs * epoch_size // devices
1) Is #devices here the number of GPUs per node or the total number (world_size)?
2) I also checked the finetuning script and found that it's calculated differently: num_epochs * epoch_size // micro_batch_size // devices
I see that the batch size will be divided by the number of devices later in the code:
process_batch_size = batch_size // devices
3) Shouldn't the value be divided by the world_size instead of the number of GPUs per node?
4) Finally, can you please elaborate more on the value of micro_batch_size?

Thank you,