Lightning-AI / lit-llama

Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4-bit quantization, LoRA and LLaMA-Adapter fine-tuning, and pre-training. Apache 2.0-licensed.

explanation of the hyperparameters in the pretraining script #330

Open LamOne1 opened 1 year ago

LamOne1 commented 1 year ago

Hi,

I'm using multi-node training and I need to know how to calculate the hyperparameter values in the train_redpajama script. Can you please elaborate more on how to set these values?

Here are the specific values I'm confused about:

max_iters = 600000 # num_epochs * epoch_size // devices

1) Is #devices here the number of GPUs per node or the total number (world_size)?

2) I also checked the finetuning script and found that it's calculated differently: num_epochs * epoch_size // micro_batch_size // devices

I also see that the batch size is divided by the number of devices later in the code: process_batch_size = batch_size // devices

3) Shouldn't the value be divided by the world_size instead of the number of GPUs per node?

4) Finally, can you please elaborate more on the value of micro_batch_size?
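To make sure I am reading the script correctly, here is a consolidated sketch of the lines in question; the concrete values below are placeholders I made up, not the script's defaults.

```python
# Consolidated sketch of the hyperparameters I am asking about.
# Values are placeholders; the comments mirror my questions above.
devices = 4                       # question 1: per node, or total across nodes (world_size)?
batch_size = 128                  # placeholder value
micro_batch_size = 8              # question 4: how should this be chosen?
max_iters = 600000                # num_epochs * epoch_size // devices

process_batch_size = batch_size // devices   # question 3: should this be // world_size instead?
```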

Thank you,

carmocca commented 1 year ago

Is #devices here the number of GPUs per node or the total number (world_size)?

Per node

I also checked the finetuning script and found that it's calculated differently: num_epochs * epoch_size // micro_batch_size // devices

That was a minor inconsistency. Fixed in #332

Finally, can you please elaborate more on the value of micro_batch_size?

Please see this explanation https://github.com/Lightning-AI/lit-llama/issues/286#issuecomment-1552608587 and the DeepSpeed docs

Shouldn't the value be divided by the world_size instead of the number of GPUs per node?

The variable name is batch_size, but it is effectively used as a process_batch_size, since it is only used to define gradient_accumulation_iters. I guess it could be renamed to make this clearer.
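In other words, roughly this relationship (the values are illustrative, not the script's defaults):

```python
# batch_size only matters through the per-process value and the resulting
# number of gradient accumulation steps; all values here are illustrative.
batch_size = 512          # intended batch size, in samples
devices = 8               # GPUs visible to this launch
micro_batch_size = 8      # what fits in memory for a single forward/backward

process_batch_size = batch_size // devices                              # 64
gradient_accumulation_iters = process_batch_size // micro_batch_size    # 8
assert gradient_accumulation_iters > 0, "micro_batch_size too large for this batch_size"
```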

https://www.deepspeed.ai/docs/config-json/#batch-size-related-parameters describes that the global batch size "can be omitted if both train_micro_batch_size_per_gpu and gradient_accumulation_steps are provided".
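Restated as code, that identity looks like this (arbitrary numbers, only to show the relationship between the three quantities):

```python
# DeepSpeed's batch-size identity: the global batch equals the per-GPU micro
# batch times the gradient accumulation steps times the data-parallel world size.
train_micro_batch_size_per_gpu = 8
gradient_accumulation_steps = 8
world_size = 16

train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
print(train_batch_size)  # 1024 -- which is why it "can be omitted" when the other two are set
```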

Thoughts on this @awaelchli?

awaelchli commented 1 year ago

The variable name is batch_size, but it is effectively used as a process_batch_size, since it is only used to define gradient_accumulation_iters.

The batch size is the global batch size here, and the micro-batch size is the size of the batch of samples that actually gets assembled and fed through the model. Afaik this corresponds to the terminology others use and to how it is described in the paper, so I highly recommend keeping it this way. The user is free to redefine it. For example, sometimes it is easier to define the per-device batch size and compute the global one as a function of it. It depends on what is more practical.

If an example helps: in the case of LLaMA, whose paper reports a batch size of 4M, we would set batch_size in redpajama.py to 4M. Then, depending on the available GPU memory, the user would tune micro_batch_size to a value that maximizes throughput.
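To make that concrete, here is a rough back-of-the-envelope: the 4M batch size (in tokens) and the 2048-token context come from the LLaMA paper, while the GPU count and micro_batch_size below are made-up values chosen only to show how the pieces interact.

```python
# Back-of-the-envelope for the LLaMA example; only the 4M token batch size and
# the 2048-token context are from the paper, everything else is made up.
tokens_per_global_batch = 4_000_000
block_size = 2048
sequences_per_global_batch = tokens_per_global_batch // block_size      # 1953 sequences

devices = 64               # hypothetical total GPU count
micro_batch_size = 6       # tuned until GPU memory / throughput is maxed out

process_batch_size = sequences_per_global_batch // devices              # 30 sequences per GPU
gradient_accumulation_iters = process_batch_size // micro_batch_size    # 5 accumulation steps
```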

@LamOne1 The max_iters value is an arbitrary number that we inherited from nanoGPT. Realistically, we would set it to infinity and keep training until convergence (or, for multi-billion-parameter models, until we run out of compute budget). But nobody has done that, so these parameters need to be tuned a bit.

Shouldn't the value be divided by the world_size instead of the number of GPUs per node?

Yes.

LamOne1 commented 1 year ago

Dear @carmocca and @awaelchli,

Thank you so much! Your answer was incredibly helpful and I really appreciate you taking the time to explain things in such a clear and concise way.

Shouldn't the value be divided by the world_size instead of the number of GPUs per node?

Yes.

I think the training code can be updated to use the world size instead of the number of devices: process_batch_size = batch_size // fabric.world_size

I asked about the number of iterations because I want to know when the training will have completed one epoch, i.e. whether the model has seen all the data.
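For my own bookkeeping, this is roughly how I am estimating it; every number below is a placeholder, and I am not sure whether max_iters counts micro-batch steps or optimizer steps, so please correct me if the mapping is off.

```python
# Rough estimate of how many global batches cover one epoch; all numbers are placeholders.
dataset_tokens = 1_000_000_000      # total tokens in the training set
block_size = 2048                   # tokens per training sequence
world_size = 16                     # total GPUs across all nodes
micro_batch_size = 6                # sequences per GPU per forward/backward
gradient_accumulation_iters = 8

sequences_per_epoch = dataset_tokens // block_size
sequences_per_global_batch = micro_batch_size * gradient_accumulation_iters * world_size
global_batches_per_epoch = sequences_per_epoch // sequences_per_global_batch
print(global_batches_per_epoch)     # after this many optimizer steps the model has seen ~all the data once
```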

Is #devices here the number of GPUs per node or the total number (world_size)?

Per node

@carmocca, can you please explain why it is per node and not the world size?