Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

7B pretraining results in OOM; model_parallel=2 can't load official 7B ckpt #65

Closed miokomioko closed 10 months ago

miokomioko commented 1 year ago

hardware: 8x A100 80GB

Firstly, I want to express my gratitude for providing the training codes for the Llama2-7B model. I am eager to pretrain the Llama2-7B model using my own corpus. Please note that I am specifically discussing the "PRETRAINING" process, where I utilize the official 7B model as a starting point and then continue training it on additional datasets, such as Falcon, as outlined in your documentation.

However, I have encountered a couple of issues when attempting to use the exps/pretrain/vanilla.sh script with the 7B model. Even when I set the batch size to 1, it results in an out-of-memory (OOM) error. As an alternative, I have tried setting the model_parallel parameter to 2, but this approach fails to load the pretrained 7B model because the 7B checkpoint is saved under a single rank. I would greatly appreciate your assistance in resolving this issue.

The script I use to run pretraining is as follows:

# positional arguments: model config, tokenizer, data metadata file, data root, initial checkpoint
llama_config="$1"
tokenizer_path="$2"
data_meta_path="$3"
data_root="$4"
pretrained_path="$5"

data_parallel=fsdp
model_parallel=2

exp_name="pretrain/vanilla"
echo "exp name: $exp_name"
mkdir -p output/"$exp_name"

torchrun --nproc_per_node=2 --master_port=29100 \
main_pretrain.py \
--output_dir output/"$exp_name" \
--batch_size 1 --accum_iter 1 --num_workers 4 \
--max_words 512 \
--lr 0.0001 --min_lr 0.00001 --warmup_iters 5000 --lr_decay_iters 400000 --clip_grad 2 --weight_decay 0.02 \
--data_parallel "$data_parallel" --model_parallel_size "$model_parallel" \
--llama_type llama --llama_config "$llama_config" --tokenizer_path "$tokenizer_path" \
--data_meta_path "$data_meta_path" --data_root "$data_root" \
--pretrained_path "$pretrained_path" --checkpointing \
2>&1 | tee -a output/"$exp_name"/output.log

echo "exp name: $exp_name"
ChrisLiu6 commented 1 year ago

Hi, thank you for your recognition of our work!

The command you provided shows that you are training with 2 GPUs. To reproduce your issue, I have just run the following experiment:

llama_config="$1"
tokenizer_path="$2"
data_meta_path="$3"
data_root="$4"
pretrained_path="$5"

data_parallel=fsdp
model_parallel=1

exp_name="pretrain/vanilla"
echo "exp name: $exp_name"
mkdir -p output/"$exp_name"

torchrun --nproc_per_node=2 --master_port 1112 main_pretrain.py \
--output_dir output/"$exp_name" \
--batch_size 4 --accum_iter 16 --num_workers 4 \
--max_words 2048 \
--lr 0.0001 --min_lr 0.00001 --warmup_iters 5000 --lr_decay_iters 400000 --clip_grad 2 --weight_decay 0.02 \
--data_parallel "$data_parallel" --model_parallel_size "$model_parallel" --checkpointing \
--llama_type llama --llama_config "$llama_config" --tokenizer_path "$tokenizer_path" \
--data_meta_path "$data_meta_path" --data_root "$data_root" \
--pretrained_path "$pretrained_path" \
--pretrained_type meta_ori \
2>&1 | tee -a output/"$exp_name"/output"$RANK".log

echo "exp name: $exp_name"

However, my experiment runs smoothly, and the log shows a peak GPU memory usage of 56990M (77664M as reported by nvidia-smi).

Therefore, your OOM error is not expected, and there is likely some other, unnoticed problem. For example, are you really training a 7B model? The number of parameters in the model is printed out by default, so you can check the log. Many other details are also recorded in the log and may help you debug.
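
As a quick sanity check against the parameter count printed in the log, here is a back-of-envelope computation for the standard LLaMA-2-7B configuration (dim=4096, 32 layers, FFN hidden size 11008, vocab 32000); these are the published architecture values, not numbers read from this repository:

# Expected parameter count for the published LLaMA-2-7B architecture
# (dim=4096, n_layers=32, FFN hidden size=11008, vocab=32000), to compare
# against the count printed in the training log.
dim, n_layers, ffn_dim, vocab = 4096, 32, 11008, 32000
per_layer = 4 * dim * dim + 3 * dim * ffn_dim + 2 * dim   # attention (wq/wk/wv/wo) + SwiGLU FFN (w1/w2/w3) + 2 RMSNorms
total = n_layers * per_layer + 2 * vocab * dim + dim      # + token embeddings, output head, final RMSNorm
print(f"{total:,}")  # 6,738,415,616 (~6.7B)

If the count in the log differs substantially from ~6.7B, the config being passed is probably not the 7B one.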

wj210 commented 11 months ago

Hi, could I ask what a rough estimate of the VRAM required to pretrain a 7B model would be? Would 4x 46GB A6000s be sufficient, or at least feasible?
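
For a very rough sense of scale, here is a back-of-envelope accounting only (not a measurement from this toolkit), assuming bf16 weights and gradients plus fp32 Adam states:

# Back-of-envelope VRAM accounting for full-parameter training of a ~6.7B model.
# Assumes bf16 weights + bf16 gradients plus fp32 master weights and Adam m/v states;
# ignores activations, temporary buffers, and memory fragmentation.
params = 6.74e9
weights_and_grads = params * (2 + 2)       # bf16 weights + bf16 gradients
optimizer_state   = params * (4 + 4 + 4)   # fp32 master copy + Adam first/second moments
total_gib = (weights_and_grads + optimizer_state) / 2**30
print(f"~{total_gib:.0f} GiB of parameter/optimizer state before activations")  # ~100 GiB

With FSDP this state is sharded across GPUs, so the per-GPU share is roughly that total divided by the number of GPUs, plus activation memory (which --checkpointing reduces); actual feasibility on 4x A6000s would still need to be verified empirically.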