Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

llama2 13b out of memory on A800 #26

Closed · xdhhh closed this 9 months ago

xdhhh commented 1 year ago

When I try to fine-tune the llama2-13b model on data in the alpaca format, I always get a CUDA OOM error, even with batch_size == 1:

torchrun --master_port=1112 --nproc_per_node=2 main_finetune.py \
  --output_dir output/"$exp_name" --epochs 5 --warmup_epochs 1 \
  --batch_size 1 --accum_iter 2 --num_workers 2 \
  --max_words 512 \
  --lr 3e-5 --min_lr 3e-6 --clip_grad 2 --weight_decay 0.02 \
  --data_parallel "$data_parallel" --model_parallel_size 2 --checkpointing \
  --llama_type llama --llama_config "$llama_config" --tokenizer_path "$tokenizer_path" \
  --no_visual \
  --pretrained_path "$pretrained_path" --pretrained_type="$pretrained_type" \
  --data_config $data_config \
  --precision bf16 \
  2>&1 | tee -a output/"$exp_name"/output.log

linziyi96 commented 1 year ago

If you are running full fine-tuning, it might be that 2 GPUs are not enough: a 13B model needs about 156GB of GPU memory for the optimizer states alone (a 4-byte FP32 master copy of each weight plus 8 bytes of Adam momentum per parameter). I guess it will take at least 4 A800 80GB GPUs to run full fine-tuning. You may try adding more GPUs, or try a parameter-efficient tuning method instead (e.g., using the example script at https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/accessory/exps/finetune/sg/alpaca_llamaPeft_normBiasLora.sh).
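
For reference, a minimal back-of-the-envelope sketch of that estimate, assuming mixed-precision full fine-tuning keeps an FP32 master copy of the weights plus two FP32 Adam moment buffers (gradients, activations, and the BF16 working copy come on top):

```python
# Rough optimizer-state memory estimate for mixed-precision full fine-tuning.
# Assumes an FP32 master copy of the weights plus FP32 Adam first/second
# moments; gradients, activations, and the BF16 working weights are extra.
def optimizer_state_gb(n_params: float) -> float:
    bytes_per_param = 4 + 4 + 4  # FP32 master weight + Adam m + Adam v
    return n_params * bytes_per_param / 1e9

print(optimizer_state_gb(13e9))  # ~156 GB for a 13B model
# Even split across 2 x 80GB A800s via model parallelism, this alone
# exceeds the available memory, hence the suggestion of >=4 GPUs or PEFT.
```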

VividLe commented 1 year ago

Thanks for this solid work. Could you please share the GPU memory cost when fine-tuning LLaMA2-7B with LoRA + bias/norm tuning? And what is the cost for LLaMA2-13B?

linziyi96 commented 1 year ago

For PEFT methods (with gradient checkpointing enabled), the most memory-consuming part should be the frozen model weights, which take about 14GB for the 7B model and 26GB for the 13B model (in BF16/FP16). I guess a >=24GB GPU is enough to run 7B PEFT, and a >=32GB GPU will run 13B PEFT.
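
A quick sketch of that arithmetic, assuming the frozen base weights held in BF16/FP16 (2 bytes per parameter) dominate, with the small set of trainable PEFT parameters, their optimizer states, and checkpointed activations adding a comparatively modest amount on top:

```python
# Frozen-weight footprint for PEFT with the base model kept in BF16/FP16.
def frozen_weights_gb(n_params: float, bytes_per_param: float = 2) -> float:
    return n_params * bytes_per_param / 1e9

print(frozen_weights_gb(7e9))   # ~14 GB for LLaMA2-7B
print(frozen_weights_gb(13e9))  # ~26 GB for LLaMA2-13B
```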