foundation-model-stack / fms-acceleration

šŸš€ Collection of libraries used with fms-hf-tuning to accelerate fine-tuning and training of large models.

Quantized Peft Benchmark Experiments Run Out of Memory with Non-Zero Lora Dropout #50

Open achew010 opened 4 months ago

achew010 commented 4 months ago

Description

Update: Previously the OOM was reported only for BNB, but it is now observed for quantized PEFT in general, including GPTQ. See #106.

[Image: plot of benchmark outliers]

The previous description below covers the issue as it was originally observed for BNB only.

BNB experiments run out of memory in the new benchmarks that set lora_dropout=0.1.

| Benchmark | framework_config | peft_method | model_name_or_path | num_gpus | per_device_train_batch_size | lora_dropout | Peak Memory (GiB) |
|---|---|---|---|---|---|---|---|
| Reference | accelerated-peft-bnb | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0. | 72.39 |
| New | accelerated-peft-bnb | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0.1 | 0. (OOM) |
With AutoGPTQ, by comparison, we do not observe this issue:

| Benchmark | framework_config | peft_method | model_name_or_path | num_gpus | per_device_train_batch_size | lora_dropout | Peak Memory (GiB) |
|---|---|---|---|---|---|---|---|
| Reference | accelerated-peft-autogptq | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0. | 70.14 |
| New | accelerated-peft-autogptq | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0.1 | 71.7 |
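For reference, the only LoRA hyperparameter that differs between the two runs is the dropout probability. A minimal sketch of the two adapter configurations using HuggingFace `peft` (assuming the benchmark flags map directly onto `LoraConfig`; the variable names are illustrative):

```python
from peft import LoraConfig

# Reference run: lora_dropout=0., peak memory as expected
reference_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# New run: identical except lora_dropout=0.1; the BNB variant OOMs
new_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```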

There may be additional memory overhead in the dropout implementation that causes the experiment to run out of memory for large models.
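For context, HuggingFace `peft` swaps in `nn.Identity` when `lora_dropout` is 0 and `nn.Dropout` otherwise, so a non-zero dropout adds a masked copy of the adapter input to the activations saved for backward. A simplified sketch of that forward path (illustrative only, not the fms-acceleration or peft source):

```python
import torch
import torch.nn as nn

class LoraLinearSketch(nn.Module):
    """Simplified PEFT-style LoRA wrapper around a (quantized) base layer."""

    def __init__(self, base: nn.Module, in_features: int, out_features: int,
                 r: int = 16, lora_dropout: float = 0.0):
        super().__init__()
        self.base = base
        # peft uses nn.Identity for lora_dropout == 0., so the no-dropout
        # path allocates nothing extra; nn.Dropout is used otherwise.
        self.dropout = nn.Dropout(p=lora_dropout) if lora_dropout > 0 else nn.Identity()
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # With p > 0, dropout(x) materializes an extra activation the size
        # of the full (batch, seq_len, hidden) input per adapted layer,
        # kept for the backward pass; at batch 4 x seq_len 4096 on a 70B
        # model this overhead is non-trivial.
        return self.base(x) + self.lora_B(self.lora_A(self.dropout(x)))
```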

Reproduce Issue

With lora_dropout=0., training proceeds normally:

```sh
export CUDA_VISIBLE_DEVICES=0,1
export ACCELERATION_FRAMEWORK_CONFIG_FILE=/workspace/fms-acceleration/scripts/benchmarks/../../sample-configurations/baseline-peft-bnb-nf4-sample-configuration.yaml
accelerate launch --config_file scripts/benchmarks/accelerate.yaml --num_processes=2 --main_process_port=29500 -m tuning.sft_trainer --model_name_or_path NousResearch/Llama-2-70b-hf --packing True --max_seq_len 4096 --fp16 True --learning_rate 2e-4 --torch_dtype float16 --peft_method lora --r 16 --lora_alpha 16 --lora_dropout 0. --target_modules q_proj k_proj v_proj o_proj --use_flash_attn True --response_template '
### Response:' --dataset_text_field output --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 100 --training_data_path benchmark_outputs/data/cache.json --per_device_train_batch_size 4 --output_dir benchmark_outputs/exp_35/hf --skip_memory_metrics False
```

With lora_dropout=0.1, the run goes out of memory:

```sh
export CUDA_VISIBLE_DEVICES=0,1
export ACCELERATION_FRAMEWORK_CONFIG_FILE=/workspace/fms-acceleration/scripts/benchmarks/../../sample-configurations/baseline-peft-bnb-nf4-sample-configuration.yaml
accelerate launch --config_file scripts/benchmarks/accelerate.yaml --num_processes=2 --main_process_port=29500 -m tuning.sft_trainer --model_name_or_path NousResearch/Llama-2-70b-hf --packing True --max_seq_len 4096 --fp16 True --learning_rate 2e-4 --torch_dtype float16 --peft_method lora --r 16 --lora_alpha 16 --lora_dropout 0.1 --target_modules q_proj k_proj v_proj o_proj --use_flash_attn True --response_template '
### Response:' --dataset_text_field output --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 100 --training_data_path benchmark_outputs/data/cache.json --per_device_train_batch_size 4 --output_dir benchmark_outputs/exp_35/hf --skip_memory_metrics False
```
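To quantify the gap rather than only observing the crash, the per-device high-water mark can be read directly from CUDA between runs (a hypothetical snippet; `--skip_memory_metrics False` should already surface similar numbers through the HF Trainer):

```python
import torch

# Reset CUDA's peak-memory counter before the steps of interest.
torch.cuda.reset_peak_memory_stats()
# ... run a few training steps with the command above ...
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated: {peak_gib:.2f} GiB")
```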
fabianlim commented 5 days ago

While this issue was originally reported for BNB, we have now also observed it for quantized PEFT in general (see #106). Updating the issue to reflect the general case.