foundation-model-stack / fms-acceleration

šŸš€ Collection of libraries used with fms-hf-tuning to accelerate fine-tuning and training of large models.

Quantized Peft Benchmark Experiments Run Out of Memory with Non-Zero Lora Dropout #50

Open achew010 opened 4 months ago

achew010 commented 4 months ago

Description

Update: Previously the OOM was reported only for BNB, but it is now observed for quantized PEFT in general, including GPTQ. See #106.

[Image: plot of benchmark outliers]

The previous description below covers the issue as it was originally observed for BNB only.

BNB experiments run out of memory in the new benchmarks that set lora_dropout=0.1.

| Benchmark | framework_config | peft_method | model_name_or_path | num_gpus | per_device_train_batch_size | lora_dropout | Peak Memory (GiB) |
|---|---|---|---|---|---|---|---|
| Reference | accelerated-peft-bnb | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0. | 72.39 |
| New | accelerated-peft-bnb | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0.1 | 0. (OOM) |
With AutoGPTQ, by comparison, we do not observe this issue:

| Benchmark | framework_config | peft_method | model_name_or_path | num_gpus | per_device_train_batch_size | lora_dropout | Peak Memory (GiB) |
|---|---|---|---|---|---|---|---|
| Reference | accelerated-peft-autogptq | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0. | 70.14 |
| New | accelerated-peft-autogptq | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0.1 | 71.7 |
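For reference, the only LoRA hyperparameter that differs between the two runs is the dropout probability. A minimal sketch of the two adapter configurations using HuggingFace `peft` (assuming the benchmark flags map directly onto `LoraConfig`; the variable names are illustrative):

```python
from peft import LoraConfig

# Reference run: lora_dropout=0., peak memory as expected
reference_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# New run: identical except lora_dropout=0.1; the BNB variant OOMs
new_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```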

There may be additional memory overhead in the dropout implementation that causes the experiment to run out of memory for large models.
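For context, HuggingFace `peft` swaps in `nn.Identity` when `lora_dropout` is 0 and `nn.Dropout` otherwise, so a non-zero dropout adds a masked copy of the adapter input to the activations saved for backward. A simplified sketch of that forward path (illustrative only, not the fms-acceleration or peft source):

```python
import torch
import torch.nn as nn

class LoraLinearSketch(nn.Module):
    """Simplified PEFT-style LoRA wrapper around a (quantized) base layer."""

    def __init__(self, base: nn.Module, in_features: int, out_features: int,
                 r: int = 16, lora_dropout: float = 0.0):
        super().__init__()
        self.base = base
        # peft uses nn.Identity for lora_dropout == 0., so the no-dropout
        # path allocates nothing extra; nn.Dropout is used otherwise.
        self.dropout = nn.Dropout(p=lora_dropout) if lora_dropout > 0 else nn.Identity()
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # With p > 0, dropout(x) materializes an extra activation the size
        # of the full (batch, seq_len, hidden) input per adapted layer,
        # kept for the backward pass; at batch 4 x seq_len 4096 on a 70B
        # model this overhead is non-trivial.
        return self.base(x) + self.lora_B(self.lora_A(self.dropout(x)))
```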

Reproduce Issue

With lora_dropout=0., training proceeds normally:

```sh
export CUDA_VISIBLE_DEVICES=0,1
export ACCELERATION_FRAMEWORK_CONFIG_FILE=/workspace/fms-acceleration/scripts/benchmarks/../../sample-configurations/baseline-peft-bnb-nf4-sample-configuration.yaml
accelerate launch --config_file scripts/benchmarks/accelerate.yaml --num_processes=2 --main_process_port=29500 -m tuning.sft_trainer --model_name_or_path NousResearch/Llama-2-70b-hf --packing True --max_seq_len 4096 --fp16 True --learning_rate 2e-4 --torch_dtype float16 --peft_method lora --r 16 --lora_alpha 16 --lora_dropout 0. --target_modules q_proj k_proj v_proj o_proj --use_flash_attn True --response_template '
### Response:' --dataset_text_field output --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 100 --training_data_path benchmark_outputs/data/cache.json --per_device_train_batch_size 4 --output_dir benchmark_outputs/exp_35/hf --skip_memory_metrics False
```

With lora_dropout=0.1, the run goes out of memory:

```sh
export CUDA_VISIBLE_DEVICES=0,1
export ACCELERATION_FRAMEWORK_CONFIG_FILE=/workspace/fms-acceleration/scripts/benchmarks/../../sample-configurations/baseline-peft-bnb-nf4-sample-configuration.yaml
accelerate launch --config_file scripts/benchmarks/accelerate.yaml --num_processes=2 --main_process_port=29500 -m tuning.sft_trainer --model_name_or_path NousResearch/Llama-2-70b-hf --packing True --max_seq_len 4096 --fp16 True --learning_rate 2e-4 --torch_dtype float16 --peft_method lora --r 16 --lora_alpha 16 --lora_dropout 0.1 --target_modules q_proj k_proj v_proj o_proj --use_flash_attn True --response_template '
### Response:' --dataset_text_field output --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 100 --training_data_path benchmark_outputs/data/cache.json --per_device_train_batch_size 4 --output_dir benchmark_outputs/exp_35/hf --skip_memory_metrics False
```
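To quantify the gap rather than only observing the crash, the per-device high-water mark can be read directly from CUDA between runs (a hypothetical snippet; `--skip_memory_metrics False` should already surface similar numbers through the HF Trainer):

```python
import torch

# Reset CUDA's peak-memory counter before the steps of interest.
torch.cuda.reset_peak_memory_stats()
# ... run a few training steps with the command above ...
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated: {peak_gib:.2f} GiB")
```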
fabianlim commented 5 days ago

While this issue was originally reported for BNB, we have now also observed it for quantized PEFT in general (see #106). Updating the issue to reflect the general case.