foundation-model-stack / fms-acceleration

🚀 Collection of libraries used with fms-hf-tuning to accelerate fine-tuning and training of large models.
Apache License 2.0

Workaround Low-Mem-Mode Patch for GPTQ-LoRA #26

Closed: achew010 closed this 1 month ago

achew010 commented 1 month ago

Description

This PR addresses #18 with the following contributions:

TODO:

Tests
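For context, below is a generic, heavily simplified sketch of the "low memory mode" loading idea referenced in this patch: only rank 0 materializes real weights, the other ranks hold meta tensors, and FSDP broadcasts the weights when it wraps the model. This is an illustrative assumption, not the actual loading path in fms-acceleration or auto_gptq; the model name is taken from the reproduction command below, everything else is hypothetical.

```python
# Hypothetical sketch of low-memory multi-GPU loading (not the actual patch):
# rank 0 loads real weights, other ranks hold meta tensors, and FSDP's
# sync_module_states broadcast materializes the weights on every rank.
import torch
import torch.distributed as dist
from accelerate import init_empty_weights
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoConfig, AutoModelForCausalLM

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)  # assumes one process per GPU on a single node

model_name = "TheBloke/Llama-2-70B-GPTQ"  # from the reproduction command
config = AutoConfig.from_pretrained(model_name)

if rank == 0:
    # only one full copy of the weights is ever materialized
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
else:
    # other ranks allocate no storage; parameters live on the meta device
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config)

model = FSDP(
    model,
    device_id=torch.cuda.current_device(),
    sync_module_states=True,  # rank 0 broadcasts real weights to all ranks
    param_init_fn=(
        None if rank == 0
        else lambda m: m.to_empty(device=torch.cuda.current_device(), recurse=False)
    ),
)
```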

Reproduction command

accelerate launch --config_file scripts/benchmarks/accelerate.yaml --num_processes=2 --main_process_port=29500 -m tuning.sft_trainer --model_name_or_path TheBloke/Llama-2-70B-GPTQ --acceleration_framework_config_file /data/aaron/experimental/test3/scripts/benchmarks/../../sample-configurations/accelerated-peft-autogptq-sample-configuration.yaml --packing True --max_seq_len 4096 --learning_rate 2e-4 --fp16 True --torch_dtype float16 --peft_method lora --r 16 --lora_alpha 16 --lora_dropout 0.0 --target_modules q_proj k_proj v_proj o_proj --use_flash_attn True --response_template '\n### Response:' --dataset_text_field 'output' --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 10 --training_data_path /data/aaron/experimental/test3/benchmark_outputs_final/data/cache.json --per_device_train_batch_size 2 --output_dir benchmark_outputs/exp_57/hf --skip_memory_metrics False
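The `--skip_memory_metrics False` flag makes the trainer report memory usage. As a rough guide to the table columns below, the sketch that follows shows how such numbers are typically collected: "nvidia mem reserved" from the driver via pynvml (what `nvidia-smi` shows), "peak torch mem alloc" from `torch.cuda.max_memory_allocated`, and "torch mem alloc" from `torch.cuda.memory_allocated`. This is an assumption about the measurement, not necessarily the exact logic of the benchmark script.

```python
import torch
from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
)

GIB = 1024 ** 3

def report_gpu_memory(device_index: int = 0) -> dict:
    """Collect memory numbers comparable to the benchmark table columns."""
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(device_index)
    driver_used = nvmlDeviceGetMemoryInfo(handle).used  # what nvidia-smi reports
    nvmlShutdown()
    return {
        "nvidia_mem_reserved_gib": driver_used / GIB,
        "peak_torch_mem_alloc_gib": torch.cuda.max_memory_allocated(device_index) / GIB,
        "torch_mem_alloc_gib": torch.cuda.memory_allocated(device_index) / GIB,
    }

# e.g. call once after training finishes:
# print(report_gpu_memory(0))
```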

Comparison

Before Fix:

A memory explosion is observed for GPTQ-LoRA without low-memory mode: Nvidia memory reserved reaches 78.80 GiB and Torch memory allocated 36.14 GiB, compared to QLoRA running with low-memory mode enabled.
| model name | framework config | num gpus | per device train batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | accelerated-peft-bnb | 2 | 2 | 51.40 | 46.52 | 19.17 | 417 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq | 2 | 2 | 78.80 | 45.40 | 36.14 | 429 |

After Fix:

With low-memory mode enabled, GPTQ-LoRA now has lower memory consumption, Nvidia memory reserved 49.44 GiB and Torch memory allocated 18.13 GiB, and is comparable with QLoRA.
| model name | framework config | num gpus | per device train batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | accelerated-peft-bnb | 2 | 2 | 51.40 | 46.52 | 19.17 | 414 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq | 2 | 2 | 49.44 | 44.87 | 18.13 | 428 |
fabianlim commented 1 month ago

@achew010 can you update the top-level comment with the previous memory allocation, and verify that the new measurements were obtained after reversing the hack in https://github.com/foundation-model-stack/fms-acceleration/pull/26/commits/80d631e64cd78d97c079b0346d90079e56d9f5f7