foundation-model-stack / fms-acceleration

🚀 Collection of libraries used with fms-hf-tuning to accelerate fine-tuning and training of large models.
Apache License 2.0

Workaround Low-Mem-Mode Patch for GPTQ-LoRA #26

Closed: achew010 closed this 1 month ago

achew010 commented 1 month ago

Description

This PR addresses #18 with the following contributions:

TODO:

Tests
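For context, below is a generic, heavily simplified sketch of the "low memory mode" loading idea referenced in this patch: only rank 0 materializes real weights, the other ranks hold meta tensors, and FSDP broadcasts the weights when it wraps the model. This is an illustrative assumption, not the actual loading path in fms-acceleration or auto_gptq; the model name is taken from the reproduction command below, everything else is hypothetical.

```python
# Hypothetical sketch of low-memory multi-GPU loading (not the actual patch):
# rank 0 loads real weights, other ranks hold meta tensors, and FSDP's
# sync_module_states broadcast materializes the weights on every rank.
import torch
import torch.distributed as dist
from accelerate import init_empty_weights
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoConfig, AutoModelForCausalLM

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)  # assumes one process per GPU on a single node

model_name = "TheBloke/Llama-2-70B-GPTQ"  # from the reproduction command
config = AutoConfig.from_pretrained(model_name)

if rank == 0:
    # only one full copy of the weights is ever materialized
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
else:
    # other ranks allocate no storage; parameters live on the meta device
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config)

model = FSDP(
    model,
    device_id=torch.cuda.current_device(),
    sync_module_states=True,  # rank 0 broadcasts real weights to all ranks
    param_init_fn=(
        None if rank == 0
        else lambda m: m.to_empty(device=torch.cuda.current_device(), recurse=False)
    ),
)
```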

Reproduction command

accelerate launch --config_file scripts/benchmarks/accelerate.yaml --num_processes=2 --main_process_port=29500 -m tuning.sft_trainer --model_name_or_path TheBloke/Llama-2-70B-GPTQ --acceleration_framework_config_file /data/aaron/experimental/test3/scripts/benchmarks/../../sample-configurations/accelerated-peft-autogptq-sample-configuration.yaml --packing True --max_seq_len 4096 --learning_rate 2e-4 --fp16 True --torch_dtype float16 --peft_method lora --r 16 --lora_alpha 16 --lora_dropout 0.0 --target_modules q_proj k_proj v_proj o_proj --use_flash_attn True --response_template '\n### Response:' --dataset_text_field 'output' --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 10 --training_data_path /data/aaron/experimental/test3/benchmark_outputs_final/data/cache.json --per_device_train_batch_size 2 --output_dir benchmark_outputs/exp_57/hf --skip_memory_metrics False
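The `--skip_memory_metrics False` flag makes the trainer report memory usage. As a rough guide to the table columns below, the sketch that follows shows how such numbers are typically collected: "nvidia mem reserved" from the driver via pynvml (what `nvidia-smi` shows), "peak torch mem alloc" from `torch.cuda.max_memory_allocated`, and "torch mem alloc" from `torch.cuda.memory_allocated`. This is an assumption about the measurement, not necessarily the exact logic of the benchmark script.

```python
import torch
from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
)

GIB = 1024 ** 3

def report_gpu_memory(device_index: int = 0) -> dict:
    """Collect memory numbers comparable to the benchmark table columns."""
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(device_index)
    driver_used = nvmlDeviceGetMemoryInfo(handle).used  # what nvidia-smi reports
    nvmlShutdown()
    return {
        "nvidia_mem_reserved_gib": driver_used / GIB,
        "peak_torch_mem_alloc_gib": torch.cuda.max_memory_allocated(device_index) / GIB,
        "torch_mem_alloc_gib": torch.cuda.memory_allocated(device_index) / GIB,
    }

# e.g. call once after training finishes:
# print(report_gpu_memory(0))
```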

Comparison

Before Fix:

A memory explosion is observed for GPTQ-LoRA without low-memory mode: Nvidia memory reserved reaches 78.80 GiB and Torch memory allocated 36.14 GiB, compared to QLoRA running with low-memory mode enabled.
| model name | framework config | num gpus | per device train batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | accelerated-peft-bnb | 2 | 2 | 51.40 | 46.52 | 19.17 | 417 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq | 2 | 2 | 78.80 | 45.40 | 36.14 | 429 |

After Fix:

With low-memory mode enabled, GPTQ-LoRA now has lower memory consumption, Nvidia memory reserved 49.44 GiB and Torch memory allocated 18.13 GiB, and is comparable with QLoRA.
| model name | framework config | num gpus | per device train batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | accelerated-peft-bnb | 2 | 2 | 51.40 | 46.52 | 19.17 | 414 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq | 2 | 2 | 49.44 | 44.87 | 18.13 | 428 |
fabianlim commented 1 month ago

@achew010 can you update the top-level comment with the previous memory allocation, and verify that the new measurements were obtained after reversing the hack in https://github.com/foundation-model-stack/fms-acceleration/pull/26/commits/80d631e64cd78d97c079b0346d90079e56d9f5f7