foundation-model-stack / fms-acceleration

🚀 Collection of libraries used with fms-hf-tuning to accelerate fine-tuning and training of large models.
Apache License 2.0

Memory Consumption for GPTQ-LoRA is higher than QLoRA in Distributed Finetuning #12

Closed · achew010 closed this issue 1 month ago

achew010 commented 1 month ago

Issue:

There seems to be a difference in how FSDP handles GPTQ-LoRA sharding compared to QLoRA. We observe the following in our benchmarks:

Observations on Llama2-70B

1. Lower Memory Consumption for GPTQ-LoRA vs QLoRA for single device finetuning

We notice for Llama2-70B that GPTQ-LoRA (59.0 GiB) consumes 6 GiB less memory than QLoRA (65.0 GiB) for single-device finetuning.

| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-bnb | NousResearch/Llama-2-70b-hf | 1 | 4 | 445 | 65.0 |
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 1 | 4 | 451 | 59.0 |
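
For context, here is a rough back-of-envelope estimate of the 4-bit weight footprint alone (our own arithmetic, not produced by the benchmark harness); the gap to the observed 59-65 GiB is adapters, optimizer state, activations and framework overhead:

```python
# Rough estimate of the 4-bit quantized weight footprint of a 70B model.
# Ignores quantization metadata (scales/zeros), LoRA adapters, optimizer
# state and activations, which account for the rest of the ~59-65 GiB.
num_params = 70e9          # approximate parameter count for Llama2-70B
bytes_per_param = 0.5      # 4-bit quantization ~ 0.5 bytes per weight

weight_mem_gib = num_params * bytes_per_param / 2**30
print(f"Approx. quantized weight footprint: {weight_mem_gib:.1f} GiB")  # ~32.6 GiB
```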

2. Higher Memory Consumption for GPTQ-LoRA vs QLoRA for distributed finetuning

However, GPTQ-LoRA (61.2 GiB/device) consumes 24.9 GiB/device more than QLoRA (36.3 GiB/device) for distributed finetuning.

| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-bnb | NousResearch/Llama-2-70b-hf | 2 | 2 | 422 | ~~29.5~~ 36.3 |
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 2 | 2 | 438 | ~~53.2~~ 61.2 |

3. Minimal memory savings observed when batch size drops and the number of GPUs increases in distributed finetuning

Despite sharding the model (Ngpus=1 -> Ngpus=2) and halving the batch size (bs=4/device -> bs=2/device) for the 70B model, we also noticed that the memory usage per device did not decrease, and in fact rose slightly (59.0 GiB -> 61.2 GiB).

| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 1 | 4 | 451 | 59.0 |
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 2 | 2 | 438 | ~~53.2~~ 61.2 |
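
For reference, a rough estimate of what the per-device weight shard should look like if FSDP fully sharded the 4-bit weights across 2 GPUs (again our own arithmetic, ignoring activations and optimizer state):

```python
# If FSDP fully sharded the 4-bit weights across 2 devices, the per-device
# weight footprint should roughly halve. The observed ~61.2 GiB/device stays
# at or above the single-device number, which is consistent with each rank
# holding a full copy of the quantized weights.
num_params = 70e9
bytes_per_param = 0.5
num_gpus = 2

full_weights_gib = num_params * bytes_per_param / 2**30
shard_gib = full_weights_gib / num_gpus
print(f"Full quantized weights : {full_weights_gib:.1f} GiB")  # ~32.6 GiB
print(f"Expected shard / device: {shard_gib:.1f} GiB")         # ~16.3 GiB
```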

Observations on Mixtral

The same issue occurs to a lesser degree with Mixtral. Comparing single-device to distributed finetuning, we notice a 13 GiB overhead for GPTQ-LoRA, compared to a 1.6 GiB overhead for QLoRA at the same batch size.

1. GPTQ-LoRA - 13 GiB increase in memory consumption from single-device to distributed finetuning

| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-autogptq | TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 1 | 4 | 1854 | 23.9 |
| accelerated-peft-autogptq | TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 2 | 4 | 1821 | 37.0 |
2. QLoRA - Lower overhead (1.6 GiB) in memory consumption from single-device to distributed finetuning

| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-bnb | mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 4 | 1793 | 24.6 |
| accelerated-peft-bnb | mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 4 | 1731 | 26.2 |
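
To separate weight memory from activation and optimizer memory when comparing the single-device and distributed runs, per-rank peak allocations can be sampled around model loading and a training step. The snippet below is only a sketch using standard torch.cuda statistics; it is not the harness used to produce the tables above:

```python
import torch
import torch.distributed as dist

def log_peak_mem(tag: str):
    """Print this rank's peak allocated GPU memory since the last reset."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    print(f"[rank {rank}] {tag}: peak {peak_gib:.1f} GiB")
    torch.cuda.reset_peak_memory_stats()

# Usage sketch (model/trainer setup omitted):
#   log_peak_mem("after model load")   # dominated by the (sharded?) weights
#   ... run a single forward/backward/optimizer step ...
#   log_peak_mem("after train step")   # adds activations + optimizer state
```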

This suggests that there could be a memory bug (e.g. a leak, or the quantized weights not being sharded properly) when sharding with the accelerated-peft-autogptq plugin.
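
One way to check this would be to sum the parameter and buffer bytes each rank actually materializes after FSDP wrapping; if every rank reports close to the full quantized model size, the GPTQ weights are being replicated rather than sharded. The helper below is a hypothetical diagnostic sketch, not code from the plugin:

```python
import torch
import torch.distributed as dist

def gib_on_rank(model: torch.nn.Module) -> float:
    """GiB of parameters + buffers (e.g. GPTQ qweight/qzeros/scales) held by this rank."""
    total = 0
    for t in list(model.parameters()) + list(model.buffers()):
        if t.device.type == "cuda":   # ignore anything offloaded to CPU
            total += t.numel() * t.element_size()
    return total / 2**30

# Usage sketch, after the model has been wrapped by FSDP:
#   gib = gib_on_rank(model)
#   print(f"[rank {dist.get_rank()}] holds {gib:.1f} GiB of weights/buffers")
# Roughly full model size on every rank  => quantized weights are replicated.
# Roughly full size / world_size per rank => sharding is working as expected.
```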

fabianlim commented 1 month ago

This has been addressed for FSDP + GPTQ Triton v2, so closing for now.