There seems to be a difference in how FSDP handles GPTQ-LoRA sharding compared to QLoRA. We observe the following in our benchmarks:
Observations on Llama2-70B
1. Lower Memory Consumption for GPTQ-LoRA vs QLoRA for single-device finetuning
We notice for Llama2-70B that GPTQ-LoRA (59.0 GiB) consumes 6 GiB less memory than QLoRA (65.0 GiB) for single-device finetuning.
| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-bnb | NousResearch/Llama-2-70b-hf | 1 | 4 | 445 | 65.0 |
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 1 | 4 | 451 | 59.0 |
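Note on measurement: the tables report the average memory usage per device, but the benchmark harness is not shown here. A minimal sketch of one way to capture a comparable per-rank number, assuming the PyTorch CUDA allocator statistics are an acceptable proxy (nvidia-smi reserved memory will generally read higher):

```python
import torch

def peak_device_memory_gib(tag: str = "") -> float:
    """Print and return the peak CUDA memory allocated on the current device, in GiB."""
    device = torch.cuda.current_device()
    peak_gib = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"{tag} peak allocated on device {device}: {peak_gib:.1f} GiB")
    return peak_gib

# Usage sketch: call torch.cuda.reset_peak_memory_stats() before the training loop,
# then peak_device_memory_gib("after N steps") once throughput has stabilised.
```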
2. Higher Memory Consumption for GPTQ-LoRA vs QLoRA for distributed finetuning
However, GPTQ-LoRA (61.2 GiB/device) consumes 24.9 GiB/device more than QLoRA (36.3 GiB/device) for distributed finetuning.
| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-bnb | NousResearch/Llama-2-70b-hf | 2 | 2 | 422 | ~~29.5~~ 36.3 |
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 2 | 2 | 438 | ~~53.2~~ 61.2 |
3. Minimal memory savings observed when the batch size drops and the number of GPUs increases in distributed finetuning
Despite sharding the model (Ngpus=1 -> Ngpus=2) and halving the per-device batch size (bs=4/device -> bs=2/device) for the 70B model, the memory usage per device did not decrease (59.0 GiB -> 61.2 GiB); see the back-of-envelope sketch after the table below.
| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 1 | 4 | 451 | 59.0 |
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 2 | 2 | 438 | ~~53.2~~ 61.2 |
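For context, a rough back-of-envelope (assuming ~70e9 parameters stored at 4 bits and ideal even sharding; LoRA adapters, activations, optimizer state and allocator overhead are ignored) shows that the per-device share of the quantized base weights should roughly halve when going from 1 to 2 GPUs, which makes the flat 59.0 GiB -> 61.2 GiB numbers above surprising:

```python
# Back-of-envelope: per-device footprint of the 4-bit quantized base weights alone.
# Assumes ~70e9 parameters at 4 bits each, evenly sharded across ranks; LoRA adapters,
# activations, optimizer state and allocator overhead are deliberately ignored.
params = 70e9
bits_per_param = 4

for num_gpus in (1, 2):
    per_device_gib = params * bits_per_param / 8 / num_gpus / 1024**3
    print(f"{num_gpus} GPU(s): ~{per_device_gib:.1f} GiB of base weights per device")

# -> ~32.6 GiB on 1 GPU vs ~16.3 GiB per device on 2 GPUs, so if the quantized
#    weights were actually sharded we would expect a clear per-device drop.
```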
Observations on Mixtral
The same issue occurs to a lesser degree with Mixtral: going from single-device to distributed finetuning, we notice a ~13 GiB overhead for GPTQ-LoRA compared to a 1.6 GiB overhead for QLoRA at the same batch size.
1. GPTQ-LoRA - 13 GiB increase in memory consumption from single-device to distributed finetuning
| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-autogptq | TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 1 | 4 | 1854 | 23.9 |
| accelerated-peft-autogptq | TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 2 | 4 | 1821 | 37.0 |
2. QLoRA - Lower overhead (1.6 GiB) in memory consumption from single-device to distributed finetuning
| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-bnb | mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 4 | 1793 | 24.6 |
| accelerated-peft-bnb | mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 4 | 1731 | 26.2 |
This suggests there could be a memory-leak bug when sharding with the accelerated-peft-autogptq plugin.
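One way to narrow this down (a hypothetical diagnostic, not part of the benchmarks above; it assumes access to the FSDP-wrapped model on each rank after preparation) is to count how many bytes of parameters and buffers each rank physically holds. If the GPTQ quantized tensors (e.g. the `qweight` buffers) show up at full size on every rank, they are being replicated rather than sharded:

```python
import torch
import torch.distributed as dist

def report_local_tensor_bytes(model: torch.nn.Module) -> None:
    """Print how many GiB of parameters and buffers this rank physically holds.

    Hypothetical diagnostic: call on the FSDP-wrapped model on every rank.
    If the per-rank totals match the single-GPU totals, the quantized weights
    are being replicated across ranks instead of sharded.
    """
    param_gib = sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**3
    buffer_gib = sum(b.numel() * b.element_size() for b in model.buffers()) / 1024**3
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"rank {rank}: parameters {param_gib:.1f} GiB, buffers {buffer_gib:.1f} GiB")
```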