There seems to be a difference in how FSDP handles GPTQ-LoRA sharding compared to QLoRA. We observe the following in our benchmarks:
Observations on Llama2-70B
1. Lower Memory Consumption for GPTQ-LoRA vs QLoRA for single-device finetuning
We notice for Llama2-70B that GPTQ-LoRA (59.0 GiB) consumes 6 GiB less memory than QLoRA (65.0 GiB) for single-device finetuning.
| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-bnb | NousResearch/Llama-2-70b-hf | 1 | 4 | 445 | 65.0 |
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 1 | 4 | 451 | 59.0 |
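Note on measurement: the tables report the average memory usage per device, but the benchmark harness is not shown here. A minimal sketch of one way to capture a comparable per-rank number, assuming the PyTorch CUDA allocator statistics are an acceptable proxy (nvidia-smi reserved memory will generally read higher):

```python
import torch

def peak_device_memory_gib(tag: str = "") -> float:
    """Print and return the peak CUDA memory allocated on the current device, in GiB."""
    device = torch.cuda.current_device()
    peak_gib = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"{tag} peak allocated on device {device}: {peak_gib:.1f} GiB")
    return peak_gib

# Usage sketch: call torch.cuda.reset_peak_memory_stats() before the training loop,
# then peak_device_memory_gib("after N steps") once throughput has stabilised.
```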
2. Higher Memory Consumption for GPTQ-LoRA vs QLoRA for distributed finetuning
However, GPTQ-LoRA (61.2 GiB/device) consumes 24.9 GiB/device more than QLoRA (36.3 GiB/device) for distributed finetuning.
| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-bnb | NousResearch/Llama-2-70b-hf | 2 | 2 | 422 | ~~29.5~~ 36.3 |
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 2 | 2 | 438 | ~~53.2~~ 61.2 |
3. Minimal memory savings observed when the batch size drops and the number of GPUs increases in distributed finetuning
Despite sharding the model (Ngpus=1 -> Ngpus=2) and halving the per-device batch size (bs=4/device -> bs=2/device) for the 70B model, the memory usage per device did not decrease (59.0 GiB -> 61.2 GiB); see the back-of-envelope sketch after the table below.
| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 1 | 4 | 451 | 59.0 |
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 2 | 2 | 438 | ~~53.2~~ 61.2 |
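For context, a rough back-of-envelope (assuming ~70e9 parameters stored at 4 bits and ideal even sharding; LoRA adapters, activations, optimizer state and allocator overhead are ignored) shows that the per-device share of the quantized base weights should roughly halve when going from 1 to 2 GPUs, which makes the flat 59.0 GiB -> 61.2 GiB numbers above surprising:

```python
# Back-of-envelope: per-device footprint of the 4-bit quantized base weights alone.
# Assumes ~70e9 parameters at 4 bits each, evenly sharded across ranks; LoRA adapters,
# activations, optimizer state and allocator overhead are deliberately ignored.
params = 70e9
bits_per_param = 4

for num_gpus in (1, 2):
    per_device_gib = params * bits_per_param / 8 / num_gpus / 1024**3
    print(f"{num_gpus} GPU(s): ~{per_device_gib:.1f} GiB of base weights per device")

# -> ~32.6 GiB on 1 GPU vs ~16.3 GiB per device on 2 GPUs, so if the quantized
#    weights were actually sharded we would expect a clear per-device drop.
```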
Observations on Mixtral
The same issue occurs to a lesser degree with Mixtral: going from single-device to distributed finetuning, we notice a ~13 GiB overhead for GPTQ-LoRA compared to a 1.6 GiB overhead for QLoRA at the same batch size.
1. GPTQ-LoRA - 13 GiB increase in memory consumption from single-device to distributed finetuning
| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-autogptq | TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 1 | 4 | 1854 | 23.9 |
| accelerated-peft-autogptq | TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 2 | 4 | 1821 | 37.0 |
2. QLoRA - Lower overhead (1.6 GiB) in memory consumption from single-device to distributed finetuning
| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-bnb | mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 4 | 1793 | 24.6 |
| accelerated-peft-bnb | mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 4 | 1731 | 26.2 |
This suggests there could be a memory-leak bug when sharding with the accelerated-peft-autogptq plugin.
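One way to narrow this down (a hypothetical diagnostic, not part of the benchmarks above; it assumes access to the FSDP-wrapped model on each rank after preparation) is to count how many bytes of parameters and buffers each rank physically holds. If the GPTQ quantized tensors (e.g. the `qweight` buffers) show up at full size on every rank, they are being replicated rather than sharded:

```python
import torch
import torch.distributed as dist

def report_local_tensor_bytes(model: torch.nn.Module) -> None:
    """Print how many GiB of parameters and buffers this rank physically holds.

    Hypothetical diagnostic: call on the FSDP-wrapped model on every rank.
    If the per-rank totals match the single-GPU totals, the quantized weights
    are being replicated across ranks instead of sharded.
    """
    param_gib = sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**3
    buffer_gib = sum(b.numel() * b.element_size() for b in model.buffers()) / 1024**3
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"rank {rank}: parameters {param_gib:.1f} GiB, buffers {buffer_gib:.1f} GiB")
```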