This PR adds the ability to use FSDP's `sync_module_states=True` with QLoRA without any modifications to bitsandbytes or PEFT.
`Linear4bit` layers store their quantization information in the `quant_state` dictionary, which doesn't get synced by FSDP's `sync_module_states=True` because that option only syncs parameters and buffers. Additionally, the `Linear4bit` weight's shape changes during quantization, which also prevents `sync_module_states` from working.
These issues are resolved by a custom quantization step that quantizes all `Linear4bit` layers on all GPU ranks, layer by layer, moving each quantized layer to CPU on Rank 0 and replacing it with meta tensors on all non-zero ranks. This leaves the `quant_state` dictionary attached to every rank's `Linear4bit` layers.
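A rough sketch of this per-layer flow, assuming a recent bitsandbytes where the quant state lives on `module.weight.quant_state` (the function name and exact attribute handling are illustrative, not the PR's literal code):

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb

def quantize_per_rank(model: nn.Module, rank: int) -> nn.Module:
    # Quantize every Linear4bit layer on every rank so each rank ends up with
    # a populated quant_state, then keep the real quantized weights only on
    # rank 0 (offloaded to CPU) and meta-device placeholders elsewhere.
    for module in model.modules():
        if isinstance(module, bnb.nn.Linear4bit):
            # Moving the layer to a CUDA device triggers 4-bit quantization
            # and creates weight.quant_state on this rank.
            module.to(torch.device("cuda", torch.cuda.current_device()))
            if rank == 0:
                # Rank 0 keeps the quantized weights, offloaded to CPU, so
                # sync_module_states=True can broadcast them later.
                module.to("cpu")
            else:
                # Other ranks keep only a meta tensor of the quantized shape;
                # the quant_state created above stays attached to the layer.
                quant_state = module.weight.quant_state
                module.weight = nn.Parameter(
                    torch.empty_like(module.weight, device="meta"),
                    requires_grad=False,
                )
                module.weight.quant_state = quant_state
    return model
```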
When initializing the model for QLoRA training with PEFT, PEFT moves the `quant_state` dictionary's tensors to the meta device, which breaks training. To prevent this, this PR replaces `quant_state.to` with a no-op while the PEFT model is being created and restores the original `quant_state.to` once the LoRA modules have been added.
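One way to express that temporary patch, assuming a bitsandbytes version where the quant state is a `QuantState` object exposed from `bitsandbytes.functional` (the context-manager name and the class-level patching are assumptions, not the PR's exact mechanism):

```python
from contextlib import contextmanager
import bitsandbytes.functional as bnb_F

@contextmanager
def freeze_quant_state_device():
    # Temporarily make QuantState.to a no-op so PEFT model creation cannot
    # move quant_state tensors onto the meta device, then restore it.
    original_to = bnb_F.QuantState.to
    bnb_F.QuantState.to = lambda self, *args, **kwargs: self
    try:
        yield
    finally:
        bnb_F.QuantState.to = original_to

# Hypothetical usage during LoRA model creation:
# with freeze_quant_state_device():
#     model = get_peft_model(model, lora_config)
```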
Together, these two modifications allow using the `low_memory` option while finetuning with QLoRA, so users can finetune models that don't fit in a single GPU's memory but can be sharded across multiple GPUs.
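For context, a minimal sketch of how the model might then be wrapped so rank 0's CPU weights are broadcast into the meta-initialized copies on the other ranks (`rank` and `wrapping_policy` are assumed to come from the surrounding training script; this is the standard low-memory FSDP init pattern, not necessarily the PR's exact code):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# param_init_fn materializes the meta tensors on non-zero ranks so that
# sync_module_states=True can broadcast rank 0's weights into them.
model = FSDP(
    model,
    auto_wrap_policy=wrapping_policy,
    device_id=torch.cuda.current_device(),
    sync_module_states=True,
    param_init_fn=(
        None
        if rank == 0
        else lambda module: module.to_empty(device=torch.device("cuda"), recurse=False)
    ),
)
```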