This PR adds the ability to use FSDP's `sync_module_states=True` with QLoRA without any modifications to bitsandbytes or PEFT.
`Linear4bit` layers store their quantization information in the `quant_state` dictionary, which doesn't get synced by FSDP's `sync_module_states=True` because that option only syncs parameters and buffers. Additionally, the `Linear4bit` weight's shape changes during quantization, which also prevents `sync_module_states` from working.
These issues are resolved by a custom quantization step that quantizes all `Linear4bit` layers on all GPU ranks, layer by layer, moving each quantized layer to CPU on Rank 0 and replacing it with meta tensors on all non-zero ranks. This leaves the `quant_state` dictionary attached to every rank's `Linear4bit` layers.
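A rough sketch of this per-layer flow, assuming a recent bitsandbytes where the quant state lives on `module.weight.quant_state` (the function name and exact attribute handling are illustrative, not the PR's literal code):

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb

def quantize_per_rank(model: nn.Module, rank: int) -> nn.Module:
    # Quantize every Linear4bit layer on every rank so each rank ends up with
    # a populated quant_state, then keep the real quantized weights only on
    # rank 0 (offloaded to CPU) and meta-device placeholders elsewhere.
    for module in model.modules():
        if isinstance(module, bnb.nn.Linear4bit):
            # Moving the layer to a CUDA device triggers 4-bit quantization
            # and creates weight.quant_state on this rank.
            module.to(torch.device("cuda", torch.cuda.current_device()))
            if rank == 0:
                # Rank 0 keeps the quantized weights, offloaded to CPU, so
                # sync_module_states=True can broadcast them later.
                module.to("cpu")
            else:
                # Other ranks keep only a meta tensor of the quantized shape;
                # the quant_state created above stays attached to the layer.
                quant_state = module.weight.quant_state
                module.weight = nn.Parameter(
                    torch.empty_like(module.weight, device="meta"),
                    requires_grad=False,
                )
                module.weight.quant_state = quant_state
    return model
```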
When initializing the model for QLoRA training with PEFT, PEFT moves the `quant_state` dictionary's tensors to the meta device, which breaks training. To prevent this, this PR replaces `quant_state.to` with a no-op while the PEFT model is being created and restores the original `quant_state.to` once the LoRA modules have been added.
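One way to express that temporary patch, assuming a bitsandbytes version where the quant state is a `QuantState` object exposed from `bitsandbytes.functional` (the context-manager name and the class-level patching are assumptions, not the PR's exact mechanism):

```python
from contextlib import contextmanager
import bitsandbytes.functional as bnb_F

@contextmanager
def freeze_quant_state_device():
    # Temporarily make QuantState.to a no-op so PEFT model creation cannot
    # move quant_state tensors onto the meta device, then restore it.
    original_to = bnb_F.QuantState.to
    bnb_F.QuantState.to = lambda self, *args, **kwargs: self
    try:
        yield
    finally:
        bnb_F.QuantState.to = original_to

# Hypothetical usage during LoRA model creation:
# with freeze_quant_state_device():
#     model = get_peft_model(model, lora_config)
```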
Together, these two modifications allow using the `low_memory` option while finetuning with QLoRA, so users can finetune models that don't fit in a single GPU's memory but can be sharded across multiple GPUs.
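For context, a minimal sketch of how the model might then be wrapped so rank 0's CPU weights are broadcast into the meta-initialized copies on the other ranks (`rank` and `wrapping_policy` are assumed to come from the surrounding training script; this is the standard low-memory FSDP init pattern, not necessarily the PR's exact code):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# param_init_fn materializes the meta tensors on non-zero ranks so that
# sync_module_states=True can broadcast rank 0's weights into them.
model = FSDP(
    model,
    auto_wrap_policy=wrapping_policy,
    device_id=torch.cuda.current_device(),
    sync_module_states=True,
    param_init_fn=(
        None
        if rank == 0
        else lambda module: module.to_empty(device=torch.device("cuda"), recurse=False)
    ),
)
```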