bilelomrani1 opened this issue 1 year ago
Thanks for the suggestion! Fine-tuning (and LoRA support) is already on our roadmap, so we'll definitely be looking into this.
Thank you, that's great! Out of curiosity, is your roadmap public and visible somewhere?
It's not publicly available at the moment.
Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient finetuning technique for adapting a base language model to a specific task. `curated-transformers` already supports dynamic quantization using `bitsandbytes`, so adding some utilities to inject trainable adapters would open the door to using `curated-transformers` as a replacement for the HuggingFace `transformers` + `peft` stack. This could also enable a very nice finetuning integration into spaCy in the future.

For reference, I find this implementation in `lit-gpt` really readable.

Do you find this idea interesting? If so, as for the user-facing API, drawing inspiration from HuggingFace `peft`, it could look something like the sketch below.
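A minimal sketch of what such an API might look like, loosely modeled on `peft`'s `LoraConfig` / `get_peft_model` pattern. All of the names here (`LoraConfig`, `LoraLinear`, `inject_lora`, the `target_modules` field) are hypothetical and not part of the current `curated-transformers` API; the code just wraps `torch.nn.Linear` layers with trainable low-rank adapters to illustrate the idea:

```python
# Hypothetical user-facing API sketch: none of these names exist in
# curated-transformers today; this only illustrates the idea.
from dataclasses import dataclass, field
from typing import List

import torch
from torch import nn


@dataclass
class LoraConfig:
    r: int = 8            # rank of the low-rank update
    alpha: int = 16       # scaling factor; the update is scaled by alpha / r
    dropout: float = 0.0
    # Submodule names to wrap; defaults are illustrative only.
    target_modules: List[str] = field(default_factory=lambda: ["query", "value"])


class LoraLinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update."""

    def __init__(self, base: nn.Linear, config: LoraConfig):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, config.r, bias=False)
        self.lora_b = nn.Linear(config.r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.dropout = nn.Dropout(config.dropout)
        self.scaling = config.alpha / config.r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(self.dropout(x)))


def inject_lora(model: nn.Module, config: LoraConfig) -> nn.Module:
    """Replace targeted nn.Linear submodules with LoRA-wrapped versions, in place."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear) and any(t in name for t in config.target_modules):
            setattr(model, name, LoraLinear(module, config))
        else:
            inject_lora(module, config)
    return model


# Usage sketch (the model-loading call would be whatever entry point
# curated-transformers exposes; the injection step is the new utility):
#     model = ...  # load a curated-transformers model
#     model = inject_lora(model, LoraConfig(r=8, alpha=16))
#     # only the lora_a / lora_b parameters require gradients now
```

With something along these lines, only the adapter parameters would be trainable, and the frozen base weights could still go through the existing `bitsandbytes` quantization path.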