
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
https://flexflow.readthedocs.io
Apache License 2.0

Applying LoRA Layers to Attention Operators #1381

Closed: april-yyt closed this pull request 1 month ago

april-yyt commented 2 months ago

Description of changes:

This pull request adds support for LoRA layers in the Incremental Multi-Head Self-Attention (IncMultiHeadSelfAttention) operator. LoRA (Low-Rank Adaptation) adds small low-rank adapter modules to the attention projections so that models can be fine-tuned efficiently without updating the full pre-trained weights.
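
For reference, the computation a LoRA adapter contributes to a single projection is just a scaled low-rank residual on top of the frozen base weight. Below is a minimal NumPy sketch of that math; the shapes and the alpha/rank scaling follow the original LoRA formulation, not any FlexFlow-specific layout:

```python
import numpy as np

def lora_projection(x, W, A, B, alpha):
    """Frozen projection W plus its LoRA update.

    x: (seq_len, in_dim) activations
    W: (in_dim, out_dim) frozen pre-trained projection weight
    A: (in_dim, r) LoRA down-projection, B: (r, out_dim) LoRA up-projection
    """
    r = A.shape[1]                        # LoRA rank, taken from the adapter shapes
    base = x @ W                          # original projection, weights unchanged
    update = (x @ A) @ B * (alpha / r)    # low-rank correction learned by fine-tuning
    return base + update
```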

Most existing LoRA models apply the adapter layers to the attention query/key/value projections and the output projection. This pull request enables FlexFlow to support such models by splitting the fused QKV projection into separate query, key, and value linear layers and exposing the output projection as its own linear layer, so that LoRA adapters can be applied to each projection individually.
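
To illustrate why the split matters, here is a schematic NumPy sketch (not the FlexFlow CUDA path) in which each of the q/k/v/output projections is its own linear layer and a LoRA adapter is attached only to the projections a checkpoint actually provides adapters for; the attention computation between the value and output projections is omitted:

```python
import numpy as np

def project_with_optional_lora(x, base_weights, adapters, alpha=16):
    """Apply separate q/k/v/output projections, each with an optional LoRA adapter.

    base_weights: dict {'q', 'k', 'v', 'o'} -> (in_dim, out_dim) frozen matrices
    adapters:     dict mapping a subset of {'q', 'k', 'v', 'o'} to (A, B) pairs
    """
    outputs = {}
    for name, W in base_weights.items():
        y = x @ W                              # each projection is its own linear layer
        if name in adapters:                   # adapter only where one is provided
            A, B = adapters[name]
            y = y + (x @ A) @ B * (alpha / A.shape[1])
        outputs[name] = y
    return outputs
```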

Changes:

- IncMultiHeadSelfAttentionMeta struct
- CUDA kernels
- Initialization and memory allocation

Usage

To use the LoRA functionality in FlexFlow, users need to specify the LoRA configuration when creating the IncMultiHeadSelfAttention operator. This includes setting the lora_q_proj, lora_k_proj, lora_v_proj, and lora_output_proj parameters to indicate which projections should have LoRA applied.
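
A hypothetical Python sketch of what that configuration could look like; only the four lora_*_proj flags are named in this PR, while the operator call, its other arguments, and their names are assumptions about the FlexFlow Python binding rather than the exact API:

```python
# Hypothetical sketch: argument names other than the lora_* flags are placeholders,
# not the exact FlexFlow API.
attn = ffmodel.inc_multihead_self_attention(
    input_tensor,
    embed_dim=4096,
    num_heads=32,
    lora_q_proj=True,        # apply LoRA to the query projection
    lora_k_proj=False,       # leave the key projection unadapted
    lora_v_proj=True,        # apply LoRA to the value projection
    lora_output_proj=True,   # apply LoRA to the output projection
)
```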

The LoRA weights should be passed to the model along with the pre-trained weights. The framework will handle the initialization and application of the LoRA layers during inference.
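
For context, each projection with LoRA enabled contributes one pair of low-rank matrices in addition to the frozen base weight. The sketch below shows the shapes involved; the dimensions and the dictionary layout are illustrative, not the serialization format FlexFlow expects:

```python
import numpy as np

embed_dim, rank = 4096, 8   # illustrative hidden size and LoRA rank

# One (A, B) pair per adapted projection; the base q/k/v/output weights stay
# frozen and come from the pre-trained checkpoint. Zeros are placeholders for
# the values produced by a fine-tuning run.
lora_weights = {
    "q_proj": (np.zeros((embed_dim, rank)), np.zeros((rank, embed_dim))),
    "v_proj": (np.zeros((embed_dim, rank)), np.zeros((rank, embed_dim))),
    "output_proj": (np.zeros((embed_dim, rank)), np.zeros((rank, embed_dim))),
}
```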

Testing (TODO)

The modified implementation should be validated against existing LoRA models: run inference with LoRA-adapted models in FlexFlow and compare the outputs against a reference Python implementation to confirm they match.
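
One way such a comparison could look, using Hugging Face transformers and PEFT as the reference implementation; the model and adapter names are illustrative, and the FlexFlow side is left as a placeholder since its serving API is outside the scope of this sketch:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"          # illustrative base model
adapter_id = "example-org/llama-2-7b-lora"    # illustrative LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(base_id)
reference = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16),
    adapter_id,
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    ref_ids = reference.generate(**inputs, max_new_tokens=32, do_sample=False)
ref_text = tokenizer.decode(ref_ids[0], skip_special_tokens=True)

# ff_text = run_flexflow_inference(prompt)   # placeholder for the FlexFlow run
# assert ff_text == ref_text, "FlexFlow output diverges from the PEFT reference"
```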

Related Issues:

Linked Issues:

Issues closed by this PR:

