
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
https://flexflow.readthedocs.io
Apache License 2.0

Applying LoRA Layers to Attention Operators #1381

Closed: april-yyt closed this pull request 1 month ago

april-yyt commented 2 months ago

Description of changes:

This pull request adds support for LoRA layers in the Incremental Multi-Head Self-Attention (IncMultiHeadSelfAttention) operator. LoRA (Low-Rank Adaptation) adds small low-rank adapter modules to the attention projections so that models can be fine-tuned efficiently without updating the full pre-trained weights.
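
For reference, the computation a LoRA adapter contributes to a single projection is just a scaled low-rank residual on top of the frozen base weight. Below is a minimal NumPy sketch of that math; the shapes and the alpha/rank scaling follow the original LoRA formulation, not any FlexFlow-specific layout:

```python
import numpy as np

def lora_projection(x, W, A, B, alpha):
    """Frozen projection W plus its LoRA update.

    x: (seq_len, in_dim) activations
    W: (in_dim, out_dim) frozen pre-trained projection weight
    A: (in_dim, r) LoRA down-projection, B: (r, out_dim) LoRA up-projection
    """
    r = A.shape[1]                        # LoRA rank, taken from the adapter shapes
    base = x @ W                          # original projection, weights unchanged
    update = (x @ A) @ B * (alpha / r)    # low-rank correction learned by fine-tuning
    return base + update
```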

Most existing LoRA models apply the adapter layers to the attention query/key/value projections and the output projection. This pull request enables FlexFlow to support such models by splitting the fused QKV projection into separate query, key, and value linear layers and exposing the output projection as its own linear layer, so that LoRA adapters can be applied to each projection individually.
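
To illustrate why the split matters, here is a schematic NumPy sketch (not the FlexFlow CUDA path) in which each of the q/k/v/output projections is its own linear layer and a LoRA adapter is attached only to the projections a checkpoint actually provides adapters for; the attention computation between the value and output projections is omitted:

```python
import numpy as np

def project_with_optional_lora(x, base_weights, adapters, alpha=16):
    """Apply separate q/k/v/output projections, each with an optional LoRA adapter.

    base_weights: dict {'q', 'k', 'v', 'o'} -> (in_dim, out_dim) frozen matrices
    adapters:     dict mapping a subset of {'q', 'k', 'v', 'o'} to (A, B) pairs
    """
    outputs = {}
    for name, W in base_weights.items():
        y = x @ W                              # each projection is its own linear layer
        if name in adapters:                   # adapter only where one is provided
            A, B = adapters[name]
            y = y + (x @ A) @ B * (alpha / A.shape[1])
        outputs[name] = y
    return outputs
```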

Changes:

- IncMultiHeadSelfAttentionMeta struct
- CUDA kernels
- Initialization and memory allocation

Usage

To use the LoRA functionality in FlexFlow, users need to specify the LoRA configuration when creating the IncMultiHeadSelfAttention operator. This includes setting the lora_q_proj, lora_k_proj, lora_v_proj, and lora_output_proj parameters to indicate which projections should have LoRA applied.
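
A hypothetical Python sketch of what that configuration could look like; only the four lora_*_proj flags are named in this PR, while the operator call, its other arguments, and their names are assumptions about the FlexFlow Python binding rather than the exact API:

```python
# Hypothetical sketch: argument names other than the lora_* flags are placeholders,
# not the exact FlexFlow API.
attn = ffmodel.inc_multihead_self_attention(
    input_tensor,
    embed_dim=4096,
    num_heads=32,
    lora_q_proj=True,        # apply LoRA to the query projection
    lora_k_proj=False,       # leave the key projection unadapted
    lora_v_proj=True,        # apply LoRA to the value projection
    lora_output_proj=True,   # apply LoRA to the output projection
)
```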

The LoRA weights should be passed to the model along with the pre-trained weights. The framework will handle the initialization and application of the LoRA layers during inference.
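
For context, each projection with LoRA enabled contributes one pair of low-rank matrices in addition to the frozen base weight. The sketch below shows the shapes involved; the dimensions and the dictionary layout are illustrative, not the serialization format FlexFlow expects:

```python
import numpy as np

embed_dim, rank = 4096, 8   # illustrative hidden size and LoRA rank

# One (A, B) pair per adapted projection; the base q/k/v/output weights stay
# frozen and come from the pre-trained checkpoint. Zeros are placeholders for
# the values produced by a fine-tuning run.
lora_weights = {
    "q_proj": (np.zeros((embed_dim, rank)), np.zeros((rank, embed_dim))),
    "v_proj": (np.zeros((embed_dim, rank)), np.zeros((rank, embed_dim))),
    "output_proj": (np.zeros((embed_dim, rank)), np.zeros((rank, embed_dim))),
}
```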

Testing (TODO)

The modified implementation should be validated against existing LoRA models: run inference with LoRA-adapted models in FlexFlow and compare the outputs against a reference Python implementation to confirm they match.
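
One way such a comparison could look, using Hugging Face transformers and PEFT as the reference implementation; the model and adapter names are illustrative, and the FlexFlow side is left as a placeholder since its serving API is outside the scope of this sketch:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"          # illustrative base model
adapter_id = "example-org/llama-2-7b-lora"    # illustrative LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(base_id)
reference = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16),
    adapter_id,
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    ref_ids = reference.generate(**inputs, max_new_tokens=32, do_sample=False)
ref_text = tokenizer.decode(ref_ids[0], skip_special_tokens=True)

# ff_text = run_flexflow_inference(prompt)   # placeholder for the FlexFlow run
# assert ff_text == ref_text, "FlexFlow output diverges from the PEFT reference"
```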

Related Issues:

Linked Issues:

Issues closed by this PR:

