Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Merging adapter weights into pretrained model #109

Closed wj210 closed 7 months ago

wj210 commented 8 months ago

From #41, I can see that util.tensor_parallel.load_tensor_parallel_model_list is used to load the latest weights for the same module name. I assume that is for loading multiple copies of adapter weights or something similar.

What if I want to merge lora_a and lora_b into the pretrained model, and how could I do so? In LoraColumnParallelLinear, Y = WX + BAX. Is there a function to save the model after training with the frozen weight updated as W = W + BA, where BA is the product of the trained adapter weights? Or a function that loads the saved model (frozen pretrained weights plus the adapters), merges them, and deletes lora_a and lora_b, so that I can load a fresh set of lora_a/lora_b on another task and freeze the merged model?

ChrisLiu6 commented 8 months ago

Sorry, we have not implemented the function that merges LoRA into the original weights. However, it should be easy to implement. The following is the pseudo-code:


# ckpts: the saved state dict, mapping parameter names to tensors
for param_name, param in ckpts.items():
    if param_name.endswith(".weight") and param_name[:-7] + ".lora_a.weight" in ckpts:
        lora_a = ckpts[param_name[:-7] + ".lora_a.weight"]
        lora_b = ckpts[param_name[:-7] + ".lora_b.weight"]
        # merge the low-rank update into the frozen weight: W <- W + B @ A
        ckpts[param_name] = param + lora_b @ lora_a
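
For reference, a minimal end-to-end sketch of applying this merge to a saved checkpoint could look like the following. This is not an existing utility in the repo; the file names are placeholders, and it assumes the checkpoint contains both the frozen weights and the LoRA weights:

import torch

ckpts = torch.load("checkpoint_with_lora.pth", map_location="cpu")

merged = dict(ckpts)
for param_name, param in ckpts.items():
    lora_a_name = param_name[:-7] + ".lora_a.weight"
    lora_b_name = param_name[:-7] + ".lora_b.weight"
    if param_name.endswith(".weight") and lora_a_name in ckpts:
        merged[param_name] = param + ckpts[lora_b_name] @ ckpts[lora_a_name]
        # drop the adapters so the merged checkpoint can be loaded without LoRA modules
        del merged[lora_a_name]
        del merged[lora_b_name]

torch.save(merged, "checkpoint_merged.pth")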
wj210 commented 7 months ago

Hi, thanks for your response. Why do we use [:-7]? I suppose that in the attention blocks and FF layers, each layer component (wo, wq, wv, wk in attention) and each linear layer in the FF block has its own lora_a/lora_b pair. All I have to do is apply param = param + lora_b @ lora_a for each of them?

I created this function, which loads the saved LoRA weights and merges them before reinitializing the LoRA weights for the next fine-tuning task. Would this suffice?

# B and A are the loaded lora_b / lora_a weight tensors
AB = torch.matmul(B, A)  # perform the matrix multiplication; make sure dimensions are compatible
self.weight.data.add_(AB)

# reset lora_a and lora_b
self.lora_a = RowParallelLinear(self.in_features, self.lora_rank, bias=False, input_is_parallel=True)
# workaround because trunc_normal_ does not currently support bfloat16
_ = init.trunc_normal_(self.lora_a.weight.data.to(torch.float32), std=.02)
self.lora_a.weight.data.copy_(_)
self.lora_b = nn.Linear(self.lora_rank, self.out_features, bias=False)
nn.init.zeros_(self.lora_b.weight)
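
As a quick sanity check (a toy sketch with made-up shapes, not code from the repo), the merged weight should produce the same output as the un-merged path Y = WX + BAX:

import torch

torch.manual_seed(0)
out_features, in_features, rank = 8, 16, 4
W = torch.randn(out_features, in_features)   # frozen pretrained weight
A = torch.randn(rank, in_features)           # stands in for lora_a.weight
B = torch.randn(out_features, rank)          # stands in for a trained lora_b.weight

x = torch.randn(in_features)
y_unmerged = W @ x + B @ (A @ x)   # Y = WX + BAX
y_merged = (W + B @ A) @ x         # Y = (W + BA)X
assert torch.allclose(y_unmerged, y_merged, atol=1e-4)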

Just to add another question: in https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/1265a34d2e98ac58c7b4fbd65fec730c81f6cfcd/accessory/tools/generate_packed_data.py#L26C13-L26C13, why do we remove the [BOS] token during packing? Would we not want the model to know that the next context is separate?

ChrisLiu6 commented 7 months ago

LoRA:

Why [:-7]: the length of ".weight" is 7, so "llma.layers.0.attn.wq.weight"[:-7] + ".lora_a.weight" == "llma.layers.0.attn.wq.lora_a.weight".

Does your code suffice: yes.

Packed dataset:

In the script, we separate contexts with [EOS] tokens, and [BOS] tokens are discarded. You may also preserve the [BOS] tokens if you like; it should not make a large difference.
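
For illustration, the packing idea can be sketched roughly as follows (this is not the repo's actual script; a LLaMA-style tokenizer exposing encode(s, bos=..., eos=...) is assumed):

def pack_examples(texts, tokenizer, max_len=2048):
    # concatenate tokenized examples, separated by [EOS] and without [BOS]
    stream = []
    for text in texts:
        stream.extend(tokenizer.encode(text, bos=False, eos=True))
    # cut the flat token stream into fixed-length chunks
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]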