Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development

Merging adapter weights into pretrained model #109

Closed wj210 closed 7 months ago

wj210 commented 8 months ago

From #41, I can see that util.tensor_parallel.load_tensor_parallel_model_list is used to load the latest weights for the same module name. I assume that is meant for handling multiple copies of adapter weights.

What if I want to merge lora_a and lora_b into the pretrained model? In LoraColumnParallelLinear, Y = WX + BAX. Is there a function that saves the model after training with the frozen weight updated to W = W + BA, where BA is the product of the trained adapter weights? Alternatively, is there a function that loads the saved model (frozen pretrained weights plus the adapter), merges them, and deletes lora_a and lora_b, so that I can attach a fresh set of lora_a and lora_b for another task and freeze the merged model?

ChrisLiu6 commented 8 months ago

Sorry, we have not implemented the function that merges LoRA into the original weights. However, it should be easy to implement. The following is the pseudo-code:

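# ckpts is a dict mapping parameter names to tensors
for param_name in list(ckpts.keys()):
  prefix = param_name[:-7]  # strip the trailing ".weight" (7 characters)
  if param_name.endswith(".weight") and prefix + ".lora_a.weight" in ckpts:
    lora_a = ckpts[prefix + ".lora_a.weight"]
    lora_b = ckpts[prefix + ".lora_b.weight"]
    # fold the low-rank update into the frozen weight: W <- W + B @ A
    ckpts[param_name] = ckpts[param_name] + lora_b @ lora_a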
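
The pseudo-code above can be fleshed out into a small, self-contained sketch. It relies on the identity Y = WX + BAX = (W + BA)X, so adding B @ A to the frozen weight reproduces the adapter's output. This is not a function from the repository: `merge_lora`, the `scale` argument, and the toy key names are hypothetical, and the sketch assumes a single (non-tensor-parallel) checkpoint shard whose LoRA weights follow the `<prefix>.lora_a.weight` / `<prefix>.lora_b.weight` naming.

```python
import torch


def merge_lora(ckpt, scale=1.0):
    """Return a new state dict with each LoRA pair folded into its base weight.

    Hypothetical helper, not part of LLaMA2-Accessory. `scale` is an assumed
    knob for setups that scale the LoRA branch; 1.0 means plain W + B @ A.
    """
    suffix = ".weight"
    merged = {}
    for name, param in ckpt.items():
        if ".lora_a." in name or ".lora_b." in name:
            continue  # adapter tensors are dropped once merged
        prefix = name[: -len(suffix)]
        a_key = prefix + ".lora_a.weight"
        b_key = prefix + ".lora_b.weight"
        if name.endswith(suffix) and a_key in ckpt and b_key in ckpt:
            lora_a = ckpt[a_key]  # shape (rank, in_features)
            lora_b = ckpt[b_key]  # shape (out_features, rank)
            # Compute in float32, then cast back to the base weight's dtype.
            delta = (lora_b.float() @ lora_a.float()) * scale
            param = (param.float() + delta).to(param.dtype)
        merged[name] = param
    return merged


# Toy checkpoint: one linear layer with a rank-4 adapter, plus an unrelated norm.
torch.manual_seed(0)
ckpt = {
    "layer.wq.weight": torch.randn(8, 16),
    "layer.wq.lora_a.weight": torch.randn(4, 16),
    "layer.wq.lora_b.weight": torch.randn(8, 4),
    "layer.norm.weight": torch.ones(16),
}
merged = merge_lora(ckpt)

# The merged weight should give the same output as base + adapter branch.
x = torch.randn(2, 16)
before = x @ ckpt["layer.wq.weight"].T + (
    x @ ckpt["layer.wq.lora_a.weight"].T
) @ ckpt["layer.wq.lora_b.weight"].T
after = x @ merged["layer.wq.weight"].T
print(sorted(merged))
print(torch.allclose(before, after, atol=1e-3))
```

After merging, the returned dict no longer contains any lora_a or lora_b entries, so a fresh adapter can be initialized on top of it for the next task. With tensor-parallel checkpoints the shards would likely need to be merged per rank, which this sketch does not cover.
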
wj210 commented 7 months ago

Hi, thanks for your response. Why do we use [:-7]? I suppose that in the attention blocks and FF layers, each layer component (w0, wq, wv, wk in attention, and the regular layers in FF) has its own lora_a and lora_b. So all I have to do is apply param = param + lora_b @ lora_a to each of them?

I created this function, which loads the saved LoRA weights and merges them, then reinitializes the LoRA weights for the next fine-tuning task. Would this suffice?

            # Perform matrix multiplication (BA)

            AB = torch.matmul(B,A)  # Make sure dimensions are compatible


            # reset lora_a and lora_b
            self.lora_a = RowParallelLinear(self.in_features, self.lora_rank, bias=False, input_is_parallel=True)
            # workaround because trunc_normal_ does not currently support bfloat16
            _ = init.trunc_normal_(, std=.02)
            self.lora_b = nn.Linear(self.lora_rank, self.out_features, bias=False)

One more question: why do we remove the BOS token during packing? Wouldn't we want the model to know that the next context is separate?

ChrisLiu6 commented 7 months ago


Why [:-7]:

The length of ".weight" is 7, so "llma.layers.0.attn.wq.weight"[:-7] + ".lora_a.weight" == "llma.layers.0.attn.wq.lora_a.weight"
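
To make the slicing concrete, here is a minimal sketch of deriving the adapter key names from a base weight name. `lora_keys` is a hypothetical helper written for illustration; using `len(suffix)` instead of the literal 7 is just a way to avoid the magic number.

```python
def lora_keys(param_name, suffix=".weight"):
    """Map a base weight name to its (lora_a, lora_b) weight names."""
    if not param_name.endswith(suffix):
        raise ValueError(f"{param_name!r} does not end with {suffix!r}")
    prefix = param_name[: -len(suffix)]  # same as param_name[:-7] for ".weight"
    return prefix + ".lora_a.weight", prefix + ".lora_b.weight"


print(len(".weight"))
a_key, b_key = lora_keys("llma.layers.0.attn.wq.weight")
print(a_key)
print(b_key)
```
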

Does your code suffice:


Packed dataset:

In the script, we separate contexts with [EOS] tokens, and the [BOS] tokens are discarded. You may also preserve the [BOS] tokens if you like; it should not make a large difference.
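
As a rough illustration of the packing scheme described above (not the repository's actual script), the sketch below concatenates tokenized contexts into fixed-size blocks, using [EOS] as the separator and optionally keeping [BOS]. The `pack` function, the `keep_bos` flag, and the token ids 1 and 2 are assumptions made for the example.

```python
BOS, EOS = 1, 2  # assumed token ids, for illustration only


def pack(sequences, block_size, keep_bos=False):
    """Concatenate token sequences into blocks of at most `block_size` tokens.

    Each context ends with EOS, which acts as the separator; a leading BOS
    is dropped unless keep_bos is True.
    """
    stream = []
    for seq in sequences:
        seq = list(seq)
        if not keep_bos and seq and seq[0] == BOS:
            seq = seq[1:]
        if not seq or seq[-1] != EOS:
            seq.append(EOS)
        stream.extend(seq)
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]


docs = [[BOS, 10, 11, EOS], [BOS, 20, 21, 22, EOS]]
without_bos = pack(docs, block_size=4)
with_bos = pack(docs, block_size=4, keep_bos=True)
print(without_bos)  # [[10, 11, 2, 20], [21, 22, 2]]
print(with_bos)     # [[1, 10, 11, 2], [1, 20, 21, 22], [2]]
```

Either way, every context boundary is still marked by an [EOS] token, which is why keeping or dropping [BOS] is expected to matter little.
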