Leeroo-AI / mergoo

A library for easily merging multiple LLM experts and efficiently training the merged LLM.
https://www.leeroo.com/
GNU Lesser General Public License v3.0

How to set 'router_layers' when making BERT MoE? #4

Closed gauss5930 closed 2 months ago

gauss5930 commented 2 months ago

I was impressed with your project! However, I'm opening an issue because I have some questions about experimenting with mergoo. I tried to build an MoE BERT model with mergoo, but I ran into some problems while setting up the configuration for creating it.

The configurations for decoder models such as Mistral and Llama are covered in the README, but for encoder models such as BERT there is no way to check how to set router_layers. So, how should I set router_layers when creating an MoE from BERT models?

gitsailor5 commented 2 months ago

Hi @gauss5930, thank you for your interest in mergoo! For BERT models, the supported layer names for router_layers are as follows:

["query", "key", "value", "dense"]

These names correspond to specific layers in the model architecture. Ideally, you should be able to replace any linear layer with an MoE layer.
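
For reference, here is a minimal composition sketch, assuming the same ComposeExperts interface shown in the README for decoder models also applies to BERT; the expert checkpoints, "model_type": "bert", and the output path below are placeholders/assumptions, and only router_layers is taken from the list above:

import torch
from mergoo.compose_experts import ComposeExperts

config = {
    "model_type": "bert",  # assumption: BERT is selected the same way as llama/mistral
    "num_experts_per_tok": 2,
    "experts": [
        # placeholder expert checkpoints; replace with your own fine-tuned BERTs
        {"expert_name": "base_expert", "model_id": "bert-base-uncased"},
        {"expert_name": "expert_1", "model_id": "path/or/hub-id-of-finetuned-bert-1"},
        {"expert_name": "expert_2", "model_id": "path/or/hub-id-of-finetuned-bert-2"},
    ],
    # supported BERT layer names from the list above
    "router_layers": ["query", "key", "value", "dense"],
}

expertmerger = ComposeExperts(config, torch_dtype=torch.float16)
expertmerger.compose()
expertmerger.save_checkpoint("data/bert_moe")  # placeholder output path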

gauss5930 commented 2 months ago

Thank you for the response, @gitsailor5! I want to check my understanding: can I use any linear layer in the model architecture, as I see fit, for the MoE model's router layers? Or are you saying that, when constructing the BERT MoE model, I can only choose from the ["query", "key", "value", "dense"] names you provided?

gitsailor5 commented 2 months ago

For BERT, the MoE replacement works for the layers mentioned above. If you want to extend it to other linear layers, it's straightforward. Here's an example:

Original layer:

self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)

MoE layer:

self.q_proj = convert_linear_to_moe("q_proj", config, layer_idx, self.hidden_size, self.num_heads * self.head_dim, bias=False)

You can find the details here: https://github.com/Leeroo-AI/mergoo/issues/2#issuecomment-2058708081
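
As a quick sketch of which nn.Linear modules in the Hugging Face BERT implementation those supported names map to, you can enumerate them with plain transformers; "bert-base-uncased" is just an example checkpoint:

import torch.nn as nn
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
supported = {"query", "key", "value", "dense"}
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        # mark the linear layers whose final name component matches the supported list
        covered = name.split(".")[-1] in supported
        print(f"{name:60s} {tuple(module.weight.shape)} {'<- supported' if covered else ''}")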

svjack commented 1 month ago

Where can I get some pretrained LoRAs for BERT? And can you share some discussion of the tasks that a merged BERT model would be well suited for?