SHI-Labs / CuMo

CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts

Is MoE used for the LLM? #12

Open dana-niu opened 6 days ago

dana-niu commented 6 days ago

Hello. In Figure 2 (Architecture of CuMo), the caption says that CuMo incorporates sparse Top-K MoE blocks into the CLIP vision encoder and the vision-language MLP connector, thereby improving the multimodal LLM capabilities from the vision side.

That suggests MoE is only integrated into the CLIP vision encoder and the MLP connector, but I see in your code that the LLM also involves MoE changes.

I'd like to ask: is the basic structure of the LLM similar to MoE-LLaVA, i.e., composed of self-attention layers and MLP layers?
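
For reference, my mental model of a sparse Top-K MoE block replacing a dense MLP is roughly the sketch below. This is only an illustration of the general technique; `MoEBlock`, `num_experts`, and `top_k` are placeholder names I made up, not taken from the CuMo code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEBlock(nn.Module):
    """Sparse Top-K MoE block: a router picks K experts per token,
    and each expert is an MLP with the same shape as the original dense MLP."""
    def __init__(self, dim, hidden_dim, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (num_tokens, dim)
        logits = self.router(x)                # (num_tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the K selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```

In this sketch each expert has the same shape as the original dense MLP, which is what would make initializing the experts from pretrained MLP weights possible.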

chrisjuniorli commented 6 days ago

For the LLM, we used Mistral-7B and Mixtral 8x7B. We also tried upcycled MoE in the LLM, as shown in Table 5, but the upcycled Mistral 4x7B and 8x7B were not as good as the pretrained Mixtral 8x7B.
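
To clarify what upcycling means here: each expert in an MoE block starts from a copy of the pretrained dense MLP weights, while the router is newly initialized, and the experts then specialize during fine-tuning. A minimal sketch of that initialization step, assuming a simple two-layer MLP; the names and dimensions below are illustrative, not the actual repo code:

```python
import copy
import torch.nn as nn

def upcycle_from_dense(dense_mlp: nn.Sequential, num_experts: int = 4) -> nn.ModuleList:
    """Build MoE experts by copying the pretrained dense MLP into every expert.
    All experts start identical and diverge during later fine-tuning."""
    return nn.ModuleList([copy.deepcopy(dense_mlp) for _ in range(num_experts)])

# Usage: replace a pretrained dense MLP with upcycled experts plus a fresh router.
dense_mlp = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
experts = upcycle_from_dense(dense_mlp, num_experts=4)
router = nn.Linear(1024, 4)  # router weights are newly initialized, not upcycled
```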