hello
In Figure 2 (architecture of CuMo), you say CuMo incorporates sparse Top-K MoE blocks into the CLIP vision encoder and the vision-language MLP connector, thereby improving the multimodal LLM's capabilities from the vision side.
So the MoE is only integrated into the CLIP vision encoder and the MLP connector. But I see in your code that the LLM also involves MoE changes.
I want to ask whether the basic structure of the LLM is similar to MoE-LLaVA: is it composed of self-attention layers and MLP layers?
For the LLM, we used Mistral-7B and Mixtral 8x7B. We also tried upcycling the MoE in the LLM, as shown in Table 5, but the upcycled Mistral 4x7B and 8x7B were not as good as the pretrained Mixtral 8x7B.
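To make the discussion above concrete, here is a minimal sketch of a sparse Top-K MoE block of the kind that replaces a dense MLP layer: a router scores each token against every expert, only the Top-K experts run, and their outputs are combined with softmax weights over the selected logits. All names, shapes, and initializations here are illustrative assumptions for exposition, not the actual CuMo implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # tanh approximation of GELU, a common transformer MLP activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class TopKMoEMLP:
    """Hypothetical sparse Top-K MoE replacement for a dense MLP layer."""

    def __init__(self, d_model, d_hidden, num_experts, top_k):
        self.top_k = top_k
        # each expert is a 2-layer MLP: up-projection then down-projection
        self.w_up = rng.standard_normal((num_experts, d_model, d_hidden)) * 0.02
        self.w_down = rng.standard_normal((num_experts, d_hidden, d_model)) * 0.02
        # router maps each token to one logit per expert
        self.w_router = rng.standard_normal((d_model, num_experts)) * 0.02

    def __call__(self, x):
        # x: (num_tokens, d_model)
        logits = x @ self.w_router                           # (tokens, experts)
        topk = np.argsort(logits, axis=-1)[:, -self.top_k:]  # indices of chosen experts
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            sel = topk[t]
            # softmax over only the selected experts' logits
            g = np.exp(logits[t, sel] - logits[t, sel].max())
            g /= g.sum()
            for weight, e in zip(g, sel):
                h = gelu(x[t] @ self.w_up[e])
                out[t] += weight * (h @ self.w_down[e])
        return out

moe = TopKMoEMLP(d_model=8, d_hidden=16, num_experts=4, top_k=2)
tokens = rng.standard_normal((3, 8))   # 3 tokens, model dim 8
y = moe(tokens)
print(y.shape)  # (3, 8): same shape as the input, like a dense MLP layer
```

The key property is that only `top_k` of the `num_experts` expert MLPs are evaluated per token, so capacity grows with the number of experts while per-token compute stays roughly fixed; this is the general idea behind dropping such a block into the CLIP encoder or the MLP connector.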