Hello, I'm Dengchun, the author of MixLoRA. Most MoE models, including MixLoRA, Mixtral, and Switch Transformers, work in much the same way: they compute the router balance loss at the end of the forward pass. The router logits output by each layer, each of shape (batch_size * sequence_length, num_experts), are concatenated along the first dimension before a single final loss is computed. You can check this at:
https://github.com/TUDB-Labs/MoE-PEFT/blob/50984e12202b899926bd469a1deeab155e534018/moe_peft/modules/mix_lora.py#L109-L120
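For readers following along, here is a minimal sketch of the concatenate-then-compute pattern described above, in the style of the Switch Transformers / Mixtral auxiliary load-balancing loss. The function name and the exact scaling are my own choices for illustration; the authoritative implementation is at the link above.

```python
import torch
import torch.nn.functional as F


def router_balance_loss(router_logits, num_experts, top_k=2):
    """Sketch of a Switch/Mixtral-style auxiliary load-balancing loss.

    router_logits: list of per-layer tensors, each of shape
        (batch_size * sequence_length, num_experts).
    The per-layer logits are concatenated along dim 0 first, so one loss
    is computed over all layers rather than summing per-layer losses.
    """
    # Concatenate all layers: (num_layers * tokens, num_experts)
    logits = torch.cat(router_logits, dim=0)
    probs = F.softmax(logits, dim=-1)

    # Experts actually selected per token (top-k routing)
    _, selected = torch.topk(probs, top_k, dim=-1)
    expert_mask = F.one_hot(selected, num_experts).float()  # (tokens, top_k, experts)

    # Fraction of routing slots assigned to each expert
    tokens_per_expert = expert_mask.mean(dim=(0, 1))
    # Mean router probability assigned to each expert
    router_prob_per_expert = probs.mean(dim=0)

    # Encourages a uniform expert load; scaled by num_experts so a
    # perfectly balanced router yields a loss around 1.
    return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert)
```

In this formulation the loss is small when routing is balanced and grows when a few experts receive most of the traffic, which is what pushes the router toward uniform expert utilization.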
Hi,
I am currently implementing a custom MixLoRA model based on your code and had a question regarding the router loss.
From what I understand, the router loss is calculated for each layer that employs MixLoRA. Could you confirm whether the final router loss for the entire model is the sum of the router losses computed at each MixLoRA-enabled layer?
https://github.com/TUDB-Labs/MoE-PEFT/blob/50984e12202b899926bd469a1deeab155e534018/moe_peft/model.py#L505-L507
Thank you in advance for your clarification!