TUDB-Labs / MoE-PEFT

An Efficient LLM Fine-Tuning Factory Optimized for MoE PEFT

Question about the router loss of MixLoRA #6

Closed LouisDo2108 closed 1 day ago

LouisDo2108 commented 2 days ago

Hi,

I am currently implementing a custom MixLoRA model based on your code and had a question regarding the router loss.

From what I understand, the router loss is calculated for each layer that employs MixLoRA. Could you confirm if the final router loss for the entire model is the sum of the router losses computed at each MixLoRA-enabled layer?

https://github.com/TUDB-Labs/MoE-PEFT/blob/50984e12202b899926bd469a1deeab155e534018/moe_peft/model.py#L505-L507

Thank you in advance for your clarification!

mikecovlee commented 2 days ago

Hello, I'm Dengchun, the author of MixLoRA. Most MoE models, including MixLoRA, Mixtral, and Switch Transformers, work in very similar ways: they all compute the router balance loss at the end of the forward pass. The router logits output by each layer, each of shape (batch_size * sequence_length, num_experts), are concatenated along the first dimension before the final computation. You can check this at: https://github.com/TUDB-Labs/MoE-PEFT/blob/50984e12202b899926bd469a1deeab155e534018/moe_peft/modules/mix_lora.py#L109-L120
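
For reference, a minimal sketch of this concatenate-then-compute pattern, following the Switch Transformer / Mixtral style of auxiliary balance loss (this is illustrative, not the exact MoE-PEFT code; the function name `router_balance_loss` and its arguments are assumptions):

```python
import torch
import torch.nn.functional as F


def router_balance_loss(all_router_logits, num_experts, top_k):
    # Concatenate the per-layer router logits along the token dimension.
    # Each entry has shape (batch_size * sequence_length, num_experts).
    concatenated = torch.cat(all_router_logits, dim=0)

    routing_weights = F.softmax(concatenated, dim=-1)
    _, selected_experts = torch.topk(routing_weights, top_k, dim=-1)

    # One-hot mask of which experts each token was routed to:
    # shape (num_tokens, top_k, num_experts).
    expert_mask = F.one_hot(selected_experts, num_experts)

    # Fraction of routing assignments that went to each expert.
    tokens_per_expert = expert_mask.float().mean(dim=(0, 1))
    # Average router probability assigned to each expert.
    router_prob_per_expert = routing_weights.mean(dim=0)

    # Auxiliary loss scaled by the number of experts, so a perfectly
    # balanced router yields a loss of 1.0.
    return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert)
```

The single loss returned here would then be scaled by a coefficient and added to the task loss, rather than summing one loss per layer, which is the distinction asked about above.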