[Open] nxphi47 opened this issue 6 months ago
Hi, this is awesome work. I'm wondering if there is a minimal way to integrate megablocks into the transformers codebase for the Mixtral architecture. Would simply replacing the MixtralSparseMoeBlock with dmoe.dMoE, with the proper configuration, work? Thanks!
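For concreteness, the swap being asked about might look roughly like the untested sketch below. The Arguments field names are assumptions to be checked against megablocks.layers.arguments.Arguments in your installed version, and build_dmoe and DMoEBlock are hypothetical helpers, not part of either library:

```python
# Untested sketch; verify field names against megablocks.layers.arguments.Arguments.
import torch
from megablocks.layers.arguments import Arguments
from megablocks.layers.dmoe import dMoE


def build_dmoe(config):
    """Hypothetical helper: map a transformers MixtralConfig onto megablocks Arguments."""
    args = Arguments(
        hidden_size=config.hidden_size,
        ffn_hidden_size=config.intermediate_size,
        num_layers=config.num_hidden_layers,        # every Mixtral layer has an MoE block
        moe_num_experts=config.num_local_experts,   # 8 for Mixtral-8x7B
        moe_top_k=config.num_experts_per_tok,       # 2 for Mixtral-8x7B
        moe_loss_weight=config.router_aux_loss_coef,
        # Mixtral renormalizes the top-k routing weights to sum to 1;
        # this matches that post-top-k normalization (see the reply below).
        moe_normalize_expert_weights=1.0,
        bf16=True,
    )
    return dMoE(args)


class DMoEBlock(torch.nn.Module):
    """Hypothetical adapter: MixtralDecoderLayer expects its MoE block to
    return (hidden_states, router_logits)."""

    def __init__(self, config):
        super().__init__()
        self.dmoe = build_dmoe(config)

    def forward(self, hidden_states):
        out = self.dmoe(hidden_states)
        if isinstance(out, tuple):  # dMoE may return (output, bias)
            out = out[0]
        # megablocks tracks its load-balancing loss internally instead of
        # returning router logits, so run with output_router_logits=False
        # and collect the loss separately (see the training-step sketch below).
        return out, None


# Swapping the block into every decoder layer would then be roughly:
# for layer in model.model.layers:
#     layer.block_sparse_moe = DMoEBlock(model.config)
```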
Hi! I believe that would work. In addition to configuring the expert count/top-k appropriately, you'll want to set moe_normalize_expert_weights to 1.0 to match their post-top-k expert weight normalization. You'll also have to handle any differences in how the load-balancing loss is computed and returned.
Please let us know if you encounter any issues and we'd be more than happy to help debug :)
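On the load-balancing-loss point: megablocks' dMoE layers save their router statistics internally during the forward pass rather than returning router logits the way MixtralSparseMoeBlock does, so the loss has to be pulled out explicitly. A rough training-step sketch, assuming the megablocks loss helpers and that the same Arguments instance used to build the layers is in scope as args:

```python
# Rough sketch; dMoE saves per-layer router statistics during forward.
from megablocks.layers.moe import (
    batched_load_balancing_loss,
    clear_load_balancing_loss,
)

outputs = model(input_ids, labels=labels)  # HF loss; aux loss handled below
# Aggregate the saved statistics into one scalar, scaled by moe_loss_weight
# (so don't also add transformers' router aux loss on top of it).
lbl = batched_load_balancing_loss(args)    # same Arguments used to build dMoE
loss = outputs.loss + lbl
loss.backward()
clear_load_balancing_loss()                # reset the saved statistics each step
```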