casper-hansen opened this issue 2 months ago
Given that MegaBlocks is highly optimized for sparse MoE models like Mixtral, I am requesting support for the variant recently introduced by Google DeepMind, termed MoDE (Mixture-of-Depths-and-Experts). Because a router lets only a fraction of tokens pass through each block, with the rest skipping it via the residual stream, the increased sparsity yields much faster training and inference.
Paper: https://arxiv.org/abs/2404.02258
I found two implementations:
Very interested in this
We'd love community PRs for this! Happy to help review and design. It's not currently on our roadmap, but we are evaluating it.
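As context for the design discussion, below is a minimal sketch of the token routing the Mixture-of-Depths paper describes, assuming a PyTorch-style block interface. The class name `MoDBlock`, the `capacity` parameter, and the sigmoid weighting are illustrative assumptions, not MegaBlocks API.

```python
import torch
import torch.nn as nn


class MoDBlock(nn.Module):
    """Wraps any transformer block with Mixture-of-Depths token routing.

    A linear router scores each token; only the top-k tokens per
    sequence (k = capacity * seq_len) pass through the wrapped block,
    while the rest skip it through the residual stream. That skipping
    is the source of the extra sparsity and speedup.
    """

    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.block = block            # any (B, T, D) -> (B, T, D) module
        self.router = nn.Linear(d_model, 1, bias=False)
        self.capacity = capacity      # fraction of tokens routed through the block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(self.capacity * T))
        scores = self.router(x).squeeze(-1)              # (B, T) router logits
        weights, idx = torch.topk(scores, k, dim=-1)     # top-k tokens per sequence
        batch = torch.arange(B, device=x.device).unsqueeze(-1)
        routed = self.block(x[batch, idx])               # (B, k, D) through the block
        # Scaling by the router score keeps routing differentiable;
        # sigmoid here is an assumption, not prescribed by the paper.
        update = routed * torch.sigmoid(weights).unsqueeze(-1)
        out = x.clone()
        out[batch, idx] = x[batch, idx] + update         # residual update for routed tokens
        return out
```

A MegaBlocks integration would presumably wrap an MoE layer as `block`, so MoD's token-level sparsity stacks on top of expert-level sparsity (the "E" in MoDE). Note that top-k over the full sequence is non-causal; the paper discusses an auxiliary predictor to make routing decisions during autoregressive sampling.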