databricks / megablocks


Implement Mixture of Depths and Experts (MoDE) #103

Open casper-hansen opened 2 months ago

casper-hansen commented 2 months ago

Given that MegaBlocks is highly optimized for sparse MoE models like Mixtral, I am requesting support for a variant recently proposed by Google DeepMind, termed Mixture-of-Depths-and-Experts (MoDE). Benefits include much faster training and inference due to the increased sparsity.

Paper: https://arxiv.org/abs/2404.02258
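
For context, here is a minimal sketch of the Mixture-of-Depths routing idea from the paper (per-block top-k token selection with a residual bypass). The module and parameter names are my own placeholders, not MegaBlocks APIs, and the wrapped `block` is assumed to be any module mapping `(batch, tokens, d_model)` to the same shape:

```python
# Illustrative Mixture-of-Depths routing sketch (not MegaBlocks code).
import torch
import torch.nn as nn


class MixtureOfDepthsRouter(nn.Module):
    """Routes only the top-k tokens per sequence through `block`; the rest skip it."""

    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.block = block
        self.capacity = capacity
        self.router = nn.Linear(d_model, 1, bias=False)  # per-token scalar score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, s, d = x.shape
        k = max(1, int(self.capacity * s))

        scores = self.router(x).squeeze(-1)             # (batch, seq_len)
        topk_scores, topk_idx = scores.topk(k, dim=-1)  # top-k tokens per sequence

        # Gather only the selected tokens and run them through the block.
        idx = topk_idx.unsqueeze(-1).expand(-1, -1, d)  # (batch, k, d_model)
        selected = torch.gather(x, 1, idx)
        processed = self.block(selected)

        # Scale by the router score so the router receives gradient, then scatter
        # the processed tokens back; unselected tokens pass through unchanged.
        processed = processed * torch.sigmoid(topk_scores).unsqueeze(-1)
        return x + torch.scatter(torch.zeros_like(x), 1, idx, processed)
```

With `capacity=0.125` (one of the settings used in the paper), only ~12.5% of tokens per sequence pass through the wrapped block, which is where the training and inference compute savings come from.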

I found two implementations:

ehartford commented 2 months ago

Very interested in this

mvpatel2000 commented 2 months ago

We'd love community PRs for this! Happy to help review and design. It's not currently on our roadmap, but we are evaluating it.