databricks / megablocks


Implement Mixture of Depths and Experts (MoDE) #103

Open casper-hansen opened 2 months ago

casper-hansen commented 2 months ago

Given that MegaBlocks is highly optimized for sparse MoE models like Mixtral, I am requesting support for a variant recently proposed by Google DeepMind, termed Mixture-of-Depths-and-Experts (MoDE). Benefits include much faster training and inference due to the increased sparsity.

Paper: https://arxiv.org/abs/2404.02258
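
For context, here is a minimal sketch of the Mixture-of-Depths routing idea from the paper (per-block top-k token selection with a residual bypass). The module and parameter names are my own placeholders, not MegaBlocks APIs, and the wrapped `block` is assumed to be any module mapping `(batch, tokens, d_model)` to the same shape:

```python
# Illustrative Mixture-of-Depths routing sketch (not MegaBlocks code).
import torch
import torch.nn as nn


class MixtureOfDepthsRouter(nn.Module):
    """Routes only the top-k tokens per sequence through `block`; the rest skip it."""

    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.block = block
        self.capacity = capacity
        self.router = nn.Linear(d_model, 1, bias=False)  # per-token scalar score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, s, d = x.shape
        k = max(1, int(self.capacity * s))

        scores = self.router(x).squeeze(-1)             # (batch, seq_len)
        topk_scores, topk_idx = scores.topk(k, dim=-1)  # top-k tokens per sequence

        # Gather only the selected tokens and run them through the block.
        idx = topk_idx.unsqueeze(-1).expand(-1, -1, d)  # (batch, k, d_model)
        selected = torch.gather(x, 1, idx)
        processed = self.block(selected)

        # Scale by the router score so the router receives gradient, then scatter
        # the processed tokens back; unselected tokens pass through unchanged.
        processed = processed * torch.sigmoid(topk_scores).unsqueeze(-1)
        return x + torch.scatter(torch.zeros_like(x), 1, idx, processed)
```

With `capacity=0.125` (one of the settings used in the paper), only ~12.5% of tokens per sequence pass through the wrapped block, which is where the training and inference compute savings come from.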

I found two implementations:

ehartford commented 2 months ago

Very interested in this

mvpatel2000 commented 2 months ago

We'd love community PRs for this! Happy to help review and design. It's not currently on our roadmap, but we are evaluating it.