Awesome Mixture-of-Experts Papers

A curated list of exceptional papers and resources on Mixture of Experts and related topics.

News: Our Mixture of Experts survey, "The Evolution of Mixture of Experts: A Survey from Basics to Breakthroughs," has been released.


Links

Mendeley | ResearchGate | PDF

If our work has been of assistance to you, please feel free to cite our survey. Thank you.

@article{vats2024evolution,
  author = {Vats, Arpita and Raja, Rahul and Jain, Vinija and Chadha, Aman},
  title  = {The Evolution of Mixture of Experts: A Survey from Basics to Breakthroughs},
  year   = {2024},
  month  = {08},
  pages  = {12}
}

Table of Contents

Evolution in Sparse Mixture of Experts
Collection of Recent MoE Papers
MoE in Visual Domain
MoE in LLMs
MoE for Scaling LLMs
MoE: Enhancing System Performance and Efficiency
Integrating Mixture of Experts into Recommendation Algorithms
Python Libraries for MoE

Evolution in Sparse Mixture of Experts

| Name | Paper | Venue | Year |
|------|-------|-------|------|
| The Sparsely-Gated Mixture-of-Experts Layer | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | arXiv | 2017 |
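
For readers new to the topic, the sketch below illustrates the core mechanism of the sparsely-gated layer listed above: a learned gate scores every expert for each token, only the top-k experts are evaluated, and their outputs are combined with the renormalized gate weights. This is a minimal illustration under our own naming (TopKMoE, d_hidden, n_experts, k), not the paper's reference implementation.

```python
# Minimal sketch of a sparsely-gated top-k MoE layer.
# All names here are illustrative and not taken from any paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)  # router: token -> expert logits
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n_tokens, d_model)
        weights, idx = torch.topk(self.gate(x), self.k, dim=-1)  # keep only top-k experts per token
        weights = F.softmax(weights, dim=-1)                     # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):           # plain loops for clarity; real systems batch tokens per expert
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: 16 tokens of width 64, 8 experts, 2 active per token.
x = torch.randn(16, 64)
print(TopKMoE(d_model=64, d_hidden=128)(x).shape)  # torch.Size([16, 64])
```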

Collection of Recent MoE Papers

MoE in Visual Domain

| Name | Paper | Venue | Year |
|------|-------|-------|------|
| MoE-FFD | MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection | arXiv | 2024 |
| MLLMs | MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection | arXiv | 2024 |
| MoE-LLaVA | MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | arXiv | 2024 |
| MoVA | MoVA: Adapting Mixture of Vision Experts to Multimodal Context | arXiv | 2024 |
| MetaBEV | MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation | arXiv | 2023 |
| AdaMV-MoE | AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of-Experts | CVPR | 2023 |
| ERNIE-ViLG 2.0 | ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts | arXiv | 2023 |
| M³ViT | Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design | arXiv | 2022 |
| LIMoE | Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts | arXiv | 2022 |
| MoEBERT | MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation | arXiv | 2022 |
| VLMo | VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | arXiv | 2022 |
| DeepSpeed-MoE | DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale | arXiv | 2022 |
| V-MoE | Scaling Vision with Sparse Mixture of Experts | arXiv | 2021 |
| DSelect-k | DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning | arXiv | 2021 |
| MMoE | Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts | ACM | 2018 |

MoE in LLMs

| Name | Paper | Venue | Year |
|------|-------|-------|------|
| LoRAMoE | LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin | arXiv | 2024 |
| Flan-MoE | Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models | ICLR | 2024 |
| RAPHAEL | RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths | arXiv | 2024 |
| Branch-Train-MiX | Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | arXiv | 2024 |
| Self-MoE | Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts | arXiv | 2024 |
| CuMo | CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024 |
| MoELoRA | MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models | arXiv | 2024 |
| Mistral | Mistral 7B | arXiv | 2023 |
| HetuMoE | HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System | arXiv | 2022 |
| GLaM | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | arXiv | 2022 |
| eDiff-I | eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers | arXiv | 2022 |
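
Several entries above (e.g., CuMo's co-upcycled experts, Branch-Train-MiX's expert mixing) build an MoE from already-trained dense feed-forward blocks instead of training experts from scratch. The snippet below is a heavily simplified sketch of that upcycling idea under generic assumptions: each expert starts as a copy of a dense FFN and a freshly initialized router is added on top. It is not the exact recipe of any listed paper, and the helper name upcycle_ffn_to_moe is hypothetical.

```python
# Hedged sketch of "upcycling" a trained dense FFN into an MoE block:
# copy its weights into every expert slot and attach a new router.
# This is a generic illustration, not any listed paper's exact procedure.
import copy

import torch.nn as nn


def upcycle_ffn_to_moe(dense_ffn: nn.Module, d_model: int, n_experts: int = 4) -> nn.ModuleDict:
    """Return a router plus n_experts copies of dense_ffn (weights identical only at init)."""
    return nn.ModuleDict({
        "router": nn.Linear(d_model, n_experts),  # freshly initialized gate
        "experts": nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(n_experts)]),
    })


# Example: upcycle a 2-layer FFN of width 64 into a 4-expert block; tokens would then be
# routed with a top-k gate such as the TopKMoE sketch earlier in this list.
dense = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
moe_block = upcycle_ffn_to_moe(dense, d_model=64, n_experts=4)
```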

MoE for Scaling LLMs

| Name | Paper | Venue | Year |
|------|-------|-------|------|
| u-LLaVA | u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model | arXiv | 2024 |
| MoLE | QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models | arXiv | 2024 |
| Lory | Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training | arXiv | 2024 |
| Uni-MoE | Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts | arXiv | 2024 |
| MH-MoE | Multi-Head Mixture-of-Experts | arXiv | 2024 |
| DeepSeekMoE | DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models | arXiv | 2024 |
| Mini-Gemini | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024 |
| OpenMoE | OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models | arXiv | 2024 |
| Tutel | Tutel: Adaptive Mixture-of-Experts at Scale | arXiv | 2023 |
| QMoE | QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models | arXiv | 2023 |
| Switch-NeRF | Switch-NeRF: Learning Scene Decomposition with Mixture of Experts for Large-scale Neural Radiance Fields | ICLR | 2023 |
| SaMoE | SaMoE: Parameter Efficient MoE Language Models via Self-Adaptive Expert Combination | ICLR | 2023 |
| JetMoE | JetMoE: Reaching Llama2 Performance with 0.1M Dollars | arXiv | 2024 |
| MegaBlocks | MegaBlocks: Efficient Sparse Training with Mixture-of-Experts | arXiv | 2022 |
| ST-MoE | ST-MoE: Designing Stable and Transferable Sparse Expert Models | arXiv | 2022 |
| Uni-Perceiver-MoE | Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs | NeurIPS | 2022 |
| SpeechMoE | SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts | arXiv | 2021 |
| Fully-Differentiable Sparse Transformer | Sparse is Enough in Scaling Transformers | arXiv | 2021 |

MoE: Enhancing System Performance and Efficiency

| Name | Paper | Venue | Year |
|------|-------|-------|------|
| PMoE | PMoE: Progressive Mixture of Experts with Asymmetric Transformer for Continual Learning | arXiv | 2024 |
| HyperMoE | HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts | arXiv | 2024 |
| BlackMamba | BlackMamba: Mixture of Experts for State-Space Models | arXiv | 2024 |
| ScheMoE | ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling | arXiv | 2024 |
| Pre-gated MoE | Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference | arXiv | 2024 |
| MoE-Mamba | MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts | arXiv | 2024 |
| Parameter-efficient MoEs | Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning | arXiv | 2023 |
| SMoE-Dropout | Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers | arXiv | 2023 |
| StableMoE | StableMoE: Stable Routing Strategy for Mixture of Experts | arXiv | 2022 |
| Alpa | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | arXiv | 2022 |
| BaGuaLu | BaGuaLu: targeting brain scale pretrained models with over 37 million cores | ACM | 2022 |
| MEFT | MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter | arXiv | 2024 |
| EdgeMoE | EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models | arXiv | 2023 |
| SE-MoE | SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System | arXiv | 2022 |
| NLLB | No Language Left Behind: Scaling Human-Centered Machine Translation | arXiv | 2022 |
| EvoMoE | EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate | arXiv | 2022 |
| FastMoE | FastMoE: A Fast Mixture-of-Expert Training System | arXiv | 2021 |
| ACE | ACE: Ally Complementary Experts for Solving Long-Tailed Recognition in One-Shot | ICCV | 2021 |
| M6-10T | M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining | arXiv | 2021 |
| GShard | GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding | arXiv | 2020 |
| PAD-Net | PAD-Net: Multi-Tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing | arXiv | 2018 |
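
Load balancing across experts is a recurring concern for the routing and training systems above (GShard's auxiliary loss is the classic example). The function below is a hedged sketch of that GShard/Switch-style auxiliary loss: it multiplies the fraction of tokens each expert actually receives by the mean routing probability assigned to it, which is minimized when both are uniform. All identifiers (load_balancing_loss, router_logits, expert_index) are illustrative, not taken from any library.

```python
# Hedged sketch of an auxiliary load-balancing loss for MoE routers,
# following the fraction-of-tokens x mean-router-probability formulation.
import torch
import torch.nn.functional as F


def load_balancing_loss(router_logits: torch.Tensor, expert_index: torch.Tensor, n_experts: int) -> torch.Tensor:
    """router_logits: (n_tokens, n_experts); expert_index: (n_tokens,) hard expert choice per token."""
    probs = F.softmax(router_logits, dim=-1)                            # soft router distribution
    dispatch = F.one_hot(expert_index, n_experts).float().mean(dim=0)   # fraction of tokens per expert
    importance = probs.mean(dim=0)                                      # mean router probability per expert
    return n_experts * torch.sum(dispatch * importance)                 # equals 1.0 for a perfectly uniform router


# Example: 32 tokens routed over 8 experts with greedy (top-1) assignment.
logits = torch.randn(32, 8)
aux = load_balancing_loss(logits, logits.argmax(dim=-1), n_experts=8)
```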

Integrating Mixture of Experts into Recommendation Algorithms

| Name | Paper | Venue | Year |
|------|-------|-------|------|
| MoME | MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models | arXiv | 2024 |
| CAME | CAME: Competitively Learning a Mixture-of-Experts Model for First-stage Retrieval | ACM | 2024 |
| SummaReranker | SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization | arXiv | 2022 |
| MDFEND | MDFEND: Multi-domain Fake News Detection | arXiv | 2022 |
| PLE | Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations | RecSys | 2020 |

Python Libraries for MoE

| Name | Paper | Venue | Year |
|------|-------|-------|------|
| MoE-Infinity | MoE-Infinity: Offloading-Efficient MoE Model Serving | arXiv | 2024 |
| SMT 2.0 | SMT 2.0: A Surrogate Modeling Toolbox with a focus on Hierarchical and Mixed Variables Gaussian Processes | arXiv | 2023 |


We hope our survey and this collection of recent MoE papers can help your work.