Awesome Mixture-of-Experts Papers

A curated list of exceptional papers and resources on Mixture of Experts and related topics.

News: Our Mixture of Experts survey, "The Evolution of Mixture of Experts: A Survey from Basics to Breakthroughs," has been released.


Links

Mendeley | ResearchGate | PDF

If our work has been of assistance to you, please feel free to cite our survey. Thank you.

@article{vats2024evolution,
  author = {Vats, Arpita and Raja, Rahul and Jain, Vinija and Chadha, Aman},
  title  = {The Evolution of Mixture of Experts: A Survey from Basics to Breakthroughs},
  year   = {2024},
  month  = {08},
  pages  = {12}
}

Table of Contents

Evolution in Sparse Mixture of Experts
Collection of Recent MoE Papers
MoE in Visual Domain
MoE in LLMs
MoE for Scaling LLMs
MoE: Enhancing System Performance and Efficiency
Integrating Mixture of Experts into Recommendation Algorithms
Python Libraries for MoE

Evolution in Sparse Mixture of Experts

| Name | Paper | Venue | Year |
|------|-------|-------|------|
| The Sparsely-Gated Mixture-of-Experts Layer | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | arXiv | 2017 |
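
For readers new to the topic, the sketch below illustrates the core mechanism of the sparsely-gated layer listed above: a learned gate scores every expert for each token, only the top-k experts are evaluated, and their outputs are combined with the renormalized gate weights. This is a minimal illustration under our own naming (TopKMoE, d_hidden, n_experts, k), not the paper's reference implementation.

```python
# Minimal sketch of a sparsely-gated top-k MoE layer.
# All names here are illustrative and not taken from any paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)  # router: token -> expert logits
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n_tokens, d_model)
        weights, idx = torch.topk(self.gate(x), self.k, dim=-1)  # keep only top-k experts per token
        weights = F.softmax(weights, dim=-1)                     # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):           # plain loops for clarity; real systems batch tokens per expert
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: 16 tokens of width 64, 8 experts, 2 active per token.
x = torch.randn(16, 64)
print(TopKMoE(d_model=64, d_hidden=128)(x).shape)  # torch.Size([16, 64])
```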

Collection of Recent MoE Papers

MoE in Visual Domain

| Name | Paper | Venue | Year |
|------|-------|-------|------|
| MoE-FFD | MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection | arXiv | 2024 |
| MLLMs | MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection | arXiv | 2024 |
| MoE-LLaVA | MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | arXiv | 2024 |
| MoVA | MoVA: Adapting Mixture of Vision Experts to Multimodal Context | arXiv | 2024 |
| MetaBEV | MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation | arXiv | 2023 |
| AdaMV-MoE | AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of-Experts | CVPR | 2023 |
| ERNIE-ViLG 2.0 | ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts | arXiv | 2023 |
| M³ViT | Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design | arXiv | 2022 |
| LIMoE | Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts | arXiv | 2022 |
| MoEBERT | MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation | arXiv | 2022 |
| VLMo | VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | arXiv | 2022 |
| DeepSpeed-MoE | DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale | arXiv | 2022 |
| V-MoE | Scaling Vision with Sparse Mixture of Experts | arXiv | 2021 |
| DSelect-k | DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning | arXiv | 2021 |
| MMoE | Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts | ACM | 2018 |

MoE in LLMs

| Name | Paper | Venue | Year |
|------|-------|-------|------|
| LoRAMoE | LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin | arXiv | 2024 |
| Flan-MoE | Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models | ICLR | 2024 |
| RAPHAEL | RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths | arXiv | 2024 |
| Branch-Train-MiX | Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM | arXiv | 2024 |
| Self-MoE | Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts | arXiv | 2024 |
| CuMo | CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024 |
| MoELoRA | MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models | arXiv | 2024 |
| Mistral | Mistral 7B | arXiv | 2023 |
| HetuMoE | HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System | arXiv | 2022 |
| GLaM | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | arXiv | 2022 |
| eDiff-I | eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers | arXiv | 2022 |
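
Several entries above (e.g., CuMo's co-upcycled experts, Branch-Train-MiX's expert mixing) build an MoE from already-trained dense feed-forward blocks instead of training experts from scratch. The snippet below is a heavily simplified sketch of that upcycling idea under generic assumptions: each expert starts as a copy of a dense FFN and a freshly initialized router is added on top. It is not the exact recipe of any listed paper, and the helper name upcycle_ffn_to_moe is hypothetical.

```python
# Hedged sketch of "upcycling" a trained dense FFN into an MoE block:
# copy its weights into every expert slot and attach a new router.
# This is a generic illustration, not any listed paper's exact procedure.
import copy

import torch.nn as nn


def upcycle_ffn_to_moe(dense_ffn: nn.Module, d_model: int, n_experts: int = 4) -> nn.ModuleDict:
    """Return a router plus n_experts copies of dense_ffn (weights identical only at init)."""
    return nn.ModuleDict({
        "router": nn.Linear(d_model, n_experts),  # freshly initialized gate
        "experts": nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(n_experts)]),
    })


# Example: upcycle a 2-layer FFN of width 64 into a 4-expert block; tokens would then be
# routed with a top-k gate such as the TopKMoE sketch earlier in this list.
dense = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
moe_block = upcycle_ffn_to_moe(dense, d_model=64, n_experts=4)
```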

MoE for Scaling LLMs

| Name | Paper | Venue | Year |
|------|-------|-------|------|
| u-LLaVA | u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model | arXiv | 2024 |
| MoLE | QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models | arXiv | 2024 |
| Lory | Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training | arXiv | 2024 |
| Uni-MoE | Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts | arXiv | 2024 |
| MH-MoE | Multi-Head Mixture-of-Experts | arXiv | 2024 |
| DeepSeekMoE | DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models | arXiv | 2024 |
| Mini-Gemini | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024 |
| OpenMoE | OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models | arXiv | 2024 |
| Tutel | Tutel: Adaptive Mixture-of-Experts at Scale | arXiv | 2023 |
| QMoE | QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models | arXiv | 2023 |
| Switch-NeRF | Switch-NeRF: Learning Scene Decomposition with Mixture of Experts for Large-scale Neural Radiance Fields | ICLR | 2023 |
| SaMoE | SaMoE: Parameter Efficient MoE Language Models via Self-Adaptive Expert Combination | ICLR | 2023 |
| JetMoE | JetMoE: Reaching Llama2 Performance with 0.1M Dollars | arXiv | 2024 |
| MegaBlocks | MegaBlocks: Efficient Sparse Training with Mixture-of-Experts | arXiv | 2022 |
| ST-MoE | ST-MoE: Designing Stable and Transferable Sparse Expert Models | arXiv | 2022 |
| Uni-Perceiver-MoE | Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs | NeurIPS | 2022 |
| SpeechMoE | SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts | arXiv | 2021 |
| Fully-Differentiable Sparse Transformer | Sparse is Enough in Scaling Transformers | arXiv | 2021 |

MoE: Enhancing System Performance and Efficiency

| Name | Paper | Venue | Year |
|------|-------|-------|------|
| PMoE | PMoE: Progressive Mixture of Experts with Asymmetric Transformer for Continual Learning | arXiv | 2024 |
| HyperMoE | HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts | arXiv | 2024 |
| BlackMamba | BlackMamba: Mixture of Experts for State-Space Models | arXiv | 2024 |
| ScheMoE | ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling | arXiv | 2024 |
| Pre-gated MoE | Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference | arXiv | 2024 |
| MoE-Mamba | MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts | arXiv | 2024 |
| Parameter-efficient MoEs | Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning | arXiv | 2023 |
| SMoE-Dropout | Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers | arXiv | 2023 |
| StableMoE | StableMoE: Stable Routing Strategy for Mixture of Experts | arXiv | 2022 |
| Alpa | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | arXiv | 2022 |
| BaGuaLu | BaGuaLu: targeting brain scale pretrained models with over 37 million cores | ACM | 2022 |
| MEFT | MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter | arXiv | 2024 |
| EdgeMoE | EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models | arXiv | 2023 |
| SE-MoE | SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System | arXiv | 2022 |
| NLLB | No Language Left Behind: Scaling Human-Centered Machine Translation | arXiv | 2022 |
| EvoMoE | EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate | arXiv | 2022 |
| FastMoE | FastMoE: A Fast Mixture-of-Expert Training System | arXiv | 2021 |
| ACE | ACE: Ally Complementary Experts for Solving Long-Tailed Recognition in One-Shot | ICCV | 2021 |
| M6-10T | M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining | arXiv | 2021 |
| GShard | GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding | arXiv | 2020 |
| PAD-Net | PAD-Net: Multi-Tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing | arXiv | 2018 |
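
Load balancing across experts is a recurring concern for the routing and training systems above (GShard's auxiliary loss is the classic example). The function below is a hedged sketch of that GShard/Switch-style auxiliary loss: it multiplies the fraction of tokens each expert actually receives by the mean routing probability assigned to it, which is minimized when both are uniform. All identifiers (load_balancing_loss, router_logits, expert_index) are illustrative, not taken from any library.

```python
# Hedged sketch of an auxiliary load-balancing loss for MoE routers,
# following the fraction-of-tokens x mean-router-probability formulation.
import torch
import torch.nn.functional as F


def load_balancing_loss(router_logits: torch.Tensor, expert_index: torch.Tensor, n_experts: int) -> torch.Tensor:
    """router_logits: (n_tokens, n_experts); expert_index: (n_tokens,) hard expert choice per token."""
    probs = F.softmax(router_logits, dim=-1)                            # soft router distribution
    dispatch = F.one_hot(expert_index, n_experts).float().mean(dim=0)   # fraction of tokens per expert
    importance = probs.mean(dim=0)                                      # mean router probability per expert
    return n_experts * torch.sum(dispatch * importance)                 # equals 1.0 for a perfectly uniform router


# Example: 32 tokens routed over 8 experts with greedy (top-1) assignment.
logits = torch.randn(32, 8)
aux = load_balancing_loss(logits, logits.argmax(dim=-1), n_experts=8)
```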

Integrating Mixture of Experts into Recommendation Algorithms

| Name | Paper | Venue | Year |
|------|-------|-------|------|
| MoME | MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models | arXiv | 2024 |
| CAME | CAME: Competitively Learning a Mixture-of-Experts Model for First-stage Retrieval | ACM | 2024 |
| SummaReranker | SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization | arXiv | 2022 |
| MDFEND | MDFEND: Multi-domain Fake News Detection | arXiv | 2022 |
| PLE | Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations | RecSys | 2020 |

Python Libraries for MoE

| Name | Paper | Venue | Year |
|------|-------|-------|------|
| MoE-Infinity | MoE-Infinity: Offloading-Efficient MoE Model Serving | arXiv | 2024 |
| SMT 2.0 | SMT 2.0: A Surrogate Modeling Toolbox with a focus on Hierarchical and Mixed Variables Gaussian Processes | arXiv | 2023 |


We hope our survey and this collection of recent MoE papers can help your work.