PKU-YuanGroup / MoE-LLaVA

Mixture-of-Experts for Large Vision-Language Models
https://arxiv.org/abs/2401.15947
Apache License 2.0

[Question] Model and Dataset Size #54

Open adrielkuek opened 6 months ago

adrielkuek commented 6 months ago

Question

Hi, I have two questions I would like to pose to the authors:

1. I observed that the model sizes are limited to the smaller scale (1–3B parameters). Is there a specific reason for this choice when deciding how to train the experts? What challenges would you foresee in scaling up to the mid-range (13B–30B)?
2. The ablation studies highlight that limited instruction-tuned multimodal data hurts model sparsification during MoE training. Could you elaborate on why this is so, and perhaps share some insight into what a reasonable amount of data would be to achieve such sparsification? (See the sketch after this list for the mental model I am working from.)
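To make concrete what I mean by "sparsification": my (possibly naive) mental model is an FFN layer replaced by several experts plus a router that activates only the top-k experts per token, along the lines of the minimal PyTorch sketch below. All class and variable names here are my own illustration and not identifiers from this repo; the expert count and top-2 routing are assumptions based on my reading of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFFN(nn.Module):
    """Toy sparse MoE FFN: each token is processed by only its top-k experts."""

    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: one logit per expert for each token.
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        # Each expert is an independent two-layer FFN.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        gate_logits = self.router(x)                  # (tokens, experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # renormalize over the k chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e          # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


tokens = torch.randn(8, 512)                          # 8 tokens, hidden size 512
layer = SparseMoEFFN(hidden_dim=512, ffn_dim=2048)
print(layer(tokens).shape)                            # torch.Size([8, 512])
```

My question 2 is essentially about how much instruction-tuned data the router needs before this per-token routing becomes stable and specialized, rather than collapsing onto a few experts.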

Thanks very much for the great work.