PKU-YuanGroup / MoE-LLaVA

Mixture-of-Experts for Large Vision-Language Models
https://arxiv.org/abs/2401.15947
Apache License 2.0

[Question] Model and Dataset Size #54

Open adrielkuek opened 6 months ago

adrielkuek commented 6 months ago

Question

Hi, I have two questions I would like to pose to the authors:

1. I observed that the model sizes are limited to the smaller scale (1–3B parameters). Is there a specific reason for this choice when deciding how to train the experts? What challenges would you foresee in scaling up to the mid-range (13B–30B)?
2. The ablation studies highlight that limited instruction-tuned multimodal data hurts model sparsification during MoE training. Could you elaborate on why this is so, and perhaps share some insight into what a reasonable amount of data would be to achieve such sparsification? (See the sketch after this list for the mental model I am working from.)
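To make concrete what I mean by "sparsification": my (possibly naive) mental model is an FFN layer replaced by several experts plus a router that activates only the top-k experts per token, along the lines of the minimal PyTorch sketch below. All class and variable names here are my own illustration and not identifiers from this repo; the expert count and top-2 routing are assumptions based on my reading of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFFN(nn.Module):
    """Toy sparse MoE FFN: each token is processed by only its top-k experts."""

    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: one logit per expert for each token.
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        # Each expert is an independent two-layer FFN.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        gate_logits = self.router(x)                  # (tokens, experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # renormalize over the k chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e          # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


tokens = torch.randn(8, 512)                          # 8 tokens, hidden size 512
layer = SparseMoEFFN(hidden_dim=512, ffn_dim=2048)
print(layer(tokens).shape)                            # torch.Size([8, 512])
```

My question 2 is essentially about how much instruction-tuned data the router needs before this per-token routing becomes stable and specialized, rather than collapsing onto a few experts.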

Thanks very much for the great work.