PKU-YuanGroup / MoE-LLaVA

Mixture-of-Experts for Large Vision-Language Models
https://arxiv.org/abs/2401.15947
Apache License 2.0

DeepSpeed MoE question #64

Open BlackBearBiscuit opened 5 months ago

BlackBearBiscuit commented 5 months ago

Describe the issue

Issue: Have you experimented with MoE models larger than 13B? I am using ZeRO-2 with EP_SIZE=8, and I get a CUDA out-of-memory error while the optimizer states are being initialized. ZeRO-3 does not support MoE, and due to hardware constraints I cannot use offload either. Do I need to consider Megatron-DeepSpeed instead?
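For reference, below is a minimal sketch (not the actual MoE-LLaVA training script) of the kind of setup described above: ZeRO-2 combined with DeepSpeed expert parallelism at `ep_size=8`. The toy expert module, `hidden_size`, and the config values are placeholder assumptions; the point is only that the OOM reported here occurs inside `deepspeed.initialize`, when the ZeRO-2 optimizer partitions are allocated.

```python
import torch
import deepspeed
from deepspeed.moe.layer import MoE

hidden_size = 4096  # placeholder; a 13B-class model is larger


class ToyMoEBlock(torch.nn.Module):
    """Single MoE layer standing in for a full vision-language model."""

    def __init__(self):
        super().__init__()
        expert = torch.nn.Linear(hidden_size, hidden_size)
        # ep_size=8 shards the experts across the 8 GPUs.
        self.moe = MoE(hidden_size=hidden_size, expert=expert,
                       num_experts=8, ep_size=8, k=2)

    def forward(self, x):
        out, _aux_loss, _exp_counts = self.moe(x)
        return out


ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
    # ZeRO-2 partitions optimizer states and gradients; the reported OOM
    # happens while these optimizer-state partitions are being allocated.
    "zero_optimization": {"stage": 2},
}

model = ToyMoEBlock()
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```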

Environment:

GPU: 8×A100-80G

DeepSpeed version: 0.10.0
Torch version:
Transformers version:
Tokenizers version:

Command:

PASTE THE COMMANDS HERE.

Log:

PASTE THE LOGS HERE.

Screenshots: You may attach screenshots if it better explains the issue.