Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

VRAM requirements for SPHINX-MoE #151

Closed sisgrad closed 8 months ago

sisgrad commented 8 months ago

Hi. Thanks for this awesome project! Could you please clarify the minimal VRAM requirements for inference and finetuning of SPHINX-MoE? Does it support a llama.cpp-like inference strategy (where VRAM is low, but the model is split into chunks that run sequentially)?

ChrisLiu6 commented 8 months ago

LLaMA2-Accessory does not support splitting the model sequentially; instead, it splits the model horizontally through tensor and expert parallelism. Specifically, following Megatron, the attention layers are split by heads and the FFNs are split along the hidden dimension. The N resulting submodels are distributed among N GPUs, so ideally each GPU holds 1/N of the total parameters and performs 1/N of the total computation during the forward pass. This also means you can lower the burden on each individual GPU by using more GPUs.
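
To illustrate the idea (this is a minimal sketch, not code from LLaMA2-Accessory; `FFNShard` and the sizes below are made-up for illustration), here is how a Megatron-style split along the FFN hidden dimension leaves each rank with 1/N of that block's parameters:

```python
# Minimal sketch of Megatron-style tensor parallelism for one FFN block.
# The up-projection is column-parallel and the down-projection is row-parallel,
# so each of the N "ranks" holds 1/N of the FFN weights. In a real multi-GPU
# setup the partial outputs are combined with an all-reduce; here we just sum.
import torch
import torch.nn as nn


class FFNShard(nn.Module):
    """One rank's slice of a tensor-parallel FFN (illustrative name)."""

    def __init__(self, d_model: int, d_hidden: int, n_ranks: int):
        super().__init__()
        assert d_hidden % n_ranks == 0
        shard = d_hidden // n_ranks
        self.up = nn.Linear(d_model, shard, bias=False)    # column-parallel slice
        self.down = nn.Linear(shard, d_model, bias=False)  # row-parallel slice

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each rank produces a partial result of the full FFN output.
        return self.down(torch.relu(self.up(x)))


if __name__ == "__main__":
    d_model, d_hidden, n_ranks = 4096, 11008, 4  # illustrative sizes
    shards = [FFNShard(d_model, d_hidden, n_ranks) for _ in range(n_ranks)]
    x = torch.randn(2, d_model)
    # Summing the per-rank partial outputs is what the all-reduce would do.
    y = sum(s(x) for s in shards)
    per_rank = sum(p.numel() for p in shards[0].parameters())
    print(f"params per rank: {per_rank:,} (1/{n_ranks} of the full FFN)")
```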

We used 32 A100 80G GPUs to train SPHINX-MoE; 16 A100 80G GPUs should also be okay. For inference, 2 A100 80G GPUs or 8 GPUs with 24G each should be enough, without the need for quantization.
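
As a rough sanity check on those numbers (the parameter count below is an illustrative assumption, not an official figure for SPHINX-MoE), a back-of-envelope estimate of per-GPU weight memory under even tensor/expert-parallel sharding looks like this:

```python
# Back-of-envelope estimate: bf16/fp16 weight memory per GPU when the model's
# parameters are split evenly over N GPUs. Activations and the KV cache add
# extra memory on top of this, which is why headroom is needed.
def weights_gib_per_gpu(n_params: float, n_gpus: int, bytes_per_param: int = 2) -> float:
    return n_params * bytes_per_param / n_gpus / 1024**3


if __name__ == "__main__":
    n_params = 50e9  # assumed, illustrative parameter count for an MoE-scale model
    for n_gpus in (2, 8):
        gib = weights_gib_per_gpu(n_params, n_gpus)
        print(f"{n_gpus} GPUs -> ~{gib:.0f} GiB of weights per GPU")
    # ~47 GiB/GPU on 2x 80G cards, ~12 GiB/GPU on 8x 24G cards, leaving room
    # for activations and the KV cache without quantization.
```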