Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

VRAM requirements for SPHINX-MoE #151

Closed sisgrad closed 8 months ago

sisgrad commented 8 months ago

Hi. Thanks for this awesome project! Could you please clarify the minimal VRAM requirements for inference and finetuning of SPHINX-MoE? Does it support a llama.cpp-like inference strategy (where VRAM is low, but the model is split into chunks that run sequentially)?

ChrisLiu6 commented 8 months ago

LLaMA2-Accessory does not support splitting the model sequentially; instead, it splits the model horizontally through tensor and expert parallelism. Specifically, following Megatron, the attention layers are split by heads and the FFNs are split along the hidden dimension. The N resulting submodels are distributed among N GPUs, so ideally each GPU holds 1/N of the total parameters and performs 1/N of the total computation during the forward pass. This also means you can lower the burden on each individual GPU by using more GPUs.
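
To illustrate the idea (this is a minimal sketch, not code from LLaMA2-Accessory; `FFNShard` and the sizes below are made-up for illustration), here is how a Megatron-style split along the FFN hidden dimension leaves each rank with 1/N of that block's parameters:

```python
# Minimal sketch of Megatron-style tensor parallelism for one FFN block.
# The up-projection is column-parallel and the down-projection is row-parallel,
# so each of the N "ranks" holds 1/N of the FFN weights. In a real multi-GPU
# setup the partial outputs are combined with an all-reduce; here we just sum.
import torch
import torch.nn as nn


class FFNShard(nn.Module):
    """One rank's slice of a tensor-parallel FFN (illustrative name)."""

    def __init__(self, d_model: int, d_hidden: int, n_ranks: int):
        super().__init__()
        assert d_hidden % n_ranks == 0
        shard = d_hidden // n_ranks
        self.up = nn.Linear(d_model, shard, bias=False)    # column-parallel slice
        self.down = nn.Linear(shard, d_model, bias=False)  # row-parallel slice

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each rank produces a partial result of the full FFN output.
        return self.down(torch.relu(self.up(x)))


if __name__ == "__main__":
    d_model, d_hidden, n_ranks = 4096, 11008, 4  # illustrative sizes
    shards = [FFNShard(d_model, d_hidden, n_ranks) for _ in range(n_ranks)]
    x = torch.randn(2, d_model)
    # Summing the per-rank partial outputs is what the all-reduce would do.
    y = sum(s(x) for s in shards)
    per_rank = sum(p.numel() for p in shards[0].parameters())
    print(f"params per rank: {per_rank:,} (1/{n_ranks} of the full FFN)")
```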

We used 32 A100 80G GPUs to train SPHINX-MoE; 16 A100 80G GPUs should also be okay. For inference, 2 A100 80G GPUs or 8 GPUs with 24G each should be enough, without the need for quantization.
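
As a rough sanity check on those numbers (the parameter count below is an illustrative assumption, not an official figure for SPHINX-MoE), a back-of-envelope estimate of per-GPU weight memory under even tensor/expert-parallel sharding looks like this:

```python
# Back-of-envelope estimate: bf16/fp16 weight memory per GPU when the model's
# parameters are split evenly over N GPUs. Activations and the KV cache add
# extra memory on top of this, which is why headroom is needed.
def weights_gib_per_gpu(n_params: float, n_gpus: int, bytes_per_param: int = 2) -> float:
    return n_params * bytes_per_param / n_gpus / 1024**3


if __name__ == "__main__":
    n_params = 50e9  # assumed, illustrative parameter count for an MoE-scale model
    for n_gpus in (2, 8):
        gib = weights_gib_per_gpu(n_params, n_gpus)
        print(f"{n_gpus} GPUs -> ~{gib:.0f} GiB of weights per GPU")
    # ~47 GiB/GPU on 2x 80G cards, ~12 GiB/GPU on 8x 24G cards, leaving room
    # for activations and the KV cache without quantization.
```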