Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Questions about SPHINX inference #121

Closed · hithqd closed this issue 7 months ago

hithqd commented 7 months ago

Hi, thank you for your work. I am trying to run the SPHINX inference code you provided. I used the Single-GPU Inference code from the README.md, but it raises an OutOfMemoryError. How can I fix it?

ChrisLiu6 commented 7 months ago

You can try one of the following; a rough memory estimate is sketched after the list:

  1. using 2 GPUs; OR
  2. quantization, see https://github.com/Alpha-VLLM/LLaMA2-Accessory/issues/114
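
For intuition on why the single-GPU run hits OOM, here is a rough back-of-envelope sketch. The ~13B parameter count assumes SPHINX's LLaMA-2-13B language backbone and ignores the visual encoders, activations, and KV cache, so the real footprint is larger; `weight_memory_gib` is just an illustrative helper, not part of the repo:

```python
# Rough estimate of the GPU memory needed just to hold the language-model
# weights; the visual encoders, activations and KV cache add more on top.
import torch

def weight_memory_gib(dtype: torch.dtype, n_params: float = 13e9) -> float:
    """GiB required to store n_params weights in the given dtype."""
    bytes_per_param = torch.finfo(dtype).bits // 8
    return n_params * bytes_per_param / 1024**3

print(f"fp16/bf16 weights: {weight_memory_gib(torch.float16):.1f} GiB")  # ~24 GiB
print(f"fp32 weights:      {weight_memory_gib(torch.float32):.1f} GiB")  # ~48 GiB
```

Splitting the weights across two GPUs roughly halves the per-GPU requirement, and 4-bit quantization cuts the fp16 figure down to roughly a quarter, which is why either suggestion avoids the OOM.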
hithqd commented 7 months ago

Thanks. But when I use 2 GPUs and run the demo with print(response), the output is

[screenshot: the response is printed twice, once by each process]

Does it mean it infers on the 2 GPUs separately? And if I want to output only one response, what should I do?

ChrisLiu6 commented 7 months ago

No, the two GPUs work collaboratively rather than separately. In short, with model parallelism the weights of the embedding and linear layers are split and distributed across multiple GPUs, so each GPU holds only a part of the complete model. The matrix multiplications are correspondingly divided into sub-matrix multiplications, and the partial results are gathered so that the computation is equivalent to the classical single-GPU computation. Therefore, although both processes print the complete output, each of them has only performed part of the computation needed to produce it. That is also why the inference fits on 2 GPUs but not on one.
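
As a toy illustration of the description above (this is not LLaMA2-Accessory's actual implementation, just a single-process sketch of the idea), splitting a linear layer's weight across two hypothetical ranks and concatenating the partial results reproduces the single-GPU matmul:

```python
# Toy sketch of model (tensor) parallelism for one linear layer, run in a
# single process for clarity; real code would do the final "gather" with
# torch.distributed collectives across GPUs.
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)            # activations, shape (batch, in_features)
w = torch.randn(16, 8)           # full weight of a Linear(8 -> 16)

full = x @ w.t()                 # classical single-GPU result, shape (4, 16)

w0, w1 = w.chunk(2, dim=0)       # each "rank" holds half of the output features
part0 = x @ w0.t()               # would be computed on GPU 0
part1 = x @ w1.t()               # would be computed on GPU 1
gathered = torch.cat([part0, part1], dim=-1)   # the gather step

assert torch.allclose(full, gathered)   # equivalent to the single-GPU computation
```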

If you only want to see the output once, you can suppress printing on all ranks other than rank 0; an implementation of this can be found at https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/accessory/util/misc.py#L45
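
A minimal sketch of that idea, assuming the demo runs under torch.distributed (`response` here stands in for whatever the generation call returns); the helper linked above implements a more general version of this:

```python
# Minimal sketch: only rank 0 prints, so the response appears once.
import torch.distributed as dist

def is_main_process() -> bool:
    # Treat a non-distributed (single-GPU) run as the main process.
    return (not dist.is_initialized()) or dist.get_rank() == 0

response = "..."  # placeholder for the string produced by the SPHINX demo
if is_main_process():
    print(response)
```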