Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Questions about SPHINX inference #121

Closed · hithqd closed this issue 7 months ago

hithqd commented 7 months ago

Hi, thank you for your work. I am trying to run the SPHINX inference code you provided. I used the Single-GPU Inference code from the README.md, but it raises an OutOfMemoryError. How can I fix it?

ChrisLiu6 commented 7 months ago

You can try one of the following; a rough memory estimate is sketched after the list:

  1. using 2 GPUs; OR
  2. quantization, see https://github.com/Alpha-VLLM/LLaMA2-Accessory/issues/114
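
For intuition on why the single-GPU run hits OOM, here is a rough back-of-envelope sketch. The ~13B parameter count assumes SPHINX's LLaMA-2-13B language backbone and ignores the visual encoders, activations, and KV cache, so the real footprint is larger; `weight_memory_gib` is just an illustrative helper, not part of the repo:

```python
# Rough estimate of the GPU memory needed just to hold the language-model
# weights; the visual encoders, activations and KV cache add more on top.
import torch

def weight_memory_gib(dtype: torch.dtype, n_params: float = 13e9) -> float:
    """GiB required to store n_params weights in the given dtype."""
    bytes_per_param = torch.finfo(dtype).bits // 8
    return n_params * bytes_per_param / 1024**3

print(f"fp16/bf16 weights: {weight_memory_gib(torch.float16):.1f} GiB")  # ~24 GiB
print(f"fp32 weights:      {weight_memory_gib(torch.float32):.1f} GiB")  # ~48 GiB
```

Splitting the weights across two GPUs roughly halves the per-GPU requirement, and 4-bit quantization cuts the fp16 figure down to roughly a quarter, which is why either suggestion avoids the OOM.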
hithqd commented 7 months ago

Thanks. But when I use 2 GPUs and run the demo with print(response), the output is

[screenshot: the response is printed twice, once by each process]

Does it mean it infers on the 2 GPUs separately? And if I want to output only one response, what should I do?

ChrisLiu6 commented 7 months ago

No, the two GPUs work collaboratively rather than separately. In short, with model parallelism the weights of the embedding and linear layers are split and distributed across multiple GPUs, so each GPU holds only a part of the complete model. The matrix multiplications are correspondingly divided into sub-matrix multiplications, and the partial results are gathered so that the computation is equivalent to the classical single-GPU computation. Therefore, although both processes print the complete output, each of them has only performed part of the computation needed to produce it. That is also why the inference fits on 2 GPUs but not on one.
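
As a toy illustration of the description above (this is not LLaMA2-Accessory's actual implementation, just a single-process sketch of the idea), splitting a linear layer's weight across two hypothetical ranks and concatenating the partial results reproduces the single-GPU matmul:

```python
# Toy sketch of model (tensor) parallelism for one linear layer, run in a
# single process for clarity; real code would do the final "gather" with
# torch.distributed collectives across GPUs.
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)            # activations, shape (batch, in_features)
w = torch.randn(16, 8)           # full weight of a Linear(8 -> 16)

full = x @ w.t()                 # classical single-GPU result, shape (4, 16)

w0, w1 = w.chunk(2, dim=0)       # each "rank" holds half of the output features
part0 = x @ w0.t()               # would be computed on GPU 0
part1 = x @ w1.t()               # would be computed on GPU 1
gathered = torch.cat([part0, part1], dim=-1)   # the gather step

assert torch.allclose(full, gathered)   # equivalent to the single-GPU computation
```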

If you only want to see the output once, you can suppress printing on all ranks other than rank 0; an implementation of this can be found at https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/accessory/util/misc.py#L45
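
A minimal sketch of that idea, assuming the demo runs under torch.distributed (`response` here stands in for whatever the generation call returns); the helper linked above implements a more general version of this:

```python
# Minimal sketch: only rank 0 prints, so the response appears once.
import torch.distributed as dist

def is_main_process() -> bool:
    # Treat a non-distributed (single-GPU) run as the main process.
    return (not dist.is_initialized()) or dist.get_rank() == 0

response = "..."  # placeholder for the string produced by the SPHINX demo
if is_main_process():
    print(response)
```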