hithqd closed this issue 11 months ago
You can try:
Thanks. But when I use 2 GPUs and run the demo with print(response), the output is duplicated (printed once by each process).
Does that mean the inference runs on the 2 GPUs separately? And if I want to output only one response, what should I do?
No, the 2 GPUs work collaboratively rather than separately. In short, with model parallelism, the weights of the embedding and linear layers are split and distributed across multiple GPUs, so each GPU holds only a part of the complete model. The matrix multiplications are likewise divided into sub-matrix multiplications, and the results are gathered so that the computation is equivalent to the classical single-GPU computation. Therefore, while both processes print the complete output, each of them has performed only a part of the computation needed to produce it. That is also why the inference fits on 2 GPUs but not on one.
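To make the idea concrete, here is a minimal, self-contained sketch of how a column-parallel linear layer can be sharded across 2 GPUs using plain torch.distributed (this is an illustrative example, not the repository's actual implementation): each rank holds only a slice of the weight, computes a partial matmul, and an all-gather reconstructs the full output on every rank, which is why every process ends up printing the same complete result.

```python
# Illustrative sketch of tensor/model parallelism (not LLaMA2-Accessory's code):
# each rank keeps one shard of a linear layer's weight, computes a partial matmul,
# and an all-gather makes the result identical to the single-GPU computation.
import torch
import torch.distributed as dist

def column_parallel_linear(x, full_weight, rank, world_size):
    # Split the weight along the output dimension; each rank keeps one shard.
    # (Assumes out_features is divisible by world_size.)
    out_features = full_weight.shape[0]
    shard = out_features // world_size
    local_weight = full_weight[rank * shard:(rank + 1) * shard]  # (shard, in_features)

    # Each rank computes only its part of the matmul.
    local_out = x @ local_weight.t()  # (batch, shard)

    # Gather the partial outputs so every rank ends up with the full result.
    gathered = [torch.empty_like(local_out) for _ in range(world_size)]
    dist.all_gather(gathered, local_out)
    return torch.cat(gathered, dim=-1)  # (batch, out_features), same on every rank

if __name__ == "__main__":
    # Launch with e.g. `torchrun --nproc_per_node=2 tensor_parallel_sketch.py`.
    dist.init_process_group("nccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)

    torch.manual_seed(0)  # same seed so every rank builds the same reference tensors
    x = torch.randn(4, 8).cuda()
    full_weight = torch.randn(16, 8).cuda()

    out = column_parallel_linear(x, full_weight, rank, world_size)
    # Every rank now holds the complete output, even though each only did part of the work.
    print(rank, out.shape)
    dist.destroy_process_group()
```

Both ranks print the same (4, 16) output even though each stored only half of the weight, which mirrors why the demo prints the complete response once per process.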
If you only want to see the output once, you can suppress printing from ranks other than rank 0; an implementation can be found at https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/accessory/util/misc.py#L45
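For reference, the usual pattern (which the linked misc.py follows in spirit) is to override the built-in print so that only rank 0 writes to stdout. The sketch below is illustrative; the function name setup_print_for_rank0_only is made up here and may not match the repository's helper.

```python
# A minimal sketch of silencing non-master ranks; details may differ from the repo.
import builtins
import torch.distributed as dist

def setup_print_for_rank0_only():
    rank = dist.get_rank() if dist.is_initialized() else 0
    builtin_print = builtins.print

    def rank0_print(*args, **kwargs):
        # Only rank 0 actually writes to stdout; other ranks stay silent
        # unless the call explicitly passes force=True.
        force = kwargs.pop("force", False)
        if rank == 0 or force:
            builtin_print(*args, **kwargs)

    builtins.print = rank0_print
```

After calling setup_print_for_rank0_only() once during initialization, `print(response)` in the demo produces output only on rank 0, so the response appears a single time.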
Hi, thank you for your work. I am trying to use the SPHINX inference code you provided. I used the Single-GPU Inference code from the README.md, but an OutOfMemoryError was raised. How can I fix it?