wj210 opened this issue 10 months ago
Hi, the following approaches are available for evaluation/inference.
LLaMA2-Accessory supports multi-GPU inference:
If the model is not split by model parallelism (in LLaMA2-Accessory, model parallelism refers to tensor parallelism, i.e. tensor slicing), the code is nothing special compared to a normal distributed evaluation script. Given 8 GPUs, you may use torchrun or a similar launcher to start 8 processes, one per GPU. If the batch size of each process is bsz, the final effective batch size is 8 * bsz.
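For the pure data-parallel case, a minimal sketch might look like the following (the dataset, model, and script name `eval_dp.py` are placeholders, not part of LLaMA2-Accessory; each rank evaluates its own shard of the data):

```python
# Launch with: torchrun --nproc_per_node=8 eval_dp.py
# Each of the 8 processes owns one GPU and evaluates its own shard of the data,
# so with a per-process batch size of 4 the effective batch size is 8 * 4 = 32.
import os

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Placeholder data and model; substitute your evaluation set and the
    # LLaMA2-Accessory model loading code here.
    dataset = TensorDataset(torch.randn(1024, 16))
    sampler = DistributedSampler(dataset, shuffle=False)  # shards data across ranks
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)

    model = torch.nn.Linear(16, 16).cuda()
    model.eval()

    with torch.no_grad():
        for (batch,) in loader:
            _ = model(batch.cuda())
            # ... collect per-rank outputs (e.g. via dist.all_gather) and score ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```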
If model parallelism is needed, e.g. when the model is too large to serve on a single GPU, or when you want to lower latency by parallelizing the matrix multiplications across multiple GPUs, it is fine to organize the 8 GPUs into, for example, data-parallel size = 4 and model-parallel size = 2. You need to set up the data-parallel and model-parallel process groups accordingly (see the sketch below).
Finally, if the batch size on each GPU is bsz, you get an effective batch size of 4 * bsz.
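As a rough illustration of the 4 × 2 layout, here is a hedged sketch using fairscale's model-parallel initialization utilities (which LLaMA2-Accessory builds on); the script name `eval_mp.py` and the print statement are purely illustrative:

```python
# Launch with: torchrun --nproc_per_node=8 eval_mp.py
import os

import torch
import torch.distributed as dist
import fairscale.nn.model_parallel.initialize as fs_init


def setup(model_parallel_size: int = 2):
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # With world_size = 8 and model_parallel_size = 2, fairscale forms
    # 4 model-parallel groups of 2 GPUs each; the 4 groups act as
    # data-parallel replicas, so per-GPU batch size bsz gives 4 * bsz overall.
    fs_init.initialize_model_parallel(model_parallel_size)

    mp_rank = fs_init.get_model_parallel_rank()  # 0..1 within a replica
    dp_rank = fs_init.get_data_parallel_rank()   # 0..3 across replicas
    print(f"global rank {dist.get_rank()}: mp_rank={mp_rank}, dp_rank={dp_rank}")


if __name__ == "__main__":
    setup()
```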
In all the cases above, the relationship between GPUs and processes is one-to-one. On the other hand, if you want to use LLaMA2-Accessory models in a single-process-multi-GPU pattern, you may try the MultiGPUWrapper, which presents itself as a single complete model but under the hood launches n sub-processes to achieve n-way model parallelism.
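To make the idea concrete, below is a toy sketch of the single-process-multi-GPU pattern. It is not the actual MultiGPUWrapper API (check the LLaMA2-Accessory docs for that); it only shows how a wrapper can look like one model while fanning requests out to worker sub-processes:

```python
# Conceptual sketch only: a wrapper that mocks a single model but forwards each
# request to n worker sub-processes. In the real setting each worker would own
# one GPU / one model-parallel shard and run the sharded forward pass.
import torch.multiprocessing as mp


def _worker(rank, task_queues, result_queue):
    while True:
        prompt = task_queues[rank].get()
        if prompt is None:  # shutdown signal
            break
        if rank == 0:       # only one shard returns the final text
            result_queue.put(f"[rank {rank}] generated for: {prompt}")


class ToyMultiGPUWrapper:
    """Looks like one model; fans every request out to n sub-processes."""

    def __init__(self, n_procs: int = 2):
        ctx = mp.get_context("spawn")
        self.task_queues = [ctx.Queue() for _ in range(n_procs)]
        self.result_queue = ctx.Queue()
        self.procs = [
            ctx.Process(target=_worker, args=(r, self.task_queues, self.result_queue))
            for r in range(n_procs)
        ]
        for p in self.procs:
            p.start()

    def generate(self, prompt: str) -> str:
        for q in self.task_queues:  # every shard must see the same request
            q.put(prompt)
        return self.result_queue.get()

    def close(self):
        for q in self.task_queues:
            q.put(None)
        for p in self.procs:
            p.join()


if __name__ == "__main__":
    model = ToyMultiGPUWrapper(n_procs=2)
    print(model.generate("Hello"))
    model.close()
```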
Is there any way large-scale inference can be sped up? I tried removing the conversion in #4, which did speed things up by a factor of ~3, but it is still substantially slower than, say, using text-generation-inference with Hugging Face models.
Also, does the codebase support multi-GPU inference, where I run inference on multiple batches scattered across devices?