wj210 opened this issue 10 months ago
Hi, the following approaches are available for evaluation/inference.
LLaMA2-Accessory supports multi-GPU inference:
If the model is not split by model parallelism (in LLaMA2-Accessory, model parallelism refers to tensor parallelism, i.e. tensor slicing), the code is nothing special compared to a normal distributed evaluation script. Given 8 GPUs, you may use torchrun or a similar launcher to start 8 processes, one per GPU. If the batch size of each process is bsz, the final effective batch size is 8 * bsz.
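For the pure data-parallel case, a minimal sketch might look like the following (the dataset, model, and script name `eval_dp.py` are placeholders, not part of LLaMA2-Accessory; each rank evaluates its own shard of the data):

```python
# Launch with: torchrun --nproc_per_node=8 eval_dp.py
# Each of the 8 processes owns one GPU and evaluates its own shard of the data,
# so with a per-process batch size of 4 the effective batch size is 8 * 4 = 32.
import os

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Placeholder data and model; substitute your evaluation set and the
    # LLaMA2-Accessory model loading code here.
    dataset = TensorDataset(torch.randn(1024, 16))
    sampler = DistributedSampler(dataset, shuffle=False)  # shards data across ranks
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)

    model = torch.nn.Linear(16, 16).cuda()
    model.eval()

    with torch.no_grad():
        for (batch,) in loader:
            _ = model(batch.cuda())
            # ... collect per-rank outputs (e.g. via dist.all_gather) and score ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```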
If model parallelism is needed, e.g. when the model is too large to serve on a single GPU, or when you want to lower latency by parallelizing the matrix multiplications across multiple GPUs, it is fine to organize the 8 GPUs into, for example, data-parallel size = 4 and model-parallel size = 2. You need to set up the data-parallel and model-parallel process groups accordingly (see the sketch below).
Finally, if the batch size on each GPU is bsz, you get an effective batch size of 4 * bsz.
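As a rough illustration of the 4 × 2 layout, here is a hedged sketch using fairscale's model-parallel initialization utilities (which LLaMA2-Accessory builds on); the script name `eval_mp.py` and the print statement are purely illustrative:

```python
# Launch with: torchrun --nproc_per_node=8 eval_mp.py
import os

import torch
import torch.distributed as dist
import fairscale.nn.model_parallel.initialize as fs_init


def setup(model_parallel_size: int = 2):
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # With world_size = 8 and model_parallel_size = 2, fairscale forms
    # 4 model-parallel groups of 2 GPUs each; the 4 groups act as
    # data-parallel replicas, so per-GPU batch size bsz gives 4 * bsz overall.
    fs_init.initialize_model_parallel(model_parallel_size)

    mp_rank = fs_init.get_model_parallel_rank()  # 0..1 within a replica
    dp_rank = fs_init.get_data_parallel_rank()   # 0..3 across replicas
    print(f"global rank {dist.get_rank()}: mp_rank={mp_rank}, dp_rank={dp_rank}")


if __name__ == "__main__":
    setup()
```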
In all the cases above, the relationship between GPUs and processes is one-to-one. On the other hand, if you want to use LLaMA2-Accessory models in a single-process-multi-GPU pattern, you may try the MultiGPUWrapper, which presents itself as a single complete model but under the hood launches n sub-processes to achieve n-way model parallelism.
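To make the idea concrete, below is a toy sketch of the single-process-multi-GPU pattern. It is not the actual MultiGPUWrapper API (check the LLaMA2-Accessory docs for that); it only shows how a wrapper can look like one model while fanning requests out to worker sub-processes:

```python
# Conceptual sketch only: a wrapper that mocks a single model but forwards each
# request to n worker sub-processes. In the real setting each worker would own
# one GPU / one model-parallel shard and run the sharded forward pass.
import torch.multiprocessing as mp


def _worker(rank, task_queues, result_queue):
    while True:
        prompt = task_queues[rank].get()
        if prompt is None:  # shutdown signal
            break
        if rank == 0:       # only one shard returns the final text
            result_queue.put(f"[rank {rank}] generated for: {prompt}")


class ToyMultiGPUWrapper:
    """Looks like one model; fans every request out to n sub-processes."""

    def __init__(self, n_procs: int = 2):
        ctx = mp.get_context("spawn")
        self.task_queues = [ctx.Queue() for _ in range(n_procs)]
        self.result_queue = ctx.Queue()
        self.procs = [
            ctx.Process(target=_worker, args=(r, self.task_queues, self.result_queue))
            for r in range(n_procs)
        ]
        for p in self.procs:
            p.start()

    def generate(self, prompt: str) -> str:
        for q in self.task_queues:  # every shard must see the same request
            q.put(prompt)
        return self.result_queue.get()

    def close(self):
        for q in self.task_queues:
            q.put(None)
        for p in self.procs:
            p.join()


if __name__ == "__main__":
    model = ToyMultiGPUWrapper(n_procs=2)
    print(model.generate("Hello"))
    model.close()
```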
Is there any way large-scale inference can be sped up? I tried removing the conversion in #4, which did speed things up by a factor of ~3, but it is still substantially slower than, say, using text-generation-inference with Hugging Face models.
Also, does the codebase support multi-GPU inference, where I run inference on multiple batches scattered across devices?