Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

slow inference #149

Open wj210 opened 8 months ago

wj210 commented 8 months ago

Is there any way large-scale inference can be sped up? I tried removing the conversion in #4, which did speed things up by a factor of ~3, but it is still substantially slower than, say, using text-generation-inference on Hugging Face models.

Also, does the codebase support multi-GPU inference, where I run inference on multiple batches scattered across devices?

ChrisLiu6 commented 8 months ago

Hi, the following methods are for evaluation/inference:

https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/6357b07197ee4edbac045ad97a8dcbfce9cfa05c/accessory/model/meta.py#L299-L300

https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/6357b07197ee4edbac045ad97a8dcbfce9cfa05c/accessory/model/meta.py#L372-L380
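
For reference, a minimal sketch of how such a call might look, assuming `model` is an already-built and loaded `MetaModel` instance; the argument names (`max_gen_len`, `temperature`, `top_p`) are illustrative assumptions and may differ from the actual signature in `accessory/model/meta.py`:

```python
# Hedged sketch: `model` is assumed to be a loaded MetaModel; the generate
# arguments below are illustrative and may not match the real signature.
import torch

prompts = [
    "Explain model parallelism in one sentence.",
    "Summarize the LLaMA2 architecture.",
]

with torch.no_grad():
    # Passing the whole batch in one call (rather than looping prompt by
    # prompt) is usually the main lever for speeding up large-scale inference.
    outputs = model.generate(prompts, max_gen_len=256, temperature=0.0, top_p=1.0)

for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out)
```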

LLaMA2-Accessory supports multi-GPU inference:

https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/6357b07197ee4edbac045ad97a8dcbfce9cfa05c/accessory/model/multi_gpu_wrapper.py#L143

which presents itself as a single complete model but, under the hood, launches n sub-processes to achieve n-way model parallelism.
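
If you additionally want to scatter different batches across devices (data parallelism on top of, or instead of, the model parallelism the wrapper provides), a generic sketch along the following lines could work. Note this is not the repo's confirmed API: `load_model_for_rank` and the `model.generate` call are hypothetical placeholders, and the sharding logic is just standard `torch.distributed` usage.

```python
# Generic data-sharding sketch, not LLaMA2-Accessory's confirmed API.
# `load_model_for_rank` and `model.generate` are hypothetical placeholders.
import torch.distributed as dist

ALL_PROMPTS = ["prompt %d" % i for i in range(1000)]  # placeholder workload

def shard(items, rank, world_size):
    # Round-robin sharding so every rank receives a disjoint subset.
    return items[rank::world_size]

def main():
    dist.init_process_group(backend="nccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()

    model = load_model_for_rank(rank)  # hypothetical: build/load the (wrapped) model
    local_prompts = shard(ALL_PROMPTS, rank, world_size)
    local_outputs = [model.generate([p]) for p in local_prompts]  # hypothetical call

    # Collect every rank's results on rank 0 for post-processing.
    gathered = [None] * world_size if rank == 0 else None
    dist.gather_object(local_outputs, gathered, dst=0)
    if rank == 0:
        results = [out for part in gathered for out in part]
        print(f"collected {len(results)} generations")

if __name__ == "__main__":
    main()
```

Launched with e.g. `torchrun --nproc_per_node=N script.py`, each process then works on its own shard of the prompts, so throughput scales with the number of replicas.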