Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

serving with multi-GPU #1482

Open richardzhuang0412 opened 2 months ago

richardzhuang0412 commented 2 months ago

I was testing `litgpt serve` with Llama 3 70B on 4× A100 80GB and got an OOM error. When I tried the same command with Llama 2 13B, it seems that the `devices` argument only loads multiple replicas of the same model rather than sharding it across GPUs. Is there any way to serve the model across multiple GPUs so its memory is distributed?
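For reference, the command I ran was along these lines (the checkpoint path here is illustrative):

```bash
# Illustrative command; adjust the checkpoint path to your local download.
litgpt serve checkpoints/meta-llama/Meta-Llama-3-70B-Instruct --devices 4
```

Each of the 4 devices ends up holding a full copy of the model, which is what runs out of memory.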

rasbt commented 2 months ago

Unfortunately, multi-GPU inference is not supported yet, but that's something on the roadmap.

awaelchli commented 2 months ago

There is a generate script, `litgpt/generate/tp.py`, that uses tensor parallelism for multi-GPU inference. You could adapt it for serving.
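A rough sketch of how you might invoke it; the module path and flags here are assumptions on my part, so check the script's `--help` against the current repo first:

```bash
# Assumed invocation of the tensor-parallel generate script; the module
# path and flag names are guesses, verify against the repo before use.
python -m litgpt.generate.tp \
  --checkpoint_dir checkpoints/meta-llama/Meta-Llama-3-70B-Instruct \
  --prompt "What do llamas eat?"
```

Tensor parallelism shards each weight matrix across the GPUs, so the 70B parameters are split roughly evenly across devices instead of being replicated on each one.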