bigscience-workshop / petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
https://petals.dev
MIT License

How to get faster inference? #414

Open · ryanshrott opened this issue 1 year ago

ryanshrott commented 1 year ago

I added my RTX 3080 to the swarm using:

```
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install git+https://github.com/bigscience-workshop/petals
python -m petals.cli.run_server enoch/llama-65b-hf --adapters timdettmers/guanaco-65b
```

But inference is still pretty slow, around 2 minutes for 200 tokens. Why is that?
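For context, here is a minimal client-side sketch I use to measure generation throughput in tokens/s. It assumes the public Petals client API (`AutoDistributedModelForCausalLM`) and the same model name as above; the prompt text is arbitrary:

```python
import time

from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Same model that the server command above is hosting.
MODEL_NAME = "enoch/llama-65b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)

# Arbitrary prompt; only the timing matters here.
inputs = tokenizer("A chat between a user and an assistant.", return_tensors="pt")["input_ids"]

start = time.perf_counter()
outputs = model.generate(inputs, max_new_tokens=200)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs.shape[1]
print(f"{new_tokens} tokens in {elapsed:.1f} s -> {new_tokens / elapsed:.2f} tokens/s")
```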

borzunov commented 1 year ago

Hi @ryanshrott,

It's possible that most servers for this model are too far away from you geographically, so you have large pings. You'll benefit from more servers hosting this model in your region.
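To put rough numbers on this: 200 tokens in ~2 minutes is about 0.6 s per token, and with autoregressive generation every token's activations have to traverse each server holding part of the model. A back-of-the-envelope sketch (the hop count and ping values below are assumptions for illustration, not measurements):

```python
# Illustrative latency budget for the reported speed; all swarm numbers are assumed.
tokens = 200
total_seconds = 120                  # ~2 minutes reported above
per_token = total_seconds / tokens   # 0.6 s per generated token

num_hops = 3                         # assumed number of servers holding the 65B model
rtt = 0.15                           # assumed 150 ms round trip to each server
network_per_token = num_hops * rtt   # 0.45 s per token spent on the network alone

print(f"observed: {per_token:.2f} s/token, "
      f"of which ~{network_per_token:.2f} s could be network latency")
```

With numbers in that range, network latency alone can account for most of the observed time, which is why having servers closer to you makes such a difference.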

I know that many Llama 2 servers are in North America, so I'd guess you won't have this issue with Llama 2.