**Open** · ryanshrott opened this issue 1 year ago
Hi @ryanshrott,
It's possible that most servers hosting this model are geographically far from you, so your ping to them is high. You'd benefit from more servers hosting this model in your region.
I know that many Llama 2 servers are in North America, so I'd guess you won't have this issue with Llama 2.
I added my RTX 3080 to the swarm using:

```shell
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install git+https://github.com/bigscience-workshop/petals
python -m petals.cli.run_server enoch/llama-65b-hf --adapters timdettmers/guanaco-65b
```
But I still find inference to be pretty slow, around 2 minutes for 200 tokens. Why is that?
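For context, 200 tokens in roughly 2 minutes works out to under 2 tokens/s. A minimal sketch for measuring your own throughput (the `generate_fn` callable is a hypothetical placeholder for whatever wraps your model's generate call; the token count must come from your own run):

```python
import time

def timed_generate(generate_fn, n_new_tokens: int) -> float:
    """Run a zero-arg generation callable and return tokens/second.

    generate_fn is assumed to produce n_new_tokens new tokens, e.g. a
    lambda wrapping a distributed model's generate() call.
    """
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed

# The figure reported above: 200 tokens in ~120 s.
print(round(200 / 120, 2))  # ~1.67 tokens/s
```

Comparing this number across prompts (and times of day) helps tell a slow server apart from transient network latency.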