c0sogi / llama-api

An OpenAI-like LLaMA inference API

exllama GPU split #21

Open atisharma opened 1 year ago

atisharma commented 1 year ago

It's not clear from the documentation how to split VRAM over multiple GPUs with exllama.

atisharma commented 1 year ago

For future readers: it can be done by adding the following line to the model definition in model_definitions.py (e.g. to split a 70B model across two cards):

    auto_map=[17.5, 22],
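For context, a fuller definition might look like the sketch below. It assumes the ExllamaModel schema from this repo's model_definitions.py examples; the model path and token limit are placeholders, and auto_map lists the approximate VRAM budget (in GB) to allocate on each GPU, in device order.

    # model_definitions.py -- a minimal sketch; path and limits are placeholders
    from llama_api.schemas.models import ExllamaModel

    # A 70B GPTQ model split across two GPUs.
    my_70b_gptq = ExllamaModel(
        model_path="TheBloke/Llama-2-70B-GPTQ",  # model folder/repo name
        max_total_tokens=4096,
        auto_map=[17.5, 22],  # ~17.5 GB on GPU 0, ~22 GB on GPU 1
    )

Giving the first card a smaller share than its full capacity is a common pattern, since GPU 0 often also carries scratch buffers and whatever else is running on the system.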