c0sogi / llama-api

An OpenAI-like LLaMA inference API

exllama GPU split #21

Open atisharma opened 1 year ago

atisharma commented 1 year ago

It's not clear from the documentation how to split VRAM over multiple GPUs with exllama.

atisharma commented 1 year ago

For future readers: it can be done by adding the following line to the model definition in model_definitions.py (e.g. to split a 70B model across two cards):

    auto_map=[17.5, 22],
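For context, a fuller definition might look like the sketch below. It assumes the ExllamaModel schema from this repo's model_definitions.py examples; the model path and token limit are placeholders, and auto_map lists the approximate VRAM budget (in GB) to allocate on each GPU, in device order.

    # model_definitions.py -- a minimal sketch; path and limits are placeholders
    from llama_api.schemas.models import ExllamaModel

    # A 70B GPTQ model split across two GPUs.
    my_70b_gptq = ExllamaModel(
        model_path="TheBloke/Llama-2-70B-GPTQ",  # model folder/repo name
        max_total_tokens=4096,
        auto_map=[17.5, 22],  # ~17.5 GB on GPU 0, ~22 GB on GPU 1
    )

Giving the first card a smaller share than its full capacity is a common pattern, since GPU 0 often also carries scratch buffers and whatever else is running on the system.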