EricLBuehler / mistral.rs

Tensor parallel support for multi GPU #617

Open · ilookee opened this issue 1 month ago

ilookee commented 1 month ago

Hello, I'm not sure whether multi GPU is supported yet. I didn't find any parameters for tensor parallelism, and the "num_device_layers" parameter doesn't seem to work. Please let me know whether multi GPU is supported or planned. Thanks for your awesome work!

EricLBuehler commented 1 month ago

Hi @ilookee! Cross-GPU device mapping is supported. Each element follows the format ORD:NUM, where ORD is the device ordinal and NUM is the number of layers on that device; elements are delimited by commas.

What we do not support yet is true multi GPU inference (tensor parallelism). That would be done with something like NCCL, and we do have plans to implement it.
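As a rough illustration of the mapping syntax described above (the model ID, architecture, and layer counts are placeholders, not values from this thread):

```
# Illustrative sketch only: put the first 16 layers on GPU 0 and the next 16 on GPU 1.
# The working command later in this thread delimits entries with ';'.
mistralrs-server -n "0:16;1:16" plain -m meta-llama/Meta-Llama-3.1-8B-Instruct -a llama
```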

EricLBuehler commented 1 month ago

@oldgithubman is this what you had in mind?

oldgithubman commented 1 month ago

> @oldgithubman is this what you had in mind?

If you're referring to my requests for distributed inference, I mean the ability to use multiple GPUs, CPUs (RAM), and nodes (over LAN), basically being able to use all the hardware at my disposal. I'm already mostly able to do this with llama.cpp (except on Windows). It lets me run models 2-3x larger than I otherwise could, so I'd say this is the biggest advantage llama.cpp has at the moment.

murtaza-nasir commented 1 month ago

> Hi @ilookee! Cross-GPU device mapping is supported. Each element follows the format ORD:NUM, where ORD is the device ordinal and NUM is the number of layers on that device; elements are delimited by commas.
>
> What we do not support yet is true multi GPU inference (tensor parallelism). That would be done with something like NCCL, and we do have plans to implement it.

Does this mean that I can't load a 70B across multiple 3090s and use it for inference?

EricLBuehler commented 1 month ago

Hey @murtaza-nasir! You can absolutely do this; just use the format ORDINAL:NLAYERS,...

Please let me know if I can provide any help.

murtaza-nasir commented 1 month ago

Thank you for your quick response! I'm trying to figure out how to use all the various model files I have with mistral.rs, and I can't find any examples of loading large models with different arguments. Can you give me an example of how I can load the model /home/user/work/ml/models/meta-llama_Meta-Llama-3.1-70B-Instruct with ISQ across four 3090 GPUs?

But even before that, I'm trying to run a GGUF of an 8B model on a single GPU and I must be doing something very stupid. I'm trying to follow the instructions to start a server. After successfully installing from source, I run:

./mistralrs_server -i plain -m /home/user/work/ml/models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf -a llama

But I get:

-bash: ./mistralrs_server: No such file or directory
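For anyone hitting the same error: a build from source typically puts the binary under target/release rather than the repo root. A minimal sketch, assuming a standard CUDA cargo build and the mistralrs-server binary name used later in this thread:

```
# Assumption: standard cargo build from the mistral.rs repo root with CUDA enabled.
cargo build --release --features cuda

# The binary is then produced under target/release, not the repo root:
./target/release/mistralrs-server -i gguf -m /home/user/work/ml/models/ -f Meta-Llama-3.1-8B-Instruct-Q6_K.gguf
```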
murtaza-nasir commented 1 month ago

OK, sorry, I got it to work, but now I can't figure out how to specify the GPUs. I can't find documentation on which argument sets this.

mistralrs-server --chat-template /home/user/work/ml/mistral.rs/chat_templates/llama3.json gguf  -m /home/user/work/ml/models/ -f Meta-Llama-3.1-70B-Instruct.Q5_K_S.gguf -a llama 

EDIT: This doesn't work:

mistralrs-server --chat-template /home/murtaza/work/ml/mistral.rs/chat_templates/llama3.json gguf -m /home/murtaza/work/ml/text-generation-webui/models/ -f Meta-Llama-3.1-70B-Instruct.Q5_K_S.gguf -n "0:20;1:20;2:20;3:20"

oldgithubman commented 1 month ago

> Thank you for your quick response! I'm trying to figure out how to use all the various model files I have with mistral.rs, and I can't find any examples of loading large models with different arguments. Can you give me an example of how I can load the model /home/user/work/ml/models/meta-llama_Meta-Llama-3.1-70B-Instruct with ISQ across four 3090 GPUs?
>
> But even before that, I'm trying to run a GGUF of an 8B model on a single GPU and I must be doing something very stupid. I'm trying to follow the instructions to start a server. After successfully installing from source, I run:
>
> ./mistralrs_server -i plain -m /home/user/work/ml/models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf -a llama
>
> But I get:
>
> -bash: ./mistralrs_server: No such file or directory

I agree the readme is a little confusing. Make suggestions!

EricLBuehler commented 1 month ago

@oldgithubman @murtaza-nasir

The following should work:

mistralrs-server --chat-template /home/murtaza/work/ml/mistral.rs/chat_templates/llama3.json -n "0:20;1:20;2:20;3:20" gguf -m /home/murtaza/work/ml/text-generation-webui/models/ -f Meta-Llama-3.1-70B-Instruct.Q5_K_S.gguf 

Notice that -n goes before the gguf subcommand, not after it. You can see this if you run mistralrs-server --help.
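For the earlier question about running the safetensors 70B with ISQ across four GPUs, a sketch along the same lines might look like this; the --isq flag and Q4K value follow the README's conventions and the 20-layer-per-GPU split is a guess, so treat it as an assumption rather than a command verified in this thread:

```
# Hypothetical: plain (safetensors) model with in-situ quantization (ISQ),
# 80 layers split across four GPUs via -n (ordinal:layer-count, ';'-delimited).
mistralrs-server --isq Q4K -n "0:20;1:20;2:20;3:20" plain -m /home/user/work/ml/models/meta-llama_Meta-Llama-3.1-70B-Instruct -a llama
```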

oldgithubman commented 1 month ago

> @oldgithubman @murtaza-nasir
>
> The following should work:
>
> mistralrs-server --chat-template /home/murtaza/work/ml/mistral.rs/chat_templates/llama3.json -n "0:20;1:20;2:20;3:20" gguf -m /home/murtaza/work/ml/text-generation-webui/models/ -f Meta-Llama-3.1-70B-Instruct.Q5_K_S.gguf
>
> Notice that -n goes before the gguf subcommand, not after it. You can see this if you run mistralrs-server --help.

To be fair, it can be ambiguous/confusing when some arguments require a specific order. Any progress on distributed inference over LAN?