BlackSamorez / tensor_parallel

Automatically split your PyTorch models on multiple GPUs for training & inference

Would it be suitable for multi-GPU parallel inference with Llama 2? #118

Open aclie opened 1 year ago

aclie commented 1 year ago

Hi, I've built a chatbot using Llama2 on a machine equipped with four GPUs, each with 16GB of memory. However, it appears that only 'cuda:0' is currently being utilized. Consequently, we are experiencing high latency, approximately 60 seconds per question. I'm wondering if Tensor Parallel can help us leverage the other CUDA devices. I've attempted the following:

import torch
import tensor_parallel as tp
from transformers import AutoTokenizer, AutoModelForCausalLM
from langchain.embeddings import HuggingFaceEmbeddings  # import path may differ by LangChain version

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", local_files_only=True,
    low_cpu_mem_usage=True, torch_dtype=torch.float16, load_in_4bit=True,
)
# shard the model across two GPUs
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])
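
For reference, here is a minimal end-to-end sketch of how generation with a tensor_parallel-sharded model is usually wired up, following the generate pattern from the project's README. It assumes all four 16 GB GPUs are used and keeps the weights in fp16 rather than 4-bit (I'm not certain bitsandbytes 4-bit weights shard cleanly); the prompt and generation settings are placeholders:

import torch
import tensor_parallel as tp
from transformers import AutoTokenizer, AutoModelForCausalLM

devices = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]  # assumption: use all four GPUs

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", local_files_only=True,
    low_cpu_mem_usage=True, torch_dtype=torch.float16,
)
model = tp.tensor_parallel(model, devices)  # shard weights across the listed devices

prompt = "What is tensor parallelism?"  # placeholder prompt
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(devices[0])
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

The inputs are placed on the first device in the list, as in the README example; the library handles moving activations between shards during the forward pass.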

Please let me know if you have any suggestions or advice. Thanks in advance!