abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Multiple GPU incredibly slow inference #1026

Open sadaisystems opened 11 months ago

sadaisystems commented 11 months ago

Expected Behavior

Same or comparable inference speed on a single A100 vs 2 A100 setup.

Current Behavior

GPU inference is 30-60x slower when both GPUs are available to the inference process, compared to a single-GPU run:

[screenshot of inference timing stats]

The best workaround I found is to manually hide the second GPU using CUDA_VISIBLE_DEVICES="0".

The model is initialized with main_gpu=0, tensor_split=None. In addition, when both GPUs are visible, the tensor_split option doesn't work as expected: nvidia-smi shows that both GPUs are used.
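
A minimal sketch of the workaround, assuming the standard llama_cpp.Llama constructor (the model path is a placeholder):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # hide the second GPU; must be set before CUDA is initialized

from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,    # offload all layers to the single visible GPU
    main_gpu=0,         # same setting as in the report above
    tensor_split=None,  # default: no splitting across devices
)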

Environment and Context

2x A100 GPU server, CUDA 12.1; evaluated llama-cpp-python versions 0.2.11, 0.2.13, and 0.2.19 with the cuBLAS backend.

zpin commented 11 months ago

I have the same issue with 2 x RTX 3090. It looks like the main difference between 1 and 2 GPUs is the prompt eval speed:

0.2.22

2 GPUs:

llama_print_timings:        load time =   57265.05 ms
llama_print_timings:      sample time =     146.24 ms /   898 runs   (    0.16 ms per token,  6140.42 tokens per second)
llama_print_timings: prompt eval time =  332345.40 ms /  2602 tokens (  127.73 ms per token,     7.83 tokens per second)
llama_print_timings:        eval time =  382931.09 ms /   897 runs   (  426.90 ms per token,     2.34 tokens per second)
llama_print_timings:       total time =  716875.28 ms

1 GPU:

llama_print_timings:        load time =    7602.43 ms
llama_print_timings:      sample time =     121.12 ms /   747 runs   (    0.16 ms per token,  6167.28 tokens per second)
llama_print_timings: prompt eval time =   63531.33 ms /  2602 tokens (   24.42 ms per token,    40.96 tokens per second)
llama_print_timings:        eval time =  445772.00 ms /   746 runs   (  597.55 ms per token,     1.67 tokens per second)
llama_print_timings:       total time =  510564.07 ms

I also tried with the latest version; the KV cache did make a big difference, but prompt eval speed is still much slower with 2 GPUs.
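
If "KV cache on" refers to offloading the KV cache to the GPU, that presumably corresponds to the offload_kqv option available in recent llama-cpp-python releases; a minimal sketch, with a placeholder model path:

from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,
    offload_kqv=True,  # keep the KV cache on the GPU (assumed meaning of "KV cache on")
)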

0.2.24, KV cache on:

2 GPUs:

llama_print_timings:        load time =   58292.49 ms
llama_print_timings:      sample time =     119.42 ms /   753 runs   (    0.16 ms per token,  6305.37 tokens per second)
llama_print_timings: prompt eval time =  296752.33 ms /  2602 tokens (  114.05 ms per token,     8.77 tokens per second)
llama_print_timings:        eval time =  110105.82 ms /   752 runs   (  146.42 ms per token,     6.83 tokens per second)
llama_print_timings:       total time =  408040.80 ms

1 GPU:

llama_print_timings:        load time =    5934.27 ms
llama_print_timings:      sample time =     152.08 ms /   958 runs   (    0.16 ms per token,  6299.19 tokens per second)
llama_print_timings: prompt eval time =   41548.25 ms /  2602 tokens (   15.97 ms per token,    62.63 tokens per second)
llama_print_timings:        eval time =  550712.32 ms /   957 runs   (  575.46 ms per token,     1.74 tokens per second)
llama_print_timings:       total time =  593970.61 ms

abetlen commented 11 months ago

@zpin is it different than the llama.cpp inference speed when you build that from master?

zpin commented 11 months ago

I just built from main; here are the results:

1 GPU:

llama_print_timings:        load time =    5746.74 ms
llama_print_timings:      sample time =     153.75 ms /   967 runs   (    0.16 ms per token,  6289.47 tokens per second)
llama_print_timings: prompt eval time =   40276.57 ms /  2602 tokens (   15.48 ms per token,    64.60 tokens per second)
llama_print_timings:        eval time =  551080.36 ms /   966 runs   (  570.48 ms per token,     1.75 tokens per second)
llama_print_timings:       total time =  593127.22 ms

2 GPUs:

llama_print_timings:        load time =   57944.34 ms
llama_print_timings:      sample time =     175.91 ms /  1100 runs   (    0.16 ms per token,  6253.38 tokens per second)
llama_print_timings: prompt eval time =  295189.21 ms /  2602 tokens (  113.45 ms per token,     8.81 tokens per second)
llama_print_timings:        eval time =  159902.95 ms /  1099 runs   (  145.50 ms per token,     6.87 tokens per second)
llama_print_timings:       total time =  456985.76 ms

This is with a 70B_Q4_K_M.gguf model. I've tried the same model as a 70b-5.0bpw-h6-exl2 with exllamav2 on the same two cards. It's much faster and starts generating immediately: Output generated in 57.99 seconds (14.59 tokens/s, 846 tokens, context 2603, seed 1467096981)

williamgomez71 commented 10 months ago

I have the same issue. When I use only 1 NVIDIA 4090 (CUDA_VISIBLE_DEVICES="0"), I achieve 54 tokens per second:

print_timings: prompt eval time =   176.65 ms /  58 tokens (   3.05 ms per token, 328.33 tokens per second)
print_timings:        eval time =  4020.70 ms / 218 runs   (  18.44 ms per token,  54.22 tokens per second)
print_timings:       total time =  4197.36 ms

When I use 2 NVIDIA 4090s (CUDA_VISIBLE_DEVICES="0, 1"), I get close to 20 tokens per second:

print_timings: prompt eval time =   672.33 ms /  58 tokens (  11.59 ms per token,  86.27 tokens per second)
print_timings:        eval time = 16435.21 ms / 325 runs   (  50.57 ms per token,  19.77 tokens per second)
print_timings:       total time = 17107.53 ms

When I use 3 NVIDIA 4090s (CUDA_VISIBLE_DEVICES="0, 1, 2"), I get close to 4 tokens per second:

print_timings: prompt eval time =  5154.32 ms /  85 tokens (  60.64 ms per token,  16.49 tokens per second)
print_timings:        eval time =  4598.41 ms /  16 runs   ( 287.40 ms per token,   3.48 tokens per second)
print_timings:       total time =  9752.73 ms

As you can see, the results are divided by almost 4 with each additional card. I am currently working on Windows 11; I don't know whether the operating system has anything to do with it, or whether there is some configuration or compile parameter I should use when building llama.cpp. Thanks for the help.
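
A rough sketch of how such a per-GPU-count comparison can be scripted, running each configuration in a fresh process so that CUDA_VISIBLE_DEVICES takes effect ("bench.py" is a placeholder for whatever script loads the model and prints the timings):

import os
import subprocess

# Run the same benchmark once per device configuration.
for devices in ("0", "0,1", "0,1,2"):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=devices)
    print(f"=== CUDA_VISIBLE_DEVICES={devices} ===")
    subprocess.run(["python", "bench.py"], env=env, check=True)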

williamgomez71 commented 10 months ago

@abetlen I compiled from master on 2 January. Do you recommend using another branch and testing with that instead? If so, please let me know and I'll compile it and test again.

I'm compiling with only the -DLLAMA_CUBLAS=ON flag, nothing more:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release

cognitivetech commented 10 months ago

I'm getting 10x slower performance than I did 2 weeks ago.

I have tried many different versions from 0.2.18 and up...

Something tells me this problem is not due to llama-cpp-python itself, because I know those versions were working for me previously; now most of them aren't, and the ones that are working are just very slow.

I am not alone; a small team of us is examining why it's so slow now.