sadaisystems commented 11 months ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

[ ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
[x] I carefully followed the README.md.
[x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
[x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Same or comparable inference speed on a single A100 vs 2 A100 setup.

Current Behavior

GPU inference stats when all two GPUs are available to the inference process (30-60x) slower when compared to a single GPU run:

The best solution i found is to manually hide the second GPU using CUDA_VISIBLE_DEVICES="0".

The model is initialized with main_gpu=0, tensor_split=None. In addition, when all 2 GPUs are visible, tensor_split option doesnt work as expected, since nvidia-smi shows, that both GPUs are used.

Environment and Context

2x A100 GPU server, cuda 12.1, evaluated llama-cpp-python versions: 2.11, 2.13, 2.19 with cuBLAS backend

zpin commented 11 months ago

I have the same issue with 2 x RTX 3090. It looks like the main difference between 1 and 2 GPUs is the prompt eval speed:

0.2.22

2 GPUs:

llama_print_timings:        load time =   57265.05 ms
llama_print_timings:      sample time =     146.24 ms /   898 runs   (    0.16 ms per token,  6140.42 tokens per second)
llama_print_timings: prompt eval time =  332345.40 ms /  2602 tokens (  127.73 ms per token,     7.83 tokens per second)
llama_print_timings:        eval time =  382931.09 ms /   897 runs   (  426.90 ms per token,     2.34 tokens per second)
llama_print_timings:       total time =  716875.28 ms

1 GPU:

llama_print_timings:        load time =    7602.43 ms
llama_print_timings:      sample time =     121.12 ms /   747 runs   (    0.16 ms per token,  6167.28 tokens per second)
llama_print_timings: prompt eval time =   63531.33 ms /  2602 tokens (   24.42 ms per token,    40.96 tokens per second)
llama_print_timings:        eval time =  445772.00 ms /   746 runs   (  597.55 ms per token,     1.67 tokens per second)
llama_print_timings:       total time =  510564.07 ms

I also tried with the latest version, the KV cache did make a big difference but prompt eval speed is still much slower with 2 GPUs.

0.2.24, KV cache on:

2 GPUs:

llama_print_timings:        load time =   58292.49 ms
llama_print_timings:      sample time =     119.42 ms /   753 runs   (    0.16 ms per token,  6305.37 tokens per second)
llama_print_timings: prompt eval time =  296752.33 ms /  2602 tokens (  114.05 ms per token,     8.77 tokens per second)
llama_print_timings:        eval time =  110105.82 ms /   752 runs   (  146.42 ms per token,     6.83 tokens per second)
llama_print_timings:       total time =  408040.80 ms

1 GPU:

llama_print_timings:        load time =    5934.27 ms
llama_print_timings:      sample time =     152.08 ms /   958 runs   (    0.16 ms per token,  6299.19 tokens per second)
llama_print_timings: prompt eval time =   41548.25 ms /  2602 tokens (   15.97 ms per token,    62.63 tokens per second)
llama_print_timings:        eval time =  550712.32 ms /   957 runs   (  575.46 ms per token,     1.74 tokens per second)
llama_print_timings:       total time =  593970.61 ms

abetlen commented 11 months ago

@zpin is it different than the llama.cpp inference speed when you build that from master?

zpin commented 11 months ago

I just built from main, here are the results:

1 GPU:

llama_print_timings:        load time =    5746.74 ms
llama_print_timings:      sample time =     153.75 ms /   967 runs   (    0.16 ms per token,  6289.47 tokens per second)
llama_print_timings: prompt eval time =   40276.57 ms /  2602 tokens (   15.48 ms per token,    64.60 tokens per second)
llama_print_timings:        eval time =  551080.36 ms /   966 runs   (  570.48 ms per token,     1.75 tokens per second)
llama_print_timings:       total time =  593127.22 ms

2 GPUs:

llama_print_timings:        load time =   57944.34 ms
llama_print_timings:      sample time =     175.91 ms /  1100 runs   (    0.16 ms per token,  6253.38 tokens per second)
llama_print_timings: prompt eval time =  295189.21 ms /  2602 tokens (  113.45 ms per token,     8.81 tokens per second)
llama_print_timings:        eval time =  159902.95 ms /  1099 runs   (  145.50 ms per token,     6.87 tokens per second)
llama_print_timings:       total time =  456985.76 ms

This is with a 70B_Q4_K_M.gguf model. I've tried the same model as a 70b-5.0bpw-h6-exl2 with exllamav2 with the same two cards. It's much faster and starts generating immediately: Output generated in 57.99 seconds (14.59 tokens/s, 846 tokens, context 2603, seed 1467096981)

williamgomez71 commented 10 months ago

I have the same issue, when I use only 1 nvidia 4090 I archive 54 tokens per second CUDA_VISIBLE_DEVICES="0".

print_timings: prompt eval time = 176.65 ms / 58 tokens ( 3.05 ms per token, 328.33 tokens per second) print_timings: eval time = 4020.70 ms / 218 runs ( 18.44 ms per token, 54.22 tokens per second) print_timings: total time = 4197.36 ms

when I use 2 nvidia 4090 the tokens by seconds are near to 20 per second CUDA_VISIBLE_DEVICES="0, 1" print_timings: prompt eval time = 672.33 ms / 58 tokens ( 11.59 ms per token, 86.27 tokens per second) print_timings: eval time = 16435.21 ms / 325 runs ( 50.57 ms per token, 19.77 tokens per second) print_timings: total time = 17107.53 ms

when I use 3 nvidia 4090 the tokens by seconds are near 4 per second CUDA_VISIBLE_DEVICES="0, 1, 2" print_timings: prompt eval time = 5154.32 ms / 85 tokens ( 60.64 ms per token, 16.49 tokens per second) print_timings: eval time = 4598.41 ms / 16 runs ( 287.40 ms per token, 3.48 tokens per second) print_timings: total time = 9752.73 ms

As you can see, the results are divided almost by 4 with each additional card I have, I am currently working with Windows 11, I don't know if the operating system has anything to do with it or some type of configuration or parameter with which I should compile llamacpp, thanks for the help.

williamgomez71 commented 10 months ago

@abetlen I compiled from master on 2 January, do you recommend using another branch and test with this other? if this please let me know and Ill compile this other and test again

Im compiling with this flag -DLLAMA_CUBLAS=ON no more. git clone https://github.com/ggerganov/llama.cpp cd llama.cpp mkdir build cd build cmake .. -DLLAMA_CUBLAS=ON cmake --build . --config Release

cognitivetech commented 10 months ago

getting 10x slower performance than I did 2 weeks ago.

I have tried many different versions from 0.2.18 and up...

something tells me this problem is not due to llama-cpp itself... because I know those versions were working for me previously, now most of them arent and the ones that are are just very slow.

I am not alone but have a small team examining. whyyy its so slow, now

abetlen / llama-cpp-python

Multiple GPU incredibly slow inference #1026

Prerequisites

Expected Behavior

Current Behavior

Environment and Context