sadaisystems opened this issue 11 months ago
I have the same issue with 2 x RTX 3090. It looks like the main difference between 1 and 2 GPUs is the prompt eval speed:
0.2.22:
2 GPUs:
llama_print_timings: load time = 57265.05 ms
llama_print_timings: sample time = 146.24 ms / 898 runs ( 0.16 ms per token, 6140.42 tokens per second)
llama_print_timings: prompt eval time = 332345.40 ms / 2602 tokens ( 127.73 ms per token, 7.83 tokens per second)
llama_print_timings: eval time = 382931.09 ms / 897 runs ( 426.90 ms per token, 2.34 tokens per second)
llama_print_timings: total time = 716875.28 ms
1 GPU:
llama_print_timings: load time = 7602.43 ms
llama_print_timings: sample time = 121.12 ms / 747 runs ( 0.16 ms per token, 6167.28 tokens per second)
llama_print_timings: prompt eval time = 63531.33 ms / 2602 tokens ( 24.42 ms per token, 40.96 tokens per second)
llama_print_timings: eval time = 445772.00 ms / 746 runs ( 597.55 ms per token, 1.67 tokens per second)
llama_print_timings: total time = 510564.07 ms
I also tried with the latest version; the KV cache made a big difference, but prompt eval speed is still much slower with 2 GPUs (see the sketch after these timings for how I'm enabling it).
0.2.24, KV cache on:
2 GPUs:
llama_print_timings: load time = 58292.49 ms
llama_print_timings: sample time = 119.42 ms / 753 runs ( 0.16 ms per token, 6305.37 tokens per second)
llama_print_timings: prompt eval time = 296752.33 ms / 2602 tokens ( 114.05 ms per token, 8.77 tokens per second)
llama_print_timings: eval time = 110105.82 ms / 752 runs ( 146.42 ms per token, 6.83 tokens per second)
llama_print_timings: total time = 408040.80 ms
1 GPU:
llama_print_timings: load time = 5934.27 ms
llama_print_timings: sample time = 152.08 ms / 958 runs ( 0.16 ms per token, 6299.19 tokens per second)
llama_print_timings: prompt eval time = 41548.25 ms / 2602 tokens ( 15.97 ms per token, 62.63 tokens per second)
llama_print_timings: eval time = 550712.32 ms / 957 runs ( 575.46 ms per token, 1.74 tokens per second)
llama_print_timings: total time = 593970.61 ms
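For reference, this is roughly how I'm enabling the KV cache above. I'm assuming the offload_kqv parameter exposed by recent llama-cpp-python builds is what controls it, and the model path and context size are only illustrative:

from llama_cpp import Llama

# Illustrative settings for the 0.2.24 runs above; the path is a placeholder.
llm = Llama(
    model_path="./70B-Q4_K_M.gguf",
    n_ctx=4096,          # enough room for the ~2600-token prompt
    n_gpu_layers=-1,     # offload all layers to the GPU(s)
    offload_kqv=True,    # keep the KV cache on the GPU ("KV cache on")
)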
@zpin is it different from the llama.cpp inference speed when you build it from master?
I just built from main; here are the results:
1 GPU:
llama_print_timings: load time = 5746.74 ms
llama_print_timings: sample time = 153.75 ms / 967 runs ( 0.16 ms per token, 6289.47 tokens per second)
llama_print_timings: prompt eval time = 40276.57 ms / 2602 tokens ( 15.48 ms per token, 64.60 tokens per second)
llama_print_timings: eval time = 551080.36 ms / 966 runs ( 570.48 ms per token, 1.75 tokens per second)
llama_print_timings: total time = 593127.22 ms
2 GPUs:
llama_print_timings: load time = 57944.34 ms
llama_print_timings: sample time = 175.91 ms / 1100 runs ( 0.16 ms per token, 6253.38 tokens per second)
llama_print_timings: prompt eval time = 295189.21 ms / 2602 tokens ( 113.45 ms per token, 8.81 tokens per second)
llama_print_timings: eval time = 159902.95 ms / 1099 runs ( 145.50 ms per token, 6.87 tokens per second)
llama_print_timings: total time = 456985.76 ms
This is with a 70B_Q4_K_M.gguf model. I've tried the same model as a 70b-5.0bpw-h6-exl2 quant with exllamav2 on the same two cards; it's much faster and starts generating immediately: Output generated in 57.99 seconds (14.59 tokens/s, 846 tokens, context 2603, seed 1467096981)
I have the same issue. When I use only 1 NVIDIA 4090 (CUDA_VISIBLE_DEVICES="0") I achieve about 54 tokens per second:
print_timings: prompt eval time = 176.65 ms / 58 tokens ( 3.05 ms per token, 328.33 tokens per second)
print_timings: eval time = 4020.70 ms / 218 runs ( 18.44 ms per token, 54.22 tokens per second)
print_timings: total time = 4197.36 ms
When I use 2 NVIDIA 4090s (CUDA_VISIBLE_DEVICES="0, 1") I get close to 20 tokens per second:
print_timings: prompt eval time = 672.33 ms / 58 tokens ( 11.59 ms per token, 86.27 tokens per second)
print_timings: eval time = 16435.21 ms / 325 runs ( 50.57 ms per token, 19.77 tokens per second)
print_timings: total time = 17107.53 ms
When I use 3 NVIDIA 4090s (CUDA_VISIBLE_DEVICES="0, 1, 2") I get close to 4 tokens per second:
print_timings: prompt eval time = 5154.32 ms / 85 tokens ( 60.64 ms per token, 16.49 tokens per second)
print_timings: eval time = 4598.41 ms / 16 runs ( 287.40 ms per token, 3.48 tokens per second)
print_timings: total time = 9752.73 ms
As you can see, throughput drops by roughly a factor of 4 with each additional card. I'm currently on Windows 11; I don't know whether the operating system has anything to do with it, or whether there is some configuration or compile parameter I should use when building llama.cpp. Thanks for the help.
@abetlen I compiled from master on 2 January. Do you recommend testing with another branch? If so, please let me know and I'll compile it and test again.
I'm compiling with only the -DLLAMA_CUBLAS=ON flag, nothing else:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
I'm getting 10x slower performance than I did 2 weeks ago.
I have tried many different versions from 0.2.18 and up.
Something tells me this problem is not due to llama-cpp itself, because I know those versions were working for me previously; now most of them aren't, and the ones that do work are very slow.
I'm not alone; a small team is looking into why it's so slow now.
Expected Behavior
Same or comparable inference speed on a single A100 as on a 2x A100 setup.
Current Behavior
GPU inference is 30-60x slower when both GPUs are available to the inference process, compared to a single-GPU run.
The best workaround I found is to manually hide the second GPU using CUDA_VISIBLE_DEVICES="0".
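For completeness, this is roughly how I apply the workaround from Python; the variable is set before llama_cpp is imported so that only the first A100 is visible (a minimal sketch, not the exact script I run):

import os

# Hide the second A100 before llama_cpp initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from llama_cpp import Llama  # imported after the environment variable is set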
The model is initialized with main_gpu=0 and tensor_split=None. In addition, when both GPUs are visible, the tensor_split option doesn't work as expected: nvidia-smi shows that both GPUs are being used.
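For reference, the initialization looks roughly like this (the model path and the explicit split values are illustrative); even when tensor_split is set to put everything on the first GPU, nvidia-smi still shows both GPUs in use:

from llama_cpp import Llama

# Illustrative values; the intent is to pin all work to GPU 0.
llm = Llama(
    model_path="./model.gguf",   # placeholder path
    n_gpu_layers=-1,             # offload all layers
    main_gpu=0,                  # GPU used for scratch and small tensors
    tensor_split=[1.0, 0.0],     # expected: everything on GPU 0, nothing on GPU 1
)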
Environment and Context
2x A100 GPU server, CUDA 12.1; evaluated llama-cpp-python versions 2.11, 2.13, and 2.19 with the cuBLAS backend.