ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Performance decreased between tag b1500 and b2581 on Windows ARM64 PC #6417

Closed: Billzhong2022 closed this issue 1 week ago

Billzhong2022 commented 3 months ago

Hi LLAMA team,

I am using llama.cpp tag b2581 on a Windows ARM64 PC, and the performance is much lower than with the previous tag b1500. Please refer to the detailed information below. What is the reason? Please help with this issue.

Thanks a lot!

[Detailed information]

Command: main.exe -m llama-2-7b-chat.ggufv3.q4_0.bin --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 10

Prompt: I have 3 years of experience as a software developer. Now I got bored with coding and want to transition to another career. My education qualifications are B. Tech in computer science, and I am well-versed in understanding the business side of software as well. Suggest a list of career options that are easy for me to transition.

system_info: n_threads = 10 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |

Tag b1500 results:

llama_print_timings:        load time =     723.53 ms
llama_print_timings:      sample time =     925.29 ms /   624 runs   (    1.48 ms per token,   674.38 tokens per second)
llama_print_timings: prompt eval time =    2583.12 ms /    91 tokens (   28.39 ms per token,    35.23 tokens per second)
llama_print_timings:        eval time =   31693.17 ms /   625 runs   (   50.71 ms per token,    19.72 tokens per second)
llama_print_timings:       total time =   51797.58 ms

Tag b2581 results:

llama_print_timings:        load time =     963.25 ms
llama_print_timings:      sample time =     416.14 ms /   586 runs   (    0.71 ms per token,  1408.17 tokens per second)
llama_print_timings: prompt eval time =   11847.94 ms /    94 tokens (  126.04 ms per token,     7.93 tokens per second)
llama_print_timings:        eval time =   68542.50 ms /   585 runs   (  117.17 ms per token,     8.53 tokens per second)
llama_print_timings:       total time =   82696.57 ms /   679 tokens
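(Derived from the two logs above: prompt eval dropped from 35.23 to 7.93 tokens per second, roughly a 4.4x slowdown, and eval dropped from 19.72 to 8.53 tokens per second, roughly 2.3x.)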

zhanweiw commented 1 month ago

Well, the platform may not provide much fp32 arithmetic power. How about 8+4, or 8, on the BLAS library? We'd better use llama-bench.exe to get some more detailed results.

But why, without OpenBLAS, do we get good performance (32 tokens/s) for 'prompt eval time'?
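(A llama-bench run along the lines suggested above might look like the sketch below; the model file is taken from the original command, while the -p/-n sizes and the thread-count list are illustrative assumptions, not values from this thread.)

```
llama-bench.exe -m llama-2-7b-chat.ggufv3.q4_0.bin -p 128 -n 128 -t 4,8,10
```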

ReinForce-II commented 1 month ago

> Well, the platform may not provide much fp32 arithmetic power. How about 8+4, or 8, on the BLAS library? We'd better use llama-bench.exe to get some more detailed results.
>
> But why, without OpenBLAS, do we get good performance (32 tokens/s) for 'prompt eval time'?

Without OpenBLAS, you are running the dot products as quantized integer operations, not in fp32.
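(To make "dot products as quantized integer operations" concrete: below is a minimal scalar sketch modeled on ggml's Q4_0 x Q8_0 dot product, ggml_vec_dot_q4_0_q8_0. The block layout mirrors ggml's block_q4_0/block_q8_0, but the fp16 block scales are simplified to plain float and the real ARM64 build uses NEON intrinsics, so treat it as illustrative rather than the shipped implementation.)

```c
#include <stdint.h>

#define QK 32                   // weights per quantization block

typedef struct {
    float   d;                  // per-block scale (fp16 in ggml; float here for brevity)
    uint8_t qs[QK / 2];         // 32 4-bit weights, two per byte, stored with a +8 offset
} block_q4_0;

typedef struct {
    float  d;                   // per-block scale
    int8_t qs[QK];              // 32 activations quantized to int8
} block_q8_0;

// The inner multiply-accumulate runs entirely on small integers; fp32
// appears only once per 32-element block, to apply the two scales.
static float vec_dot_q4_0_q8_0(int nblocks, const block_q4_0 *x, const block_q8_0 *y) {
    float sumf = 0.0f;
    for (int i = 0; i < nblocks; i++) {
        int32_t sumi = 0;
        for (int j = 0; j < QK / 2; j++) {
            const int v0 = (x[i].qs[j] & 0x0F) - 8;   // low nibble  -> [-8, 7]
            const int v1 = (x[i].qs[j] >> 4)   - 8;   // high nibble -> [-8, 7]
            sumi += v0 * y[i].qs[j] + v1 * y[i].qs[j + QK / 2];
        }
        sumf += x[i].d * y[i].d * (float) sumi;       // one fp32 scaling per block
    }
    return sumf;
}
```

The point is that almost all of the multiply-accumulate work stays in integer arithmetic, which the NEON path handles well; routing the same matrix multiplications through an fp32 BLAS instead means dequantizing the weights and then leaning on fp32 throughput that this platform may lack.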

zhanweiw commented 1 month ago

Got it. Thanks so much!

github-actions[bot] commented 1 week ago

This issue was closed because it has been inactive for 14 days since being marked as stale.