ggerganov / llama.cpp

LLM inference in C/C++
MIT License

I am running a two-socket server and the CPU usage is at 50% #7812

Closed. superLiben closed this issue 2 months ago

superLiben commented 4 months ago

What happened?

I am running GMME 7B and the CPU usage sits at 50%. How can I increase it to 100%? I want to see how many tokens per second the CPU can deliver at its maximum clock speed.

Name and Version

[root@localhost llama.cpp]# ./main --version
version: 3104 (a5cabd76)
built with cc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4) for x86_64-redhat-linux

What operating system are you seeing the problem on?

No response

Relevant log output

No response

superLiben commented 4 months ago

I am using an Ampere(R) Altra(R) Max Processor M128-30 CPU @ 3.0GHz. On a single-socket server it is capable of running all the CPUs. Does llama.cpp not support cross-socket?

Rotatingxenomorph commented 4 months ago

On CPU, memory bandwidth limits tokens/s. In most cases, 100% CPU usage during inference would mean something is wrong, and it would probably give worse tokens/s. A good starting point is to set -t to one less than the number of physical cores and test from there.

Although someone did find that they got better performance on a dual cpu system when turning off hyperthreading, which is weird. https://www.reddit.com/r/LocalLLaMA/comments/1cl278t/if_you_are_using_cpu_this_one_simple_trick_will/
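A minimal sketch of that starting point, assuming a 16-core CPU; the model path, prompt, and token count below are placeholders, not from this thread:

```sh
# Start at physical cores - 1 (15 on a 16-core part) and benchmark up/down from there.
./main -m ./models/model-q4_0.gguf -t 15 -p "Hello" -n 128
```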

jrichey98 commented 3 months ago

> Does llama.cpp not support cross-socket?

It supports cross-socket just fine. I run it on E5-2667v2 CPUs; they are memory-bandwidth limited, not CPU limited (8-channel DDR3-1866).

> On CPU, memory bandwidth limits tokens/s. In most cases, 100% CPU usage during inference would mean something is wrong, and it would probably give worse tokens/s. A good starting point is to set -t to one less than the number of physical cores and test from there.

I have run llama.cpp on a 5950X and on a 2x E5-2667v2 system (about 2.5x the performance of the 5950X). I've found that it's usually memory-bandwidth limited (2 channels x 2666 MT/s = 5332 vs 8 channels x 1866 MT/s = 14928).

You just have to test and see how many tokens/s you get with each setting on your system. But the bottleneck likely isn't the CPU, it's memory bandwidth.
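As a rough back-of-the-envelope sketch (the ~5 GB quantized model size is an assumption, not a figure from this thread): peak bandwidth is roughly channels x transfer rate x 8 bytes per transfer, and each generated token has to stream the whole weight file from RAM, so tokens/s is capped at about bandwidth divided by model size:

```sh
awk 'BEGIN {
    bw_5950x = 2 * 2666e6 * 8 / 1e9   # dual-channel DDR4-2666 -> ~42.7 GB/s
    bw_e5    = 8 * 1866e6 * 8 / 1e9   # 8-channel DDR3-1866    -> ~119.4 GB/s
    model_gb = 5                      # assumed ~5 GB quantized 7B model
    printf "5950X ceiling:        ~%.0f tokens/s\n", bw_5950x / model_gb
    printf "2x E5-2667v2 ceiling: ~%.0f tokens/s\n", bw_e5 / model_gb
}'
```

That ~2.8x bandwidth ratio lines up with the roughly 2.5x performance gap reported above.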

Make sure to specify the optimal number of threads, and I know that at least on dual-socket x86 systems the --numa option is very helpful.
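A hypothetical invocation on a dual-socket box; the model path, prompt, and token count are placeholders, the thread count is just an example for a 2x E5-2667v2 (16 physical cores), and on builds from around this version --numa takes a mode such as distribute, while older builds treat it as a plain flag:

```sh
# Spread threads and allocations across both NUMA nodes instead of
# letting everything land on one socket's memory.
./main -m ./models/model-q4_0.gguf -t 15 --numa distribute -p "Hello" -n 128
```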

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.