abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Performance degradation running with half a socket in CPU system #1098

Open azhuvath opened 10 months ago

azhuvath commented 10 months ago

Given a small context, a paragraph of fewer than 100 words, we are trying to answer a query. There are 5 such queries, and the overall time taken is recorded. The experiment is run on the full system, on a full socket (of a dual-socket machine), and on half a socket (CPUs only, no accelerators). We are seeing strange behavior: performance degrades considerably when scaling down from a full socket to half a socket.

I conducted the same experiment using Intel Extension for PyTorch (IPEX), but there I do not see the performance degradation when moving from a full socket to half a socket. Attaching graphs to illustrate the behavior.

Note: I could not run the IPEX experiment on the 9480 due to some local issues. All timings are averages of 5 runs.

System Details
8380 - Intel® Xeon® Platinum 8380 Processor
8480 - Intel® Xeon® Platinum 8480+ Processor
9480 - Intel® Xeon® CPU Max 9480 Processor

Performance observed using llama.cpp on three different systems (8380, 8480, & 9480) image

Performance observed using IPEX on two different systems (8380, 8480) image

You can see from the graphs above that moving from a full socket to half a socket has a huge impact with llama.cpp, whereas it has much less impact with IPEX. Any ideas why this happens with llama.cpp but not with IPEX?

I captured system details while executing the experiments using the VTune Application Performance Snapshot (APS) tool. The elapsed time and the graph times differ because elapsed time includes model loading and other activities. Attaching APS snapshots.

Full System image

Full Socket image

Half Socket image

Not sure why the DRAM bandwidth drops considerably in the half-socket case for llama.cpp. This behavior is not seen with IPEX; the graph clearly shows that the impact of moving from a full socket to half a socket is very gradual with IPEX, but not with llama.cpp.

abetlen commented 9 months ago

@azhuvath are you experiencing the same performance issue with llama.cpp standalone?

azhuvath commented 9 months ago

> @azhuvath are you experiencing the same performance issue with llama.cpp standalone?

@abetlen I am using it through `pip install llama-cpp-python`.

What steps should I follow?
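For context, testing llama.cpp standalone means benchmarking the upstream C++ CLI directly, independent of the Python bindings. A minimal sketch (assumptions: a Linux machine with build tools; the model path is a placeholder, and the binary name varies by llama.cpp version):

```shell
# Clone and build upstream llama.cpp (CPU-only build)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Run the same prompt with an explicit thread count
# (placeholder model path; older versions ship the binary as ./main,
# newer ones as ./llama-cli)
./main -m /path/to/model.gguf -t 56 -p "your query here"
```

If the full-socket vs. half-socket gap reproduces here, the issue lies in llama.cpp itself rather than in the Python bindings.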

azhuvath commented 9 months ago

I tried setting OMP_NUM_THREADS to see if thread oversubscription could be reduced, but it did not help much. I then set the n_threads parameter, which improved performance. However, the degradation from full-socket cores to half a socket is still not smooth. I do not see this problem with IPEX, OpenVINO, etc.
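For reference, a minimal sketch of pinning the thread count via the `n_threads` parameter of llama-cpp-python's `Llama` constructor (assumptions: a Xeon 8480+ socket with 56 physical cores; the model path is a placeholder):

```python
# One thread per physical core on a single socket (assumption: 56-core
# Xeon 8480+); the default thread count can oversubscribe or spill across
# NUMA nodes, so set it explicitly.
physical_cores_per_socket = 56
n_threads = physical_cores_per_socket

# Hypothetical usage (model path is a placeholder):
# from llama_cpp import Llama
# llm = Llama(model_path="/path/to/model.gguf", n_threads=n_threads)

print(n_threads)
```

For a half-socket run, halve `n_threads` and pin the process to one NUMA node so memory stays local.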

Config1 - 112 Cores/Stream - 1 Stream
Config2 - 56 Cores/Stream - 2 Concurrent Streams - NUMA pinned
Config3 - 28 Cores/Stream - 4 Concurrent Streams - NUMA pinned
Config4 - 14 Cores/Stream - 8 Concurrent Streams - NUMA pinned
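The NUMA-pinned configurations above can be sketched with `numactl` (assumptions: a dual-socket machine with 56 physical cores per socket, llama.cpp's CLI binary, and a placeholder model path):

```shell
# Config2 sketch: two concurrent streams, 56 cores each, one per NUMA node.
# --cpunodebind pins threads to a socket; --membind keeps DRAM local to it.
OMP_NUM_THREADS=56 numactl --cpunodebind=0 --membind=0 \
    ./main -m /path/to/model.gguf -t 56 -p "query" &
OMP_NUM_THREADS=56 numactl --cpunodebind=1 --membind=1 \
    ./main -m /path/to/model.gguf -t 56 -p "query" &
wait
```

Without `--membind`, a half-socket run can still allocate pages on the remote node, which would show up as the reduced DRAM bandwidth seen in the APS snapshots.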