azhuvath opened this issue 10 months ago
@azhuvath are you experiencing the same performance issue with llama.cpp standalone?
@abetlen I am using it through `pip install llama-cpp-python`.
What steps should I follow?
I tried setting OMP_NUM_THREADS to see if thread oversubscription could be reduced, but it did not help much. Then I set the `n_threads` parameter, which improved performance. Even so, the degradation from full-socket to half-socket cores is still not smooth. I do not see this problem with IPEX, OpenVINO, etc.
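For reference, the thread settings above look roughly like this. A minimal sketch, assuming a 2-way SMT machine; the `Llama` constructor line is commented out so the snippet stands alone without the package or a model file, and `model.gguf` is a placeholder path:

```python
import os

# Physical-core count; assumes 2-way SMT (hyperthreading). Adjust for your system.
physical_cores = max(1, os.cpu_count() // 2)

# Cap OpenMP threads to avoid oversubscription from nested parallel regions.
os.environ["OMP_NUM_THREADS"] = str(physical_cores)

# llama-cpp-python takes the thread count directly via n_threads:
# from llama_cpp import Llama
# llm = Llama(model_path="model.gguf", n_threads=physical_cores)
print(physical_cores)
```

Setting `n_threads` explicitly mattered more than the environment variable in my runs.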
Config1 - 112 Cores/Stream - 1 Stream
Config2 - 56 Cores/Stream - 2 Concurrent Streams - NUMA pinned
Config3 - 28 Cores/Stream - 4 Concurrent Streams - NUMA pinned
Config4 - 14 Cores/Stream - 8 Concurrent Streams - NUMA pinned
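Each config splits the same 112 physical cores (assuming a dual-socket 8480, 2 x 56 cores) evenly across concurrent inference streams; the sweep is just:

```python
# Total physical cores; 112 assumes a dual-socket Xeon 8480 (2 x 56).
TOTAL_CORES = 112

# (cores per stream, concurrent streams) for each config in the sweep.
configs = [(TOTAL_CORES // streams, streams) for streams in (1, 2, 4, 8)]
for cores, streams in configs:
    print(f"{cores} cores/stream x {streams} stream(s)")
```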
Given a small context (a paragraph of fewer than 100 words), we are trying to answer a query. There are 5 such queries, and the overall time taken is recorded. The experiment is run on the full system, a full socket (in a dual-socket machine), and half a socket (CPU only, no accelerators). We see a strange behavior in which performance degrades considerably when scaling down from a full socket to half a socket.
I conducted the same experiment using Intel Extension for PyTorch (IPEX), but I do not see the performance degradation when moving from a full socket to half a socket. Attaching the graphs to illustrate the strange behavior.
Note: I could not do the IPEX experiment on the 9480 due to some local issues. All timings are averages of 5 runs.
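The timing harness is nothing special; a sketch of how the 5-run average is taken, where `answer_query` is a placeholder for however a single query is actually issued to the model:

```python
import time

def average_wall_time(answer_query, queries, runs=5):
    """Average wall-clock seconds to answer all queries, over `runs` passes."""
    totals = []
    for _ in range(runs):
        start = time.perf_counter()
        for q in queries:
            answer_query(q)  # e.g. one generation call per query
        totals.append(time.perf_counter() - start)
    return sum(totals) / len(totals)
```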
System Details
8380 - Intel® Xeon® Platinum 8380 Processor
8480 - Intel® Xeon® Platinum 8480+ Processor
9480 - Intel® Xeon® CPU Max 9480 Processor
Performance observed using Llama CPP with three different systems (8380, 8480, & 9480)
Performance observed using IPEX with two different systems (8380, 8480)
You can see from the graphs above that moving from a full socket to half a socket has a huge impact with Llama CPP, whereas it has much less impact with IPEX. Any ideas why this happens with Llama CPP and not with IPEX?
I captured system details while executing the experiments using the VTune Application Performance Snapshot (APS) tool. The elapsed times differ from the graph times because elapsed time includes model loading and other activities. Attaching the APS snapshots.
Full System
Full Socket
Half Socket
I am not sure why the DRAM bandwidth is considerably reduced at half socket for Llama CPP. This behavior is not seen with IPEX: the graphs clearly show that the impact of moving from full socket to half socket is very gradual with IPEX, but not with Llama CPP.