BarfingLemurs closed this issue 9 months ago
I found this does work on other systems, like a Raspberry Pi. Closing, as this appears to be an environment issue.
However, on mobile:

- batch size 1 = 4 t/s
- batch size 2 = 4 t/s
- batch size 4 = 4 t/s
My understanding is that when this happens, it means there is not enough compute to saturate the memory bandwidth. Here are some more results with AWS instances that behave similarly:
https://github.com/ggerganov/llama.cpp/issues/3478#issuecomment-1829391579
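To illustrate why total t/s can stay flat as batch size grows, here is a back-of-the-envelope roofline sketch. All numbers (model size, parameter count, memory bandwidth, peak FLOPS) are assumptions for illustration, not measurements of any specific device:

```python
def total_tps(batch, model_bytes, params, bw_bytes_s, flops_s):
    """Rough upper bound on total tokens/sec for `batch` parallel sequences.

    Per decode step, the weights must stream from memory once (serving the
    whole batch), while compute cost grows with the batch: roughly 2 FLOPs
    per weight per sequence. Whichever takes longer limits throughput.
    """
    t_mem = model_bytes / bw_bytes_s        # weight streaming, batch-independent
    t_cmp = 2.0 * params * batch / flops_s  # matmul work, scales with batch
    return batch / max(t_mem, t_cmp)

# Assumed slow CPU (60 GFLOPS): already compute-bound at batch 1,
# so doubling the batch does not raise total t/s.
slow_b1 = total_tps(1, 4e9, 7e9, 30e9, 60e9)
slow_b2 = total_tps(2, 4e9, 7e9, 30e9, 60e9)

# Assumed fast CPU (600 GFLOPS): memory-bound, so batching roughly
# doubles total throughput.
fast_b1 = total_tps(1, 4e9, 7e9, 30e9, 600e9)
fast_b2 = total_tps(2, 4e9, 7e9, 30e9, 600e9)
```

In the compute-bound case the extra batch just makes each step proportionally slower, which matches the flat 4 t/s observed above; only when compute is fast enough to saturate memory bandwidth does batching pay off.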
Thank you for the notice!
@ggerganov On further testing, I was able to see some gain with smaller models on Android, through Termux:
- batch size 1 = 14.5 t/s, batch size 2 = 16.9 t/s
- batch size 1 = 8.1 t/s, batch size 2 = 9.53 t/s
- batch size 1 = 6.9 t/s, batch size 2 = 7.2 t/s
With a Raspberry Pi 400, I can double the total tokens (4 t/s -> 8 t/s) with TinyLlama Q4_K_M.
In both cases the CPU is at its limit; a batch size of 3 or 4 did not improve anything further.
I was thinking the chip on the Pixel 6 would have greater compute. Maybe the model runs too fast to do anything in parallel.
> I was thinking the chip on Pixel 6 would have greater compute
@BarfingLemurs The Pixel 6 Pro CPU is a heterogeneous system with three types of cores: 2 ultra-fast Cortex-X1, 2 Cortex-A76, and 4 slow but low-power Cortex-A55. The Raspberry Pis all have homogeneous CPUs, so most likely the difference you are observing is due to some of the cores in the Pixel waiting on the slow A55s. BLAS won't be used for batch sizes smaller than 32, so the processing will all be done in llama.cpp directly. Therefore, you can try tuning the thread count; if you set the threads-batch parameter to 2, you may see greater speedups.
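On a big.LITTLE CPU like this, the thread counts worth benchmarking are the cluster-boundary totals: going one thread past a fast cluster drags in a slower core that the fast ones must wait on at synchronization points. A small sketch of this idea (the helper and core topology below are illustrative, not part of llama.cpp):

```python
def candidate_thread_counts(clusters):
    """Thread counts worth benchmarking on a heterogeneous CPU.

    clusters: (name, core_count) pairs ordered fastest to slowest.
    Returns the cumulative core counts at each cluster boundary, since
    any thread count in between mixes clusters and stalls on the slowest.
    """
    counts, total = [], 0
    for _name, n in clusters:
        total += n
        counts.append(total)
    return counts

# Assumed Pixel 6 Pro (Google Tensor) topology, fastest cluster first:
pixel6 = [("cortex-x1", 2), ("cortex-a76", 2), ("cortex-a55", 4)]
candidates = candidate_thread_counts(pixel6)  # thread counts to try with -t
```

One could then benchmark the parallel example with each candidate `-t` value and keep the fastest.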
@AutonomicPerfectionist
I get worse speeds with -t 2; 4 is still best for my device.

```
./parallel -m ~/tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf -ns 2 -np 2 -p "what is a llama?" -t 4 -n 30
```
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
Motivation
The batched and parallel examples do not perform as expected. The examples normally demonstrate that t/s scales with batch size.
Possible Implementation
Maybe this is already implemented but not working in the environment I tested.