ggerganov / llama.cpp

LLM inference in C/C++
MIT License

[Feature Request] parallel decoding on mobile #4064

Closed BarfingLemurs closed 9 months ago

BarfingLemurs commented 10 months ago


Feature Description

Motivation

The batched and parallel examples do not perform as expected. These examples normally demonstrate that tokens/s scales with the batch size:

On an x86 CPU:
batch size 1 = 4 t/s
batch size 2 = 8 t/s
batch size 4 = 16 t/s

However:

On mobile:
batch size 1 = 4 t/s
batch size 2 = 4 t/s
batch size 4 = 4 t/s
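
(For reference, these numbers are the kind of throughput the parallel example reports when the number of sequences is varied. A sketch of how such a comparison might be run, with the model path and prompt as placeholders and the flags borrowed from the command shown later in this thread; as I understand them, -np sets how many sequences are decoded in parallel per batch and -ns the total number of sequences:

./parallel -m model.gguf -ns 1 -np 1 -p "what is a llama?" -n 30
./parallel -m model.gguf -ns 2 -np 2 -p "what is a llama?" -n 30
./parallel -m model.gguf -ns 4 -np 4 -p "what is a llama?" -n 30)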

Possible Implementation

Maybe this is already implemented, but it is not working in the environment I tested.

BarfingLemurs commented 9 months ago

I found this does work on other systems, like a Raspberry Pi. Closing, as this appears to be an environment issue.

ggerganov commented 9 months ago

However: On mobile:
batch size 1 = 4 t/s
batch size 2 = 4 t/s
batch size 4 = 4 t/s

My understanding is that when this happens, it means there is not enough compute to saturate the memory bandwidth. Here are some more results with AWS instances that behave similarly:

https://github.com/ggerganov/llama.cpp/issues/3478#issuecomment-1829391579
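
(A rough illustration of the bandwidth argument, with made-up numbers: a ~4 GB quantized model streamed at ~40 GB/s of effective memory bandwidth caps single-sequence decoding at roughly 40 / 4 = 10 t/s, because every generated token requires reading all of the weights once. With batch size 2, the same pass over the weights produces two tokens, so aggregate throughput can approach 20 t/s, but only if there is spare compute to process both tokens during that pass; once compute is the bottleneck, adding sequences no longer helps, which matches the flat 4 t/s numbers above.)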

BarfingLemurs commented 9 months ago

Thank you for the notice!

@ggerganov On further testing, I was able to see some gains with smaller models on Android, through Termux.

Pixel 6 Pro

Q4_K_M tinyllama:
batch size 2 = 16.9 t/s
batch size 1 = 14.5 t/s

Q4_0 tinyllama:
batch size 2 = 9.53 t/s
batch size 1 = 8.1 t/s

f16 tinyllama (the difference is less noticeable here):
batch size 2 = 7.2 t/s
batch size 1 = 6.9 t/s

With a Raspberry Pi 400, I can double the total throughput (4 t/s -> 8 t/s) with TinyLlama Q4_K_M.

In both cases the CPU is at its limit; a batch size of 3 or 4 did not improve anything further.

I was thinking the chip on the Pixel 6 would have greater compute. Maybe the model already runs too fast for the parallel decoding to help.

AutonomicPerfectionist commented 9 months ago

I was thinking the chip on Pixel 6 would have greater compute

@BarfingLemurs the Pixel 6 Pro CPU is a heterogeneous system with three types of cores: 2 ultra-fast Cortex-X1, 2 Cortex-A76, and 4 slow but low-power Cortex-A55. The Raspberry Pis all have homogeneous CPUs, so most likely the difference you are observing is due to some of the cores in the Pixel waiting on the slow A55s. BLAS won't be used for batch sizes smaller than 32, so the processing will all be done in llama.cpp directly. Therefore, you can try tuning the thread count; if you set the threads-batch parameter to 2, you may see greater speedups.
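
(A possible invocation along those lines, sketched from the command shown below in this thread; the -tb / --threads-batch flag name is assumed from llama.cpp's common CLI options and may differ between versions:

./parallel -m ~/tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf -ns 2 -np 2 -p "what is a llama?" -t 2 -tb 2 -n 30)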

BarfingLemurs commented 9 months ago

@AutonomicPerfectionist

I get worse speeds with -t 2; -t 4 is still best for my device:

./parallel -m ~/tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf -ns 2 -np 2 -p "what is a llama?" -t 4 -n 30