Right now there is no way to use OpenGL or Vulkan in llama.cpp.
Understood. Thank you!
What is the theoretical performance achievable on a state-of-the-art mobile SoC like the Exynos 2200 or Snapdragon 8 Gen, utilizing all resources (i.e. CPU, GPU, DSP), assuming sufficient LPDDR5 memory is available? ~1.5 t/s is currently reported on a Poco F3 or S22; is a 4x speedup possible for a 7B model?
With 7B models on OpenBLAS, a prompt eval time around 250 ms per token and an eval time around 330 ms per token is typical for my device (~3 t/s), so I figure the devices you mentioned are faster if properly configured.
It's difficult to guess what's possible with a fully supported GPU since it's theoretical; maybe 5 t/s. It could be more, like 10 t/s, but I'm just guessing.
edit: the new t/s print is nice:
llama_print_timings: load time = 859.46 ms
llama_print_timings: sample time = 1254.50 ms / 535 runs ( 2.34 ms per token, 426.46 tokens per second)
llama_print_timings: prompt eval time = 106083.06 ms / 466 tokens ( 227.65 ms per token, 4.39 tokens per second)
llama_print_timings: eval time = 169735.84 ms / 537 runs ( 316.08 ms per token, 3.16 tokens per second)
llama_print_timings: total time = 831259.84 ms
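(For reference, the t/s figures are just the inverse of the per-token times: 1000 ms / 316.08 ms per token ≈ 3.16 tokens per second.)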
What is the theoretical performance achievable on a state-of-the-art mobile SoC like the Exynos 2200 or Snapdragon 8 Gen, utilizing all resources (i.e. CPU, GPU, DSP), assuming sufficient LPDDR5 memory is available?
This is impossible to answer precisely. I guess you could estimate something from the known FLOPS and memory-bandwidth characteristics, but llama.cpp cannot fully use the GPU and CPU at the same time, and it works best when using a single type of performance core.
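As a rough sketch of such an estimate (my assumptions, not measurements): single-token generation is usually memory-bandwidth bound, since every weight has to be read once per token, so bandwidth divided by model size gives a hard ceiling:

    ceiling ≈ bandwidth / model size
            ≈ 51.2 GB/s (LPDDR5-6400, 64-bit bus) / 3.8 GB (7B q4_0)
            ≈ 13 t/s

Real numbers land well below that ceiling, but it means a 3-4x speedup over 1.5 t/s is at least not ruled out by the memory.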
For example, on my Pinebook Pro (RK3399) I tested a 3B model today and it gets almost the same speed whether I use the 4 Cortex-A53 cores or the 2 Cortex-A72 cores, but if I try to use all of them it is much slower. So by sticking to the performance cores, most of the CPU cores are not even used.
I have an SBC with the newer RK3588S as well, and it can generate with a 7B model on its four Cortex-A76 cores at around 3.3 t/s. Using the four Cortex-A55 cores it is 0.8 t/s; using all eight cores, 1.3 t/s.
These newer SoCs like the Exynos 2200 have three types of cores, so I'm not sure which ones should be used for best performance. We won't know until someone tests it.
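If anyone wants to try, here is a sketch of how to pin llama.cpp to a single cluster on Linux/Termux (the core IDs, model path, and prompt are placeholders, not known-good values; check the real core layout with lscpu, and taskset comes from util-linux):

    # bind the process to the big-core cluster (assumed here to be CPUs 4-7)
    # and match --threads to the number of pinned cores
    taskset -c 4-7 ./main -m models/7B/ggml-model-q4_0.bin -t 4 -p "Hello"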
It appears llama.cpp has no limit, and makes no estimate of the hardware of the system it's installed on. I'm not complaining, it is what it is. In this way, it's powerful so long as one narrows the parameters for the specific device/system.
--threads 8 essentially locks up the whole device for me: Termux/llama.cpp fights the operating system for resources. It's cool that it can do that, but it's inefficient.
--threads 5 keeps my CPU around 80-90%, but performance is lower than at the sweet spot for my device, which is --threads 4 with OpenBLAS.
If CLBlast is built, then --threads 3 is better together with the -ngl parameter. It's interesting to watch the resource monitor during inference: the CPU throttles around 50-70%, and the GPU starts below 1%, spikes to 80-100% for a few seconds, then hovers around 20-30% while writing the response.
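For reference, the two configurations compared above would look roughly like this (the model path, prompt, and layer count are placeholders, not the exact commands I ran):

    # OpenBLAS build: CPU only, 4 threads is the sweet spot on this device
    ./main -m models/7B/ggml-model-q4_0.bin -t 4 -p "Hello"
    # CLBlast build: fewer CPU threads, offload layers to the GPU with -ngl
    ./main -m models/7B/ggml-model-q4_0.bin -t 3 -ngl 32 -p "Hello"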
Hi, I'm trying to compile llama.cpp against my OpenCL drivers. My device is a Samsung S10+ running Termux.
On downloading and attempting make with LLAMA_CLBLAST=1, I receive an error:
I edited the ggml-opencl.cpp file, trying to point it at my OpenCL libraries by replacing the header include with ocl_icd.h (as my include path is /data/data/com.termux/files/usr/include).
Then with make LLAMA_CLBLAST=1 I received this:
Current Behavior
It appears my OpenCL libraries are not found during compilation, and I don't know how to make llama.cpp recognize them.
clinfo:
lscpu:
clpeak:
Thanks for any direction on this matter.
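Not an authoritative fix, but one thing worth trying before editing the sources: clang honors the CPATH and LIBRARY_PATH environment variables, so you can point the build at Termux's headers and libraries without touching ggml-opencl.cpp (this assumes the OpenCL ICD loader and CLBlast are installed under the usual Termux prefix):

    # let clang find Termux's OpenCL/CLBlast headers and libraries
    export CPATH=/data/data/com.termux/files/usr/include
    export LIBRARY_PATH=/data/data/com.termux/files/usr/lib
    make clean
    make LLAMA_CLBLAST=1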