aidatatools / ollama-benchmark

LLM Benchmark for Throughput via Ollama (Local LLMs)
https://llm.aidatatools.com/
MIT License

Support for small systems / SBCs #9

Closed. dan-and closed this issue 4 weeks ago.

dan-and commented 4 weeks ago

Thanks for this straightforward benchmarking tool for ollama.

Since small systems with little memory but added NPUs are on the horizon, I slightly adjusted the scripts to support systems with 4 GB of RAM or less, as well as a super low-memory variant for just 2 GB of RAM.

This also helps to measure the performance of the smallest GPUs without the penalties caused by CPU/GPU workload splits.

I will add a few comments with results from various machine combinations, including Raspberry Pis.

I never uploaded them, as additional model variants were added and I don't know how your backend will react.
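
To illustrate the idea behind the low-memory variants described above, here is a minimal Python sketch that picks a benchmark models file based on total system RAM. The file names match the ones that appear in the logs below; the function name and the thresholds are my own illustrative assumptions, not the actual llm_benchmark code.

```python
# Hypothetical sketch (not the actual llm_benchmark code): pick the benchmark
# models file from the total system RAM. The file names match the ones that
# appear in the logs below; the thresholds are illustrative assumptions.
import os

def pick_models_file(total_ram_gb: float) -> str:
    if total_ram_gb <= 2.0:
        return "benchmark_models_2gb_ram.yml"   # e.g. qwen:1.8b, gemma:2b
    if total_ram_gb <= 4.0:
        return "benchmark_models_3gb_ram.yml"   # adds phi:2.7b, phi3:3.8b
    return "benchmark_models_16gb_ram.yml"      # full model set incl. llava:13b

if __name__ == "__main__":
    # Linux-only way to read total RAM without third-party packages.
    total_ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    print(pick_models_file(total_ram_gb))
```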

dan-and commented 4 weeks ago

Benchmark with an Nvidia CMP 30HX (6 GB VRAM, GTX 1660 Super equivalent)

```
$ llm_benchmark run --no-sendinfo
-------Linux----------
{'id': '0', 'name': 'NVIDIA CMP 30HX', 'driver': '555.42.02', 'gpu_memory_total': '6144.0 MB', 'gpu_memory_free': '5753.0 MB', 'gpu_memory_used': '391.0 MB', 'gpu_load': '0.0%', 'gpu_temperature': '33.0°C'}
Only one GPU card
Total memory size : 31.23 GB
cpu_info: Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz
gpu_info: NVIDIA CMP 30HX
os_version: Ubuntu 24.04 LTS
ollama_version: 0.1.41

LLM models file path:/home/daniel/source/ollama-benchmark/llm_benchmark/data/benchmark_models_16gb_ram.yml
Checking and pulling the following LLM models
qwen:1.8b phi:2.7b gemma:2b gemma:7b mistral:7b llama3:8b phi3:3.8b llava:7b llava:13b

model_name = mistral:7b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 19.56 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 17.72 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 22.64 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 19.34 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 21.36 tokens/s
Average of eval rate: 20.124 tokens/s

model_name = llama3:8b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 18.09 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 18.04 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 18.47 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 18.98 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 24.49 tokens/s
Average of eval rate: 19.614 tokens/s

model_name = phi3:3.8b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 22.15 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 22.43 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 24.46 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 20.65 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 28.70 tokens/s
Average of eval rate: 23.678 tokens/s

model_name = qwen:1.8b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 54.99 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 56.00 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 54.11 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 60.31 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 55.01 tokens/s
Average of eval rate: 56.084 tokens/s

model_name = phi:2.7b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 49.26 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 33.70 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 49.46 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 46.23 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 65.02 tokens/s
Average of eval rate: 48.734 tokens/s

model_name = gemma:2b
prompt = Explain Artificial Intelligence and give its applications. eval rate: 46.82 tokens/s
prompt = How are machine learning and AI related? eval rate: 47.80 tokens/s
prompt = What is Deep Learning based on? eval rate: 46.78 tokens/s
prompt = What is the full form of LSTM? eval rate: 57.02 tokens/s
prompt = What are different components of GAN? eval rate: 46.21 tokens/s
Average of eval rate: 48.926 tokens/s

model_name = gemma:7b
prompt = Explain Artificial Intelligence and give its applications. eval rate: 8.01 tokens/s
prompt = How are machine learning and AI related? eval rate: 8.16 tokens/s
prompt = What is Deep Learning based on? eval rate: 8.25 tokens/s
prompt = What is the full form of LSTM? eval rate: 8.96 tokens/s
prompt = What are different components of GAN? eval rate: 8.09 tokens/s
Average of eval rate: 8.294 tokens/s

model_name = llava:7b
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample1.jpg eval rate: 30.44 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample2.jpg eval rate: 29.72 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample3.jpg eval rate: 28.62 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample4.jpg eval rate: 29.83 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample5.jpg eval rate: 30.33 tokens/s
Average of eval rate: 29.788 tokens/s

model_name = llava:13b
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample1.jpg eval rate: 4.26 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample2.jpg eval rate: 4.17 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample3.jpg eval rate: 4.07 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample4.jpg eval rate: 4.20 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample5.jpg eval rate: 4.20 tokens/s
Average of eval rate: 4.18 tokens/s
```
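
The "Average of eval rate" reported by the tool is simply the arithmetic mean of the five per-prompt eval rates; a quick check against the mistral:7b numbers above:

```python
# Reproduce the "Average of eval rate" for mistral:7b on the CMP 30HX run:
# it is the arithmetic mean of the five per-prompt eval rates.
rates = [19.56, 17.72, 22.64, 19.34, 21.36]  # tokens/s, from the log above
average = sum(rates) / len(rates)
print(f"Average of eval rate: {average:.3f} tokens/s")  # -> 20.124 tokens/s
```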

dan-and commented 4 weeks ago

Benchmark with an Nvidia GTX 1060 3GB (3 GB VRAM, the low-cost version of the GTX 1060)

```
$ llm_benchmark run --no-sendinfo
-------Linux----------
{'id': '0', 'name': 'NVIDIA GeForce GTX 1060 3GB', 'driver': '555.42.02', 'gpu_memory_total': '3072.0 MB', 'gpu_memory_free': '3005.0 MB', 'gpu_memory_used': '67.0 MB', 'gpu_load': '0.0%', 'gpu_temperature': '28.0°C'}
Only one GPU card
Total memory size : 31.23 GB
cpu_info: Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz
gpu_info: NVIDIA GeForce GTX 1060 3GB
os_version: Ubuntu 24.04 LTS
ollama_version: 0.1.41

LLM models file path:/home/daniel/source/ollama-benchmark/llm_benchmark/data/benchmark_models_16gb_ram.yml
Checking and pulling the following LLM models
qwen:1.8b phi:2.7b gemma:2b gemma:7b mistral:7b llama3:8b phi3:3.8b llava:7b llava:13b

model_name = mistral:7b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 6.06 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 6.94 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 7.12 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 7.18 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 7.65 tokens/s
Average of eval rate: 6.99 tokens/s

model_name = llama3:8b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 5.40 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 5.58 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 5.58 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 5.57 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 5.86 tokens/s
Average of eval rate: 5.598 tokens/s

model_name = phi3:3.8b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 16.60 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 17.07 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 17.74 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 17.36 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 18.40 tokens/s
Average of eval rate: 17.434 tokens/s

model_name = qwen:1.8b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 66.52 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 49.25 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 68.50 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 69.48 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 68.59 tokens/s
Average of eval rate: 64.468 tokens/s

model_name = phi:2.7b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 47.44 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 47.06 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 115.01 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 115.13 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 56.63 tokens/s
Average of eval rate: 76.254 tokens/s

model_name = gemma:2b
prompt = Explain Artificial Intelligence and give its applications. eval rate: 58.20 tokens/s
prompt = How are machine learning and AI related? eval rate: 58.62 tokens/s
prompt = What is Deep Learning based on? eval rate: 58.33 tokens/s
prompt = What is the full form of LSTM? eval rate: 62.70 tokens/s
prompt = What are different components of GAN? eval rate: 58.26 tokens/s
Average of eval rate: 59.222 tokens/s

model_name = gemma:7b
prompt = Explain Artificial Intelligence and give its applications. eval rate: 4.18 tokens/s
prompt = How are machine learning and AI related? eval rate: 4.24 tokens/s
prompt = What is Deep Learning based on? eval rate: 4.24 tokens/s
prompt = What is the full form of LSTM? eval rate: 4.45 tokens/s
prompt = What are different components of GAN? eval rate: 4.21 tokens/s
Average of eval rate: 4.264 tokens/s

model_name = llava:7b
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample1.jpg eval rate: 6.67 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample2.jpg eval rate: 6.61 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample3.jpg eval rate: 6.60 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample4.jpg eval rate: 6.48 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample5.jpg eval rate: 6.62 tokens/s
Average of eval rate: 6.596 tokens/s

model_name = llava:13b
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample1.jpg eval rate: 2.86 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample2.jpg eval rate: 2.87 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample3.jpg eval rate: 2.89 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample4.jpg eval rate: 2.84 tokens/s
prompt = Describe the image, /home/daniel/source/ollama-benchmark/llm_benchmark/data/img/sample5.jpg eval rate: 2.88 tokens/s
Average of eval rate: 2.868 tokens/s
----------------------------------------
```

dan-and commented 4 weeks ago

Raspberry Pi 4 - 2GB SBC with Debian

```
llm_benchmark run --no-sendinfo
-------Linux----------
No GPU detected.
Total memory size : 1.80 GB
cpu_info: l2-cache0
gpu_info: unknown
os_version: Debian GNU/Linux 12 (bookworm)
ollama_version: 0.1.42

LLM models file path:/home/daniel/source/ollama-benchmark/llm_benchmark/data/benchmark_models_2gb_ram.yml
Checking and pulling the following LLM models
qwen:1.8b gemma:2b

model_name = qwen:1.8b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 2.94 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 3.00 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 3.05 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 3.17 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 3.11 tokens/s
Average of eval rate: 3.054 tokens/s

model_name = gemma:2b
prompt = Explain Artificial Intelligence and give its applications. eval rate: 1.84 tokens/s
prompt = How are machine learning and AI related? eval rate: 1.82 tokens/s
prompt = What is Deep Learning based on? eval rate: 1.81 tokens/s
prompt = What is the full form of LSTM? eval rate: 1.84 tokens/s
prompt = What are different components of GAN? eval rate: 1.81 tokens/s
Average of eval rate: 1.824 tokens/s
```

dan-and commented 4 weeks ago

Raspberry Pi 4 - 4GB SBC with Debian

```
# llm_benchmark run --no-sendinfo
-------Linux----------
No GPU detected.
Total memory size : 3.70 GB
cpu_info: Cortex-A72
gpu_info: no_gpu
os_version: Debian GNU/Linux 12 (bookworm)
ollama_version: 0.1.41

LLM models file path:/home/daniel/source/ollama-benchmark/llm_benchmark/data/benchmark_models_3gb_ram.yml
Checking and pulling the following LLM models
qwen:1.8b phi:2.7b phi3:3.8b gemma:2b

model_name = phi3:3.8b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 1.27 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 1.21 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 1.25 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 1.23 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 1.29 tokens/s
Average of eval rate: 1.25 tokens/s

model_name = qwen:1.8b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 3.13 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 2.78 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 2.60 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 3.13 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 3.10 tokens/s
Average of eval rate: 2.948 tokens/s

model_name = phi:2.7b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 1.82 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 2.10 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 3.99 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 4.00 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 2.07 tokens/s
Average of eval rate: 2.796 tokens/s

model_name = gemma:2b
prompt = Explain Artificial Intelligence and give its applications. eval rate: 1.93 tokens/s
prompt = How are machine learning and AI related? eval rate: 1.89 tokens/s
prompt = What is Deep Learning based on? eval rate: 1.99 tokens/s
prompt = What is the full form of LSTM? eval rate: 1.92 tokens/s
prompt = What are different components of GAN? eval rate: 1.88 tokens/s
Average of eval rate: 1.922 tokens/s
```

dan-and commented 4 weeks ago

Raspberry Pi 5 - 4 GB

```
$ llm_benchmark run --no-sendinfo
-------Linux----------
No GPU detected.
Total memory size : 3.95 GB
cpu_info: Cortex-A76
gpu_info: no_gpu
os_version: Debian GNU/Linux 12 (bookworm)
ollama_version: 0.1.41

LLM models file path:/home/daniel/source/ollama-benchmark/llm_benchmark/data/benchmark_models_3gb_ram.yml
Checking and pulling the following LLM models
qwen:1.8b phi:2.7b phi3:3.8b gemma:2b

model_name = phi3:3.8b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 3.54 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 3.47 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 3.62 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 3.52 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 3.76 tokens/s
Average of eval rate: 3.582 tokens/s

model_name = qwen:1.8b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 6.62 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 7.46 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 7.15 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 8.57 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 8.34 tokens/s
Average of eval rate: 7.628 tokens/s

model_name = phi:2.7b
prompt = Write a step-by-step guide on how to bake a chocolate cake from scratch. eval rate: 4.71 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game eval rate: 5.09 tokens/s
prompt = Create a dialogue between two characters that discusses economic crisis eval rate: 10.77 tokens/s
prompt = In a forest, there are brave lions living there. Please continue the story. eval rate: 10.75 tokens/s
prompt = I'd like to book a flight for 4 to Seattle in U.S. eval rate: 5.41 tokens/s
Average of eval rate: 7.346 tokens/s

model_name = gemma:2b
prompt = Explain Artificial Intelligence and give its applications. eval rate: 5.44 tokens/s
prompt = How are machine learning and AI related? eval rate: 5.47 tokens/s
prompt = What is Deep Learning based on? eval rate: 5.40 tokens/s
prompt = What is the full form of LSTM? eval rate: 5.80 tokens/s
prompt = What are different components of GAN? eval rate: 5.45 tokens/s
Average of eval rate: 5.512 tokens/s
----------------------------------------
```

chuangtc commented 4 weeks ago

Integration is done. v0.3.20 is on PyPI: https://pypi.org/project/llm-benchmark/

dan-and commented 4 weeks ago

Thank you very much:-)

chuangtc commented 4 weeks ago

@dan-and qwen2 is the latest one. In your testing environment, which one do you recommend? I am going to retire qwen and put in qwen2. For phi from Microsoft, I think phi3:3.8b is good enough in its latest version, so I am thinking of retiring phi:2.7b. Any opinion is appreciated. Thank you.

dan-and commented 4 weeks ago

Hey Jason,

I'm totally fine with retiring old models. However, please keep in mind that memory-restricted systems should still be able to run the benchmark.

benchmark_models_2gb_ram.yml: a substitute for qwen:1.8b would be qwen2:1.5b, which uses just 1.3 GB of RAM.

benchmark_models_3gb_ram.yml: phi:2.7b is a sweet spot memory-wise at 2.4 GB, so it fits nicely on old GPU cards like the 1060 3GB. It is also a great option for 4 GB SBCs, leaving enough headroom to avoid swapping.

phi3:3.8b, on the other hand, uses 3.4 GB, which no longer fits.

I totally understand that you want to retire old models, but I see no good phi3-based substitute for a sub-3 GB model.

Just my 2 cents
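
As a quick way to compare candidate models' footprints, here is a small sketch against Ollama's /api/tags endpoint, which lists locally pulled models and their on-disk sizes. On-disk size is only a rough proxy for the RAM/VRAM a model needs at runtime (the figures quoted above are runtime memory), but it helps when weighing candidates such as qwen2:1.5b against phi:2.7b.

```python
# Rough sketch: list locally pulled models and their on-disk sizes via the
# Ollama REST API (default endpoint http://localhost:11434). Disk size is only
# a rough proxy for runtime RAM/VRAM use, but it helps compare candidates
# such as qwen2:1.5b vs. phi:2.7b.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = json.load(resp)["models"]

for m in sorted(models, key=lambda m: m["size"]):
    print(f"{m['name']:<20} {m['size'] / 1024**3:.1f} GB")
```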

chuangtc commented 3 weeks ago

OK, I'll follow your suggestion. Here is the integration result on PyPI: v0.3.21. https://pypi.org/project/llm-benchmark/#history