aidatatools / ollama-benchmark

LLM Benchmark for Throughput via Ollama (Local LLMs)
https://llm.aidatatools.com/
MIT License

Running on non-GPU laptops #8

Open twelsh37 opened 1 month ago

twelsh37 commented 1 month ago

Hey.

I am writing an article comparing and contrasting my desktop PC with my laptop. It runs fine on the desktop and gets decent throughput. Desktop:

{
    "hide it"
}

My laptop isn't as beefy. Laptop:

hide it

Running 'llm_benchmark run' in a Python virtual environment on my laptop is taking a very long time just to execute the first prompt against the mistral:7b model. It has been running for well over two hours.

The program did pull the 7 LLMs it required.

Looking at performance on my laptop, I see the following in Task Manager:

Windows Task Manager

CPU Utilisation : 80%
CPU Speed : 4.64 GHz
Memory in Use : 10.4 GB
Memory Available : 5.2 GB

Disk Space

Total Disk Space : 474 GB
Disk Space Available : 236 GB

Any pointers to make this run? I am convinced it can't be the ollama install, as I can run "Write a step-by-step guide on how to bake a chocolate cake from scratch" against ollama running llama3:8b and it completes in a little under 3 minutes (that's a rough guesstimate from scrolling back through the logs).

Running the same prompt from the 'ollama run mistral:7b' CLI, it completes even faster.

Why does it not complete from the llm_benchmark?

I have attached the server and app logs from my laptop to the issue: app.log, server.log

chuangtc commented 1 month ago

From your server.log, I noticed your POST "/api/generate" is taking too long: 1h40m51s.

[GIN] 2024/05/23 - 11:45:01 | 200 |      1h40m51s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2024/05/23 - 11:45:31 | 200 |     48.6085ms |       127.0.0.1 | GET      "/api/version"
[GIN] 2024/05/23 - 11:46:26 | 200 |   54.0535545s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:30 | 200 |    4.4838203s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:34 | 200 |    3.9806745s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:38 | 200 |    4.1286402s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:41 | 200 |    2.5784539s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:45 | 200 |    4.1422973s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:51 | 200 |    6.5098253s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/05/23 - 11:46:51 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/05/23 - 11:46:51 | 200 |       2.607ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/05/23 - 13:41:53 | 200 |       1h55m1s |       127.0.0.1 | POST     "/api/generate"

On my Windows machine, using Windows PowerShell to invoke Python, it takes 3 to 4 minutes for /api/generate.

As far as I know, the functionality of /api/generate does the following:

  • Receive Input Data: The endpoint receives a request containing input data. This could be in the form of a text prompt, parameters specifying the type of generation required, and possibly additional settings like temperature, maximum token count, etc.
  • Process Input: The server processes the input data, which might include pre-processing steps such as tokenization, input validation, and ensuring the input meets the required format.
  • Generate Output Using Model: The server uses a pre-trained model to generate the desired output based on the input. This involves passing the processed input data to the model, which then produces an output. The model could be a language model, image generation model, or any other type of generative model.
  • Post-Process Output: The generated output is post-processed to ensure it is in a suitable format for the user. This could include converting tokens back to text, formatting the output, and applying any necessary filters.
  • Send Response: The server sends back the generated content to the client as a response. This response typically includes the generated text or data, and possibly metadata about the generation process (such as time taken, tokens used, etc.).

In my llm_benchmark, it calls result = subprocess.run([ollamabin, 'run', model_name, one_prompt['prompt'], '--verbose'], capture_output=True, text=True, check=True, encoding='utf-8')
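
One way to narrow it down would be to time /api/generate directly over HTTP, independent of the subprocess call above. A rough sketch (assuming Ollama is listening on its default port 11434, mistral:7b is already pulled, and the requests package is installed; this is not part of llm_benchmark, just an illustration):

import time
import requests  # assumed to be installed; only used for this illustration

# stream=False makes Ollama return one JSON object that includes the
# timing metadata (all durations are reported in nanoseconds).
payload = {
    "model": "mistral:7b",
    "prompt": "Write a step-by-step guide on how to bake a chocolate cake from scratch.",
    "stream": False,
}

start = time.time()
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=7200)
resp.raise_for_status()
data = resp.json()

print(f"wall clock     : {time.time() - start:.1f} s")
print(f"total_duration : {data['total_duration'] / 1e9:.1f} s")
print(f"load_duration  : {data['load_duration'] / 1e9:.1f} s")
print(f"eval_count     : {data['eval_count']} tokens")
print(f"eval_duration  : {data['eval_duration'] / 1e9:.1f} s")

If that call is just as slow as the benchmark run, the bottleneck is Ollama (or the hardware) rather than the way llm_benchmark drives it.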

Maybe you can consult the ollama author to help investigate why /api/generate is taking so long.

twelsh37 commented 1 month ago

Thanks for the reply. I missed that when I looked at the logs. Yes, 1hr 55mins is a bit too long.

If I run the query on the CLI, it goes through fine.

I'll go have a look and see what I can ascertain.

chuangtc commented 1 month ago

Hid the system info to protect the user's privacy and removed the sensitive hardware specs from the question details.

twelsh37 commented 1 month ago

Hey. I got to the bottom of this. I hacked around in the code and made my own script to carry out the tests.

The server was timing out. I had to set a 300-second timeout on the tests, or they would fail. I have since bumped that up to 600 seconds, as I just want the test to pass.
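
Roughly the shape of what I ended up with (a sketch, not my exact script; the binary path, model name and prompt are placeholders):

import subprocess

def run_one_test(ollamabin, model_name, prompt, timeout_s=600):
    # 'ollama run ... --verbose' prints the timing stats (total duration,
    # load duration, eval rate, etc.) once generation finishes.
    try:
        result = subprocess.run(
            [ollamabin, "run", model_name, prompt, "--verbose"],
            capture_output=True, text=True, check=True,
            encoding="utf-8", timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        print(f"{model_name} did not finish within {timeout_s}s - marking as failed")
        return None
    return result.stdout, result.stderr  # response text plus the --verbose stats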

Below is the output from the first two tests. As you can see, the Total Duration time is ridiculous: 169 seconds and 200 seconds :(

Model: mistral:7b
Prompt: Write a step-by-step guide on how to bake a chocolate cake from scratch.
Total Duration Time (ms): 169432.6
Load Duration Time (ms): 6.76
Prompt Eval Time (ms): 1757.56, Eval Count: 21
Performance (tokens/s): 4.29

Model: mistral:7b
Prompt: Develop a python function that solves the following problem - sudoku game.
Total Duration Time (ms): 200958.45
Load Duration Time (ms): 5.4
Prompt Eval Time (ms): 1451.98, Eval Count: 17
Performance (tokens/s): 4.26
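
For anyone comparing numbers: the tokens/s figure is conventionally derived from the eval_count and eval_duration fields that Ollama reports (duration in nanoseconds). The snippet below is illustrative and not necessarily exactly what my script prints above:

# Ollama reports eval_count (tokens generated) and eval_duration (nanoseconds).
eval_count = 21           # illustrative value
eval_duration_ns = 4.9e9  # illustrative value (~4.9 s spent generating)
tokens_per_second = eval_count / (eval_duration_ns / 1e9)
print(f"{tokens_per_second:.2f} tokens/s")  # 4.29 with these numbers
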
chuangtc commented 1 month ago

If you think your hacking can benefit the whole ollama community, please fork my code and create a pull request so I can review your changes. We both want the community to benefit from the tool. Please see my post on LinkedIn: https://www.linkedin.com/pulse/ollama-benchmark-helps-buyers-decide-which-hardware-spec-chuang-ob7dc/