ggerganov / llama.cpp

LLM inference in C/C++

Output halts for 10 seconds every 10-20 lines in long output using Apple Silicon #1809

Closed · shouyiwang closed this 1 year ago

shouyiwang commented 1 year ago

I understand that a bug report similar to this one was submitted recently. However, that issue was resolved by removing the "-n" parameter, while the issue I'm encountering is distinct.

For context, my system configuration is as follows: MacBook Air M1 with 16GB of RAM, running macOS 13.3. I am using Python 3.10.9, GNU Make 3.81, and g++ (Apple clang version 14.0.3).

While running recent versions of the code, including the k-quants branch, I have observed that the output halts for approximately 10 seconds every 10-20 lines once more than 256 tokens have been generated.

I conducted tests with various models, including the 13B Q4_0 and 7B Q3_K_M. Given that my system has 16GB of RAM, I believe it should be sufficient to accommodate any of these models, especially the 7B variants.

Steps to Reproduce: Execute the following command: ./main -m ggml_xxx.bin -ngl 1 -p "write a story about a cruise travel in 1000 words"

Please be patient: if the output story is short, you will not observe the problem. It only appears when the output is long.

During my tests with the 13B Q4_0 model, I observed the following in the Activity Monitor app:

When the program is actively outputting tokens:

- CPU utilization is low
- GPU utilization is close to 100%
- Memory usage is around 14.8 GB
- Cached files occupy approximately 1 GB

During the brief intervals when the output is suspended (lasting around 10 seconds):

- CPU utilization spikes to nearly 100%
- GPU utilization drops to 0%
- Memory usage drops to 8 GB
- Cache size increases to 8 GB

After 10 seconds, the resource usage reverts back to what is observed during active output.

I would greatly appreciate it if you could take a moment to investigate this issue. Thank you.

africalimedrop commented 1 year ago

out of interest, and while you're waiting for people with actual know-how to show up, try adding --ctx-size 2048

my thinking here is that the default context is getting filled up during completion and is being rotated (possibly inefficiently, if that's even what's actually happening), causing the hitch. increasing the size may make the hitch even longer, but you should also not encounter it until many more tokens have been generated, the goal being that you won't run into it under 'single prompt and exit' conditions. long instruct or chat sessions will still encounter it, of course. anyway, i pulled all that out of my butt and it may not even be how things work; that's just my intuition as a layman.
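for example, your original command with the larger context would look something like this (same model placeholder and prompt as in your report):

./main -m ggml_xxx.bin -ngl 1 --ctx-size 2048 -p "write a story about a cruise travel in 1000 words"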
shouyiwang commented 1 year ago

@africalimedrop IT WORKS! Thank you so much!

I am wondering why the default context size is not changed to a larger value. The current default appears to be lower than 512, which in my opinion is very small.

SlyEcho commented 1 year ago

The default is 512, but when the context fills up, the first half is deleted to create space at the end; that's why it seems to happen every 256 tokens for you.

There is not much to be done about this; 2048 is bigger, but ultimately it has the same issue.

The only way is to speed up the evaluation with hardware acceleration like Metal on macOS.
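At the time of this thread, Metal support had only recently been merged. Assuming the LLAMA_METAL build option from that period (check the current README, since build flags change), enabling it looked roughly like this:

# Rebuild with Metal enabled (historical flag; newer builds may differ).
make clean && LLAMA_METAL=1 make

# Offload to the GPU with -ngl, as in the original command.
./main -m ggml_xxx.bin -ngl 1 --ctx-size 2048 -p "write a story about a cruise travel in 1000 words"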

shouyiwang commented 1 year ago

@SlyEcho I noticed that x86 CPUs running on Linux perform evaluations over 10 times faster than the M1, regardless of whether it's using CPU or Metal. Could this speed difference be due to the fact that x86 CPUs have AVX while the M1 doesn't?

SlyEcho commented 1 year ago

The M1 has other instructions, like NEON, that fill the same role, and llama.cpp uses them.

I think your problem on the Mac is related to memory, or rather the lack of it. Did you try the --mlock or maybe even --no-mmap flags?
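A quick sketch of those flags applied to the original command:

# --mlock asks the OS to keep the model resident in RAM.
./main -m ggml_xxx.bin -ngl 1 --mlock -p "write a story about a cruise travel in 1000 words"

# --no-mmap loads the whole model into memory instead of memory-mapping the file.
./main -m ggml_xxx.bin -ngl 1 --no-mmap -p "write a story about a cruise travel in 1000 words"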

shouyiwang commented 1 year ago

@SlyEcho I gave it another shot using the 7B Q4_0 model with --mlock enabled, on both a Linux system with a Ryzen 5600 CPU and an M1 Mac.

Ryzen 5600 (CPU only):

llama_print_timings: prompt eval time =   641.59 ms /   111 tokens (    5.78 ms per token,   173.01 tokens per second)
llama_print_timings:        eval time = 19288.90 ms /   127 runs   (  151.88 ms per token,     6.58 tokens per second)

M1 (CPU):

llama_print_timings: prompt eval time =  4091.32 ms /   111 tokens (   36.86 ms per token,    27.13 tokens per second)
llama_print_timings:        eval time =  9503.74 ms /   127 runs   (   74.83 ms per token,    13.36 tokens per second)

My Mac has 16GB of RAM, which is more than enough for a 7B model, and I haven't noticed any disk read/write activity after loading it into RAM. However, the Ryzen 5600 is about 6.4x faster than the M1 at prompt processing, even though in many other benchmarks the 5600 and the M1 perform at roughly the same level.

SlyEcho commented 1 year ago

5.78 ms per token

Are you sure the Ryzen system is CPU-only? Because these speeds are only achievable with CUDA or something.

shouyiwang commented 1 year ago

@SlyEcho I am sure. The previous command was executed without -ngl; my only GPU was idle during the inference.

And these are my CUDA (RTX 4090) figures with -ngl 128:

llama_print_timings: prompt eval time =   127.26 ms /   111 tokens (    1.15 ms per token,   872.22 tokens per second)
llama_print_timings:        eval time =  1075.21 ms /   127 runs   (    8.47 ms per token,   118.12 tokens per second)

Just to be clear, the first line is the time for evaluating the prompt, and the second line is the time for generating tokens.

SlyEcho commented 1 year ago

Even without -ngl it uses the GPU for the prompt evaluation.

shouyiwang commented 1 year ago

@SlyEcho I didn't know that. Why is it much faster with -ngl?

SlyEcho commented 1 year ago

With -ngl it doesn't have to copy the data to the GPU all the time.
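A rough sketch of the difference, assuming a cuBLAS-enabled build from around this time (historical LLAMA_CUBLAS build flag; newer versions may differ):

# Build with cuBLAS support.
make clean && LLAMA_CUBLAS=1 make

# Without -ngl: the weights stay in system RAM. The GPU is still used for
# batched prompt evaluation, but layer data is copied to the GPU each time.
./main -m ggml_xxx.bin -p "write a story about a cruise travel in 1000 words"

# With -ngl: the offloaded layers are kept in VRAM, so token generation also
# runs on the GPU without repeated host-to-device copies.
./main -m ggml_xxx.bin -ngl 128 -p "write a story about a cruise travel in 1000 words"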

shouyiwang commented 1 year ago

@SlyEcho I just recompiled llama.cpp on Linux without CUDA support, so now it indeed runs on the CPU. The output is:

llama_print_timings: prompt eval time =  5234.21 ms /   111 tokens (   47.16 ms per token,    21.21 tokens per second)
llama_print_timings:        eval time = 18948.20 ms /   127 runs   (  149.20 ms per token,     6.70 tokens per second)

Yeah, it's slower than M1.

Thank you for your help!