ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Performance Discrepancy: gpt4all Faster than Optimized llama.cpp #603

Closed · MillionthOdin16 closed this issue 1 year ago

MillionthOdin16 commented 1 year ago

Expected Behavior

I am comparing the performance of two executables: llama.cpp (current version) and the default gpt4all executable (which uses a previous version of llama.cpp). I am using the same language model for both executables, and I expect the current version of llama.cpp (which is built specifically for the hardware) to perform at least as fast as the default gpt4all executable.

Current Behavior

The default gpt4all executable, which uses a previous version of llama.cpp, performs significantly faster than the current version of llama.cpp. Despite building the current version of llama.cpp with hardware-specific compiler flags, it consistently performs significantly slower when using the same model as the default gpt4all executable.

Environment and Context

I am running the comparison on a Windows platform, using the default gpt4all executable and the current version of llama.cpp, with the model included in the gpt4all project. The version of llama.cpp is the latest available (after compatibility with the gpt4all model was added).

Steps to Reproduce

  1. Build the current version of llama.cpp with hardware-specific compiler flags.
  2. Execute the llama.cpp executable using the gpt4all language model and record the performance metrics.
  3. Execute the default gpt4all executable (previous version of llama.cpp) using the same language model and record the performance metrics.
  4. You'll see that the gpt4all executable generates output significantly faster, regardless of thread count or configuration.

Here's some context/config when I'm doing the runs:

(Screenshot: left panel is the latest llama.cpp, right panel is the gpt4all build.)

This is the older version that gpt4all uses (with some tweaks): https://github.com/zanussbaum/gpt4all.cpp

To quickly test the difference yourself, you can use the default gpt4all binaries here: https://github.com/nomic-ai/gpt4all/tree/main/chat

sw commented 1 year ago

@cyyynthia : it's great that you're putting in all this work

I had a go at reverting #439 and #500; I don't know if that will help you: https://github.com/sw/llama.cpp/commit/f747a43 (edit: some mistakes hopefully fixed: https://github.com/sw/llama.cpp/commit/f6d83dc)

It's faster than latest master at first glance, but I really haven't tested it thoroughly.

cyyynthia commented 1 year ago

It's faster than latest master at first glance, but I really haven't tested it thoroughly.

@sw Oh yes it is! Only 333s runtime! 107ms per token on average; this is by far the fastest run of them all!! Here is the graph, with only the interesting commits (see attached image).

And the raw CSV data: token_times.csv

MillionthOdin16 commented 1 year ago

@cyyynthia Okay, wow. Can you summarize that graph so I can make sure I'm interpreting it correctly? The vertical axis is time in seconds and x is token count in the context? Is this a fixed test that measures ms per token, and the difference is that great?

Relevant commits (for context tooltips): 437e778, 4870e45, 404e1da, sw/llama.cpp@f747a43

Also, thanks to the awesome people who are diving into performance analysis! Really appreciate the effort by all of you <3

cyyynthia commented 1 year ago

The vertical axis is in microseconds, and the horizontal axis is the number of tokens. I ran main with fixed parameters and let it generate 2000 tokens. For each token, I logged the time it took to generate and plotted it on the graph.

Each line is a single test on a specific commit (single run). The time it takes to generate a single token grows as more and more tokens accumulate in the context. The yellow line is where #439 was introduced; we can see the time it takes to generate a single token grows much faster than before (the green line). The blue line is the current master branch, and we can see the regression is still very pronounced.

The red line is sw's patch, which is master with #439 (and #500) reverted. We can see the time it takes to generate a single token is much lower and grows much more slowly (exactly like it used to). We can also see an overall performance increase, from all the SIMD optimizations that have been done in the past days.

I hope this explains it well enough!

edit: for reference, here are the total run times for all runs (from newest to oldest):
sw/llama.cpp@f747a43: 333s
437e778: 1448s
4870e45: 1673s
404e1da: 415s
d5850c5: 406s
da0e9fe: 411s
9b4a15b: 411s

MillionthOdin16 commented 1 year ago

That's awesome! I always wonder what the actual performance increase is when I see commits with additional optimizations.

This plot is super useful for evaluating performance. I wonder if we can turn it into a script that runs as a workflow for each new commit/release, both as a performance sanity check and just because it's cool to see the progress :)

Just an idea for the future. But awesome job!!!
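
As a rough starting point for such a workflow, here is a minimal C++ sketch that summarizes a per-token timing CSV like the token_times.csv shared above. The column layout (token index, microseconds per token), the file name, and the program itself are assumptions for illustration, not anything that exists in the repository.

// summarize_token_times.cpp -- toy helper, not part of llama.cpp.
// Assumes a CSV with lines of the form "<token index>,<microseconds>".
// Prints total runtime and average time per token, which a CI job could
// compare against a baseline to flag regressions.
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

int main(int argc, char ** argv) {
    const char * path = argc > 1 ? argv[1] : "token_times.csv";
    std::ifstream in(path);
    if (!in) {
        std::fprintf(stderr, "failed to open %s\n", path);
        return 1;
    }

    std::vector<long long> times_us;
    std::string line;
    while (std::getline(in, line)) {
        const auto comma = line.find(',');
        if (comma == std::string::npos) continue;   // skip headers and junk lines
        times_us.push_back(std::atoll(line.c_str() + comma + 1));
    }
    if (times_us.empty()) {
        std::fprintf(stderr, "no samples found in %s\n", path);
        return 1;
    }

    long long total_us = 0;
    for (long long t : times_us) total_us += t;

    std::printf("tokens      : %zu\n", times_us.size());
    std::printf("total time  : %.1f s\n", total_us / 1e6);
    std::printf("avg per tok : %.1f ms\n", total_us / 1e3 / times_us.size());
    return 0;
}

A per-commit job could run main with fixed parameters, collect the timing log into such a CSV, and compare these numbers against the previous build.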

ivanstepanovftw commented 1 year ago

I did not notice any performance improvement for https://github.com/sw/llama.cpp/commit/f747a43 =.= Slightly worse perplexity, and OpenBLAS is broken.

https://github.com/sw/llama.cpp/commit/f747a43:

main: seed = 1680654638
llama_model_load: loading model from 'models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from 'models/7B/ggml-model-q4_0.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks (335687 tokens, 512 n_ctx)
33.30 seconds per pass - ETA 6.06 hours
[1]4.5738,[2]5.1249,[3]5.9216,[4]6.6188,[5]6.7176,[6]6.7239,[7]6.9237,[8]7.0217,^C

Current master:


28.09 seconds per pass - ETA 5.11 hours
[1]4.5619,[2]5.1787,[3]6.0491,[4]6.7494,[5]6.7852,[6]6.7668,[7]6.9845,
cyyynthia commented 1 year ago

I did not notice any performance improvement for sw@f747a43 =.=

I'm highly unsure that perplexity is a meaningful test for measuring performance, at least here. As I've illustrated in my graphs, the token time is fine at the beginning, as long as only a few tokens are in the context; the degradation only shows up as the context fills, and we drift further and further from the baseline. At 512 tokens my data shows a 2x slowdown in token time, and at 2000 tokens a 7.5x slowdown.

I think OpenBLAS being broken is sort of expected: the point of the branch was to see whether the same token-time degradation occurred if we reverted the offending commit but kept everything else (and according to my data, the degradation completely vanishes), not to find a proper fix for the degradation yet.

edit: I'm even more convinced that perplexity and the reported ETA are not a good test at all, as the measurement seems to be done on an empty context (see here), meaning it is totally blind to the core of this issue and to what my analysis points out.

MillionthOdin16 commented 1 year ago

I see a significant improvement for https://github.com/sw/llama.cpp/commit/f747a43. My experience on Windows matches cyyynthia's data. I didn't realize how bad it had gotten until I ran the reverted version and generation was so much faster. Thanks :)

sw commented 1 year ago

Glad to see that this shows an improvement for some of you. I must admit I just quickly threw this together. I haven't looked at OpenBLAS at all, so I apologize for that being broken. (edit: I can't get OpenBLAS to link right away, but some mistakes are hopefully fixed here: https://github.com/sw/llama.cpp/commit/f6d83dc)

Reverting #439 is straightforward, but my second commit, which purports to revert #500, is a bit disingenuous as far as the commit description is concerned. It certainly deserves further scrutiny; for example, the int->int64_t changes from #626 are likely incomplete, which may break large tensors (as seen in #599).

Also, my intention is certainly not to have this slapped back onto master without further investigation. The intentions behind #439 and #500, namely reducing the complexity of the code base, were good after all.

KASR commented 1 year ago

@ivanstepanovftw I agree with @cyyynthia: we first need to confirm the root cause of the performance degradation, and right now it does appear to be #439. On a side note, it has been discussed that perplexity improves when using BLAS (see #406 for more info), so it's difficult to compare results with and without BLAS.

I also agree with @sw: ggml_cpy was introduced for a reason. To quote ggerganov: "Seems like the "transposed X" branch is more efficient, but not numerically stable. Will try to see if I can resolve it and if not, I will simply remove it all together from ggml." And indeed, before #439 the inference results differed when changing e.g. the number of threads, the batch size, etc.

@sw Depending on the number of tokens, and especially at n=2048, the improvements are very big (at least on Windows when building with CMake and VS; I can't speak for other configurations), i.e. this has a significant impact during interactive mode.

cyyynthia commented 1 year ago

this has a significant impact during interactive mode

I believe this is incorrect. It's not the -n that matters, it's how many things are in the context memory (i.e. -n_ctx and how far we are in the generation/interaction). On the revert branch, I've had significantly faster responses in interactive mode on the 13B model. I don't have data to back all this up, but I'm pretty sure the impact is the same in interactive mode.

ggerganov commented 1 year ago

Thank you for this hard work - I missed this regression because I rarely run generations with more than a few tens of tokens. The problem is that the transpose operation for the V matrix is very slow and becomes slower and slower with every new token added.

I think I have provided a fix here: https://github.com/ggerganov/llama.cpp/pull/775

Tested only on M1 so far
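
To see why the per-token transpose hurts more as the context grows, here is a toy C++ sketch. It is not ggml code and not the fix from #775; it only models the memory traffic of re-transposing a cache that holds n_past tokens on every new token, with n_embd borrowed from the 7B logs above and a smaller context for the demo.

// transpose_cost_demo.cpp -- toy illustration, not ggml code.
// Models only the memory traffic of re-transposing a growing "V cache"
// on every new token; the attention math itself is omitted.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n_embd = 4096;   // embedding size, as in the 7B logs above
    const int n_ctx  = 512;    // tokens to "generate" for the demo

    std::vector<float> v  (n_embd * n_ctx);   // cache, one row per token
    std::vector<float> v_t(n_embd * n_ctx);   // transposed copy

    for (int n_past = 1; n_past <= n_ctx; ++n_past) {
        const auto t0 = std::chrono::steady_clock::now();

        // Re-transposing the whole cache touches n_past * n_embd floats per
        // token, so the work per token keeps growing as the context fills.
        for (int i = 0; i < n_past; ++i) {
            for (int j = 0; j < n_embd; ++j) {
                v_t[j * n_ctx + i] = v[i * n_embd + j];
            }
        }

        const auto t1 = std::chrono::steady_clock::now();
        if (n_past % 128 == 0) {
            std::printf("n_past = %3d  transpose: %6lld us\n", n_past, (long long)
                std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
        }
    }

    // Keeping V stored transposed in the cache instead would mean each new
    // token only writes one column (n_embd floats), independent of n_past.
    return 0;
}

Since the per-token cost grows linearly with n_past, the total cost of a run grows quadratically with the number of generated tokens, which matches the curves in the graphs above.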

sw commented 1 year ago

#775 was merged pretty quickly. @cyyynthia, if it's not too much trouble, could you update your pretty graphs with that?

MillionthOdin16 commented 1 year ago

Also @cyyynthia is there a way that we can adapt your process as a way to analyze future builds?

Do you have a script or something that you were using that could help us create something like that?

People have done some pretty cool stuff and I want to make sure it isn't lost once the issue is resolved.

cyyynthia commented 1 year ago

#775 was merged pretty quickly. cyyynthia, if it's not too much trouble, could you update your pretty graphs with that?

@sw The graphs aren't super pretty this time because I didn't take the time to properly close everything and had a bunch of things open in the background while the test was running 😅

That being said, the regression appears to be gone. 🎉 Here is the graph (see attached image) and the raw CSV: token_times.csv

Do you have a script or something that you were using that could help us create something like that?

@MillionthOdin16 As described earlier, my process was simply shoving a fprintf(stderr, "debug_token_time: %ld\n", ggml_time_us() - t_start_us); into the llama_eval_internal function and dumping the numbers it spat out into a CSV. Nothing too complex 😅

It could be added to llama.cpp itself behind a compile flag for enabling it more easily. But it's not rocket science, and I didn't use any special script for it.
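
For illustration, here is a standalone C++ sketch of that idea. The LLAMA_PER_TOKEN_TIMING flag name and the dummy workload are assumptions for this demo; in llama.cpp itself, the guarded fprintf would sit at the end of the evaluation function, as in the one-liner quoted above.

// per_token_timing_demo.cpp -- standalone sketch, not llama.cpp code.
// Build with: g++ -O2 -DLLAMA_PER_TOKEN_TIMING per_token_timing_demo.cpp
// Without the flag, the timing lines compile out entirely.
#include <chrono>
#include <cmath>
#include <cstdio>

// Stand-in for one eval call whose cost grows with the context size n_past.
static double fake_token_work(int n_past) {
    double acc = 0.0;
    for (int i = 0; i < n_past * 20000; ++i) acc += std::sqrt((double) i);
    return acc;
}

int main() {
    double sink = 0.0;   // keep the work from being optimized away
    for (int n_past = 1; n_past <= 256; ++n_past) {
#if defined(LLAMA_PER_TOKEN_TIMING)
        const auto t_start = std::chrono::steady_clock::now();
#endif
        sink += fake_token_work(n_past);
#if defined(LLAMA_PER_TOKEN_TIMING)
        const auto t_end = std::chrono::steady_clock::now();
        // One CSV-friendly line per token: <token index>,<microseconds>
        std::fprintf(stderr, "debug_token_time: %d,%lld\n", n_past, (long long)
            std::chrono::duration_cast<std::chrono::microseconds>(t_end - t_start).count());
#endif
    }
    std::printf("checksum: %f\n", sink);
    return 0;
}

Redirecting stderr to a file then yields the same kind of per-token CSV used for the graphs in this thread.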

MillionthOdin16 commented 1 year ago

Okay, thanks for summarizing the process.

That graph makes me so happy ❤️

sw commented 1 year ago

Well I guess that settles it. The Great Wizard Georgi has saved the day! Thanks to @cyyynthia and @KASR for putting in the hard work of tracking this down.

I have opened #790 to track the discrepancy between the various partial times and the total time. I think this issue can be closed. Thanks everyone.