ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Performance Discrepancy: gpt4all Faster than Optimized llama.cpp #603

Closed MillionthOdin16 closed 1 year ago

MillionthOdin16 commented 1 year ago

Expected Behavior

I am comparing the performance of two executables: llama.cpp (current version) and the default gpt4all executable (which uses a previous version of llama.cpp). I am using the same language model for both executables, and I expect the current version of llama.cpp (which is built specifically for the hardware) to perform at least as fast as the default gpt4all executable.

Current Behavior

The default gpt4all executable, which uses a previous version of llama.cpp, performs significantly faster than the current version of llama.cpp. Despite building the current version of llama.cpp with hardware-specific compiler flags, it consistently performs significantly slower when using the same model as the default gpt4all executable.

Environment and Context

I am running the comparison on a Windows platform, using the default gpt4all executable and the current version of llama.cpp included in the gpt4all project. The version of llama.cpp is the latest available (after the compatibility with the gpt4all model).

Steps to Reproduce

  1. Build the current version of llama.cpp with hardware-specific compiler flags.
  2. Execute the llama.cpp executable using the gpt4all language model and record the performance metrics.
  3. Execute the default gpt4all executable (previous version of llama.cpp) using the same language model and record the performance metrics.
  4. You'll see that the gpt4all executable generates output significantly faster for any number of threads or config.

Here's some context/config when I'm doing the runs:

(Screenshot: left panel is the latest llama.cpp, right panel is the gpt4all build.)

This is the older version that gpt4all uses (with some tweaks): https://github.com/zanussbaum/gpt4all.cpp

To quickly test the difference yourself, you can use the default gpt4all binaries here: https://github.com/nomic-ai/gpt4all/tree/main/chat

sw commented 1 year ago

Screenshot on the left has FMA and SSE3 enabled, is that the one that is faster? Try building both with the same flags.

MillionthOdin16 commented 1 year ago

No, that's what I'm saying. The left is optimized llama.cpp, the right is unoptimized gpt4all. The unoptimized gpt4all is significantly faster than the optimized llama. So there's something wrong.

slaren commented 1 year ago

Without bisecting the exact commit that introduced a performance regression it is hard to do much about it. I suspect that it happened when the code for transposed matrix multiplication was replaced with a copy.

MillionthOdin16 commented 1 year ago

First I'm just trying to check whether someone else can reproduce the behavior. I def wanna use the llama.cpp version because it has more features haha

ggerganov commented 1 year ago

Double check you have built llama.cpp in Release

x02Sylvie commented 1 year ago

Personally I've had a small performance regression (13B model) over the last 14 days, from 207 ms to 238 ms, but it's no biggie

xor2003 commented 1 year ago

On Windows, I see significant slowness when comparing the latest llama.cpp running the 30B model with the gpt4all LoRA model.

dakennedyd commented 1 year ago

I also notice performance drops on x86-64 Linux, and it also uses a lot more memory than before. I compiled the project following the instructions in the README.md

jessejohnson commented 1 year ago

I am using the same language model for both executables

@MillionthOdin16 How did you do that? I couldn't get this to work yesterday. See this issue

MillionthOdin16 commented 1 year ago

I am using the same language model for both executables

@MillionthOdin16 How did you do that? I couldn't get this to work yesterday. See this issue

Did you try it in the last few hours? There were some commits a couple of hours ago that made it easy. Look at the README section that was added for GPT4All.

It kinda drives me crazy that the one dude forked llama.cpp then stopped maintaining it, because other repos are forking his repo which is outdated ☹️

alph4b3th commented 1 year ago

What is gpt4all, and how does it differ from llama.cpp and alpaca.cpp? I parachuted into this conversation and don't really know the context, I apologize.

anzz1 commented 1 year ago

We should definitely look into this, as this shouldn't be the case. I'm pretty confident, though, that enabling the optimizations didn't cause it, since the perf was pretty well researched when we did that in #375. If performance got lost and memory usage went up somewhere along the way, we'll need to look at where it happened. If it doesn't run well, everyone is just going to fork from an older point in time instead.

@BrunoIsaac27

alpaca.cpp is pretty much a fork of an older llama.cpp version (which is apparently faster); nothing much is really changed except a few default variables.

gpt4all is a fork of the alpaca.cpp fork, with modifications tailored specifically to the gpt4all model.

ggerganov commented 1 year ago

Someone on Windows should try and bisect to see where this alleged degradation happens - I don't have a Windows machine. I'm pretty sure there is no performance degradation on macOS and Linux. The only slowness introduced, as @slaren mentioned, was the removal of the transposed ggml_mul_mat path, which led to about 10% performance loss during single-token inference (i.e. after prompt ingestion). But this should have been compensated by the various updates in the SIMD code.

I guess the ggml_mul_mat change could have had a bigger impact on the prompt inference performance, but only if you are not linking to BLAS. In that case, the solution is to link to BLAS - you will gain significant speed-up during prompt ingestion.

But overall, we need some more specific info about your setup and timing numbers to be able to pinpoint the problem.

anzz1 commented 1 year ago

alpaca.cpp / gpt4all was forked specifically at this point: https://github.com/ggerganov/llama.cpp/commit/9b4a15b17d8395eb075379b140fcd0b0283f4ef6 while at time of writing, we are here: https://github.com/ggerganov/llama.cpp/commit/9cbc404ba6699a9ba4925ea25a60552b13491c7a

There are exactly 180 commits between then and now, obviously too many to test manually. Here is the full list of commits for reference:

Commit list ```plain 9cbc404ba6699a9ba4925ea25a60552b13491c7a <--- now b51c717d5cf9181c33afcb84554e47f6d539c891 0ba76c1e73ae21038b80bfb5a746157376c88173 cea1c859483a5cfc7e2b31a06f8561d7a7604870 f202ada131f60059112a948f660b2e0ac93d049a 3b44d30d9b618f0f2eb9abcfe912770a4e7d85d4 61cbfff5c95e45236883b1b60e025f8f6fa8c8a3 d9ad104440d84a0cc0734bff47ef0ba41ba740c4 b467702b87461543c75013207e9adc6d20dcc01d 516d88e75c9e768c0001a452dbad212494c586b3 53635c081c49321d523567112f9fddfbba6b787b 41318d708ed196ff727dce14d263a64b23c7333d a6956b25a1c783e5e96fe06c9c00438f846ef047 83df5639eb182ed7c122382907691d8baa3c32df a5c42c4b13b3be9e58fe8f9adbb6ee60417674a6 5a5f8b1501fbb34367225544010ddfc306d6d2fe f1217055eaedfc7214be93d98e529cae89830430 7f4c5c66514227c3870c2bd189fb0609fdd0de10 2a98bc18ea34dbf15f261a0df37080e588a189d1 d0aaff571cd5c316b68e3e11d57e274bfd2bd457 d0330fd783d7c67349cdcce4a56604ef0aeccdb5 99c5b2765422232ebb4414f5a63693d734406a7f 692ce3164ef1201ecb9cfad315cc0a08b965adb8 96f9c0506fa81cada6f96f45768c34f45406c4bb d502bc7c9d9d6dfb3a09aea404395b666d7b374d 436e56193199a1625f8c561069f702e8840a9e08 20e1e84884376b3fb44ffbfd48d478b2934b0b5e c1f885067c61191a07a1aedf684168dda62f3f71 e0670260fb50a882b37074112b1881fb0820cf77 28ba975aea1dcae2f31770516f5d542ff177771e a6bdc47cba23713a22ade47dd65b6afeb8009ff4 7b8dbcb78b2f65c4676e41da215800d65846edd0 4b8efff0e3945090379aa2f897ff125c8f9cdbae 7e5395575a3360598f2565c73c8a2ec0c0abbdb8 34c1072e497eb92d81ee7c0e12aa6741496a41c6 939ad2d3a56815f480b6fd5ea432a7ee576a7e6b 8c2ec5e21d580c99e257c3cfddcf21fa53229aa4 b391579db92f095666be1d979899b54ae0981573 7a87d31f4f0c37bbb2ea695929fa4fe3ad579cda 348d6926ee31d4476f9b90e1a627b0925a70f847 33e35b8fe8f09adcac0632e9cece62e1dd629f7d 19726169b379bebc96189673a19b89ab1d307659 f732695cd57fb41e3a1be625cec4edf5be45b40a 2f7bf7dd7cd7299874d582f7f34834418abf4057 34ab5268432fd287caa68d60bdd8aef411def3fa c2b25b6912662d2637d9c6e6df3a5de931e0d7ce 79b2b266db6b198b5af450982c3cd61120fac951 e2d490dafd860eaaaf9aa8008ab790527d556daf 03f7e335604b3d68f74995aa2ccb4955833ee423 55ad42af845127bd0eb0c1f36f327ecec83f4bca 459e93cce07cab9052c06b5bf360819893442e1e a316a425d04027453dc0fd45f003b647c12f66f9 ecbe466a364876927994e2f1ec14f4d82301d201 502a400192013d3e95ed87b777e8fa3bec45713c 09aecbf6283bbce9449e2d96000073145aaaf5fc 4640eff23d341a0273587800e17ff4a378132d60 ab77d7631211b299cb734bea6ad1f74324154150 29b7baab670ae8b76ac0da21c2ded69ff18971ee 4a7129acd2e939b92d70dd568c746f2fa078232c 6b6dbc8910c6d53f4d96c46c8fcec70e2cd435d8 2a2e63ce0503d9bf3e55283e40a052c78c1cc3a8 e899bf54b291e8c84173a0e534a2c262f3f63229 fbd4d38c647f82b2598291ea9b8d0c09ac1ffb8c 58e6c9f36f97d0a3e287b97256dc5f6b0e9fb5ae 36d07532ef7ccf0bdc12e050472f359a6794957f 6f1ee4b640912211a4b07965c585d327e32e734d 8520fc310eab87f2c4612f2a00d4adbd44a20d0d b3f460e94139cb24b0af81cc8bc10eb86269d704 04c6f5ed6fafd63601fa06757877ed5ccf9d5991 7a9b6c3a8bdc1cb75fefc826dfaa7331eb63695d 31572d966531f7d768eb773322016ab78eb6e835 f4f5362edb01b05c383b23f36d7b3489c77061b5 863f65e2e32dc1e6d23c96a4811bf382d6b2a548 afd220d9c665e4c19107120ace2f0cb742e28aa1 481044d50cfe8eaa6cd0c1a1b445680e4b0b3ebc 563cdc391dde140f1084d1012234e8e6f57f881f 8d4a855c241ecb0f3ddc03447fe56002ebf27a37 b6b268d4415fd3b3e53f22b6619b724d4928f713 3cd8dde0d1357b7f11bdd25c45d5bf5e97e284a0 4870e455b3653f7d7769fa5772b2c90ffad088df 483bab2e3d4a868fe679d8bb32827d2a4df214dc 404e1da38ec8025707031a8027da14dc1590f952 4cc053b6d5e9df7ac21fa06b7208a70c156d4d7a 0ba5a3a9a5efedb1aeecbbc70a4e9825542472d5 2e17dfd80a473099dacc0f41c9146d233c6a5972 
20a1a4e09c522a80e2a0db51643d25fa38326065 ad072fc5ad6f6905a7224ff6ea07c0644aa075b1 ea10d3ded2994106596ddf8e4ed02741b3e053e6 a18c19259a3cb9dec332d613e8f15704f678a468 a50e39c6fe36be3de0941b3c05aaf9c37912fd47 a140219e81cfb80356438112cd2290d701b282bb 8a3e5ef801339e57b9b0449220e9ffb11a6648e2 8eea5ae0e5f31238a97c79ea9103c27647380e37 93208cfb929c2323e5d2ac6bf354e278040e70ed 03ace14cfd68a1289ac3b76563534c8ee72a2e53 e4412b45e395981068d2850d3fa04cc16c77d70d f7dc43bc0d759732815856183246f167111587ad ee8a7887865a893be208e0a92d6d94d2cb66a789 69c92298a9e36dc2363b3bf50452976ce49487b3 97940520e8fd49c56bb29b71cc350190b723513f 305ba6f0e6daa3796aad9dd18053a1945dd4cc58 4122dffff958cd137175b58f1f27c0913528d7ba 56e659a0b271436e24813a801640d015e7b05328 40ea807a972ec7b5a426f034ebfa593b5e7a06ed d5850c53ca179b9674b98f35d359763416a3cc11 ae44e23ee36c02da0e37ab508a4b473ace724f8e 928480ef5b7b03d7a07e98286aebe3d8b24457d9 56817b1f882b1894daa4051d0de0bf9a0926d315 f5a77a629bd0f37ae1696747633ab42a5530ec15 da0e9fe90ccf6e73597eb19dd0cfc0a28363fb3b e6c9e0986c79ba1cc8848879b2fcce979f9b4672 01a297b09932e29f3319d6588977c32a926c7907 3366853e41fcc818222a0271c76b6106179106fb 3f9c6135e45ae3f520b1e17197004cc60c9ca45b 0f6135270839f0715843c4d480c63ae150def419 353ec251a42491f5192c48561da4b444ef67f23c 89d5d90f3b6d25f134da7a8e252c3432bffcf674 16ffc013c62f22bdaa3cdc022d7a13fd952d73fc 486ae645fd3eda8b9d7413d5ff34fb65a3e337fb 3ab3e6582f7320c2b6568c892fdfc8215caf7e6c f157088cb75f23208abc92b473a132ef3f7a7f15 c86ba036e613d46815501a4c6775117c9fc7afce 1daf4dd71235dbbf537738e7ad53daad8d97586f dc6a845b8573cd7d06c6b295241d26f311602a1f 6a612959e1b6c37b68b6b141329751a2902b1030 d5f56a5e5a0069329a81f96460221e7afb1daddc 3bfa3b43b7319b71853bfc7d3cf4e9767c24bbc8 715d292ee0e34d27f27af43d7feaad1f1344981d c98ae02668a25916954b1653e25a5a35ca048d63 c3b2306b18a087799acc431e485b8a2e3162cd52 975d2cebf97ce888fa0aeee6f5ac774d7135891f e0ffc861fae5ac8b40ce973f822d03db02929d36 8f644a0a859938c787d329d27f98e03c58d7df27 eb34620aeceaf9d9df7fcb19acc17ad41b9f60f8 2e664f1ff413995506c9a54f3a8d5b8c64e37a91 8cf9f34eddc124d4ab28f4d2fe8e99d574510bde bd4b46d6ba504b99c936f43fc014529adffb6048 6b6d5b5024faaf82019d08cde5e8a9d69c6ca316 a791a68b613b162c88a83f5f0225223bc167c762 0f1b21cb90ac6b84a9af70cafb8e13b5389e3b32 074bea2eb1f1349a0118239c4152914aecaa1be4 5cb63e2493c49bc2c3b9b355696e8dc26cdd0380 da5303c1ea68aa19db829c634f1e10d08d409680 4545539d718cf88f4c3a76669b8ac2e26cd8a1e5 edeba283665591f2f726024a92efe4b0b40434b3 5c19c70ba631a8f5d54feb6634e0eea178911a84 24568371ae0d7caf85164abe4753f36a7dba0288 7392f1cd2cef4dfed41f4db7c4160ab86c0dfcd9 ad5fd5b60cfdfbfb22b0f2bc9e9f6c9692768f8d 368d0c8a9ebae16a20e1c8971b21ee888bdefad5 50fae10d0339f2bd639f69dd679c0201d939a265 084e2f0ec081c929343d44b09df07ae87cd1ed32 0b366e735729327476ec31da02de3c9c9771ddfb 160bfb217da5038ccbd74438f9f16a16012d7866 c494ed5b94b429d3d73721235e78c9f5fa6e5652 c1c7026b470ced0b8a6c67e968c04bb47864def1 467b149761cd63248b00d6ffb204d50a4cbb451a 70f01cb8632f73b5cf70428608b89cd3c0775d23 a4e63b73dfa1894387926cc8072b5f36deebf0a5 9e1707218a24ff758c7b623594f8c0ce5e12eb6c 22213a17b56336bbea384a572a9484ce208c0333 d7def1a7524f712e5ebb7cd02bab0f13aa56a7f9 6f61c18ec9a30416e21ed5abfb1321bdb14979be 1e5a6d088d0f3a967c6e86298a756daec9e8df12 554b54152145c30618bac171efb712cf4a7d1e96 d3f202d57b694376cef6f381a6b6901825c3f6d9 e03e359730c127f888fcf00e93375771bc0a3500 a81d0c2a171a4446e6a21a3ec74a0c0768d71184 b2de7f18dfbb93463eeb5b4392117bbe82d5bd1b a29274789309029fd88a9465e6d0832d4632272b c9f670a17755311aa28c411f5c7f3c8c05434770 
4f546091102a418ffdc6230f872ac56e5cedb835 e81b9c81c101f64531ef0fa1ee6b77d562635652 367946c668757532deed929e1d78673c6ac6bcb8 6b0df5ccf360fe5c015f6607f0375bfc6849005e 2af23d30434a677c6416812eea52ccc0af65119c 904d2a8d6acd667c9633138d45a361d40fbf76d0 721311070e31464ac12bef9a4444093eb3eaebf7 ac15de789547e5a6e93df552e787379b3a23ef26 273abc47ff9dd899b3c4f58acd19d4649e90d6b4 9b4a15b17d8395eb075379b140fcd0b0283f4ef6 <--- alpaca.cpp/gpt4all was forked here ```

I was thinking what we'd need is a script which: start -> create directory -> git clone and build commit id -> log the performance of some runs to a file -> remove directory -> loop to start

That would narrow it down to exactly where it happened.

Unfortunately the GitHub Actions runners are very limited; otherwise it would be easy to add a performance test to the test suite, run on every pull request, to nip these issues in the bud before the situation devolves like this, where we have to go bug hunting through a list of 180 commits.

sw commented 1 year ago

I'm thinking it would be a good thing to measure performance more accurately.

Wall-clock time is not good enough, especially if there are other things happening on the system. You have to let it run for a long time to get a good enough average.

Another pitfall may be that a test suite causes downclocking of the processor. So the first test will get a cold processor running at full speed, and the later tests will have to run on a hot, slow processor.

Maybe getrusage(RUSAGE_THREAD, ...) would be useful here to get per-thread usage information, which could then be collected by the thread pool manager? Of course you could use RUSAGE_SELF to get the data for all threads combined, but maybe we want to see if all threads get used equally. We might also look into RDTSC and whatever the equivalent of getrusage is on Windows.
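A minimal sketch of that idea, assuming Linux (RUSAGE_THREAD is Linux-specific; on Windows, GetThreadTimes would be the rough equivalent), with a dummy loop standing in for the per-thread work:

```cpp
// Per-thread CPU time via getrusage - a sketch, not ggml code.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                               // needed for RUSAGE_THREAD on glibc
#endif
#include <cstdio>
#include <sys/resource.h>

static double thread_cpu_ms() {
    rusage ru{};
    getrusage(RUSAGE_THREAD, &ru);                // this thread only; RUSAGE_SELF = whole process
    return (ru.ru_utime.tv_sec  + ru.ru_stime.tv_sec)  * 1e3 +
           (ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) / 1e3;
}

int main() {
    const double t0 = thread_cpu_ms();
    volatile double sink = 0.0;
    for (int i = 0; i < 10000000; ++i) sink += i; // stand-in for one worker's share of a compute pass
    const double t1 = thread_cpu_ms();
    printf("this thread used %.3f ms of CPU time\n", t1 - t0);
    return 0;
}
```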

anzz1 commented 1 year ago

@sw Very good points. "Preheating" the processor before a set of runs would ensure stability between sets of runs. Then again, the first-run advantage quickly dissipates when doing a larger set of runs; especially on desktop computers with good cooling this isn't much of a problem, unless you happened to just fire up your PC and instantly start a test. There is also the problem, especially on Windows (from Windows 8 onwards, each new iteration being worse than the one before), of the OS being so noisy that it affects performance greatly, so a large set of runs is required to achieve anything resembling an accurate average.

That is more about perf testing in general, though; in this case the perf drop is significant enough to be visible by inspection, so probably just a run or three would be enough to narrow down where it happened.

I think that is the most feasible way to go about this, since manually trying to figure out where the problem lies in the now much-changed codebase would be harder than just letting a script run the perf tests and tell you.

edit: the downclocking part you linked is exactly the thing I was trying to remember. I actually posted something earlier about AVX512 sometimes having worse performance for some workloads, especially on the earlier Intel processors which first introduced the instruction set, but couldn't remember exactly what the cause was - that was definitely it. Whatever improvement AVX512 brought to the table was offset by the downclocking, so that overall performance actually decreased.

slaren commented 1 year ago

I was thinking what we'd need is a script which: start -> create directory -> git clone and build commit id -> log the performance of some runs to a file -> remove directory -> loop to start

This can be done as a binary search too, the git bisect command should help with that.

anzz1 commented 1 year ago

This can be done as a binary search too, the git bisect command should help with that.

Very interesting, I actually had no idea such a command existed. It seems useful in many cases. However, in this case, since there was also the model format change somewhere in the middle, it'd simply be easier to go about it sequentially until you run into the commit where the format was changed, change the model only once, and proceed. Or run two binary searches, one for the commits before and one for those after the model change; that is also an option.

To be honest, once you've already gone to the trouble of setting the script up and have some compute time set aside for it, a sequential run would also give a log of the performance deltas of all the commits, showing which ones increased performance and which ones decreased it, as it might not be any single commit that's causing it but a pile-up of smaller decreases here and there. There have obviously been steps forward but also steps back in the bunch.

KASR commented 1 year ago

I've written a small Python script to benchmark the token time as a function of the number of threads. I've attached the script if anyone wants to try it ( --> benchmark_threads.txt ; I had to change the extension in order to upload it). It could be useful for benchmarking performance across different versions. If not, just ignore this message :)

Below you can see the result from my pc. I'm using windows 10, the typical system information looks like:

system_info: n_threads = xx/ 36 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

I didn't go all the way up to 36 threads; you can see the results below. The script runs each prompt 5 times and plots the average token times. For some reason the timings go up at around 18 threads (i.e. the number of physical cores). I will try it again later to see how robust the timings are.

Feel free to propose suggestions to make the benchmark more reliable.

(Plot: average token time vs. number of threads)

Edit: I've updated the plot so that it includes the eval and prompt eval times as well (don't mind the typos). It's really strange that the performance drops around 18 threads (i.e. the number of physical cores) and then drops again afterwards...

(Updated plot: average token time vs. number of threads, including eval and prompt eval times)

slaren commented 1 year ago

@KASR at the moment it seems that you are only measuring the prompt eval time; I would recommend considering the prompt eval time and the regular (generation) eval time separately. You can safely assume that 1 run = 1 token in the eval time.

anzz1 commented 1 year ago

@KASR That looks good. If you were to do a similar graph where the x-axis was not threads but commits, with x = 0 being https://github.com/ggerganov/llama.cpp/commit/9b4a15b17d8395eb075379b140fcd0b0283f4ef6 and x = 180 being https://github.com/ggerganov/llama.cpp/commit/9cbc404ba6699a9ba4925ea25a60552b13491c7a, it would solve our problem right then and there of figuring out where the increases and decreases happened.

MillionthOdin16 commented 1 year ago

Wow, thanks for the critical thinking, guys. You've mentioned some pretty interesting points. It's pretty crazy to see what affects performance; looking at some of the discussions, it seems like things such as Intel performance cores can have a significant impact (although not in my case 🙃).

I can definitely help with testing and performance metrics; I just need to make a script that'll get reliable builds between versions for my environment. It's pretty picky and often needs tweaks to make the build succeed.

One of the differences/struggles right now is that the current llama.cpp gives much more performance metric info than the build used in gpt4all, so it's hard to see the specific timings in the older gpt4all version. Apart from that, I want to make sure that the info I'm collecting while running builds for specific commits is actually the info that will help us.

But overall, we need some more specific info about your setup and timing numbers to be able to pinpoint the problem.

As for my build and build process, I have a Ryzen 3900X (12c/24t) and use CMake and Ninja to build my executables. I've also built with BLAS linked, but haven't seen a noticeable difference using the library vs not. Other than that I use avx avx2 maxv maxv2 f16c sss3 on Release. I usually run with -t 8, and the models I use are the 4-bit quantized 7B.

I guess the ggml_mul_mat change could have had a bigger impact on the prompt inference performance, but only if you are not linking to BLAS. In that case, the solution is to link to BLAS - you will gain significant speed-up during prompt ingestion.

Where should I expect to see the performance increases when I'm running with BLAS? Is it during larger completions after the prompt is loaded?

I could also do things in WSL2, but I'm not sure about the performance impacts, which is why I currently don't use it. If you think it would be better let me know.

Again, awesome job guys! You're having a huge impact on making these models accessible to normal people 🔥🔥🔥

anzz1 commented 1 year ago

@MillionthOdin16 For Windows, the most common configuration (4 or 8 threads, AVX=yes, SSE3=yes, F16C=yes, AVX2=yes, AVX512=no, BLAS=no, WSL2=no) would definitely be the best to base the benchmarks on. Obviously, if you want to and have the time for it, more is always better.

The most important thing to know would be the performance data between commits starting from https://github.com/ggerganov/llama.cpp/commit/9b4a15b17d8395eb075379b140fcd0b0283f4ef6 and ending to https://github.com/ggerganov/llama.cpp/commit/9cbc404ba6699a9ba4925ea25a60552b13491c7a .

That is what will help in understanding where the decreases happened. Since there have been many commits with optimizations and performance increases, it makes no sense that gpt4all/alpaca/llama-9b4a15b is faster; it should be slower, because it doesn't have any of the recent optimizations. That leads to only one conclusion: there must have been significant decreases at some points in the timeline. It could be something not easily seen, like the compiler dropping inlining because of some change (inline and force-inline aren't the same, and the compiler can even drop force-inline), or a mistake somewhere; we can't really know. Only data can save us. :smile:
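To illustrate the inline point, here is a minimal sketch (the FORCE_INLINE macro and function names are made up for this example, they are not from ggml): inline is only a hint, while an explicit always-inline attribute is a strong request that the compiler may still refuse, e.g. at -O0 or for very large functions.

```cpp
#include <cstdio>

// portable spelling of "force-inline"; the macro name is just for this example
#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#else
#define FORCE_INLINE inline __attribute__((always_inline))
#endif

// plain hint: the compiler is free to keep this as an out-of-line call
static inline int add_hint(int a, int b) { return a + b; }

// strong request: usually inlined, but still not guaranteed
static FORCE_INLINE int add_forced(int a, int b) { return a + b; }

int main() {
    printf("%d %d\n", add_hint(1, 2), add_forced(3, 4));
    return 0;
}
```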

cyyynthia commented 1 year ago

I've been able to observe some performance degradation on Arch Linux as well. I didn't have time to look for the precise commit yet, but I found the potentially helpful information that the degradation seems to have happened after the ggml format migration, which may help simplify the exploration. I think it would be nice if someone else could confirm this and make sure it isn't something that happens only for me 😅

I'll keep doing some exploration, but here are the numbers I observed so far:

System info: Arch Linux - CPU: Intel Core i7-7700K. Compiled using make.
n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0
All of the runs were run with the same parameters, using the 7B model (Q4_0).

I only use 4 threads, as 8 threads tend to cause performance degradation for me.

ed3c680

llama_print_timings:        load time =  3013.13 ms
llama_print_timings:      sample time =   277.93 ms /   512 runs   (    0.54 ms per run)
llama_print_timings: prompt eval time =  1990.05 ms /    14 tokens (  142.15 ms per token)
llama_print_timings:        eval time = 131235.49 ms /   511 runs   (  256.82 ms per run)
llama_print_timings:       total time = 137390.49 ms

074bea2 (first commit using new format)

main:     load time =  1953.12 ms
main:   sample time =   248.32 ms
main:  predict time = 102235.23 ms / 194.73 ms per token
main:    total time = 105018.24 ms

9b4a15b

main:     load time =  1385.53 ms
main:   sample time =   247.39 ms
main:  predict time = 101518.52 ms / 193.00 ms per token
main:    total time = 103729.24 ms

x02Sylvie commented 1 year ago

This might come in handy for tech-savvy lads here who need a slight performance boost: https://github.com/ggerganov/llama.cpp/pull/295 - you may need to modify the code a bit to make it work on the latest commits, however.

anzz1 commented 1 year ago

Thank you @cyyynthia , something is definitely up here.

Interestingly, the new format made load times go up 40%, but the sampling and predict times stayed the same (within the margin of error).

I've only now woken up to this, since you don't tend to see marginal changes (in general, in anything); I've always been on the latest version and didn't notice the performance degrading gradually. But obviously everything is now much slower (loading, sampling, prompt evaluation), and this is a high-priority issue. Anyone trying out gpt4all/alpaca.cpp vs current-gen llama.cpp will find it painfully obvious, while for someone just developing incrementally it has gone by unnoticed.

cyyynthia commented 1 year ago

After a bit more digging, #439 seems to be a very clear culprit of performance drops: eval time goes from 175ms (better than before!) @ 404e1da to 244ms @ 483bab2.

It seems other timings do fluctuate in some intriguing ways, with increased loading times and sample times. I'll try to put together a test script and plot the evolution of these values over time on my machine.

cyyynthia commented 1 year ago

I've done a first test and already gathered some interesting data. I ran my script on 18 commits from the range cited earlier, skipping 10 commits each time. Doing it on the full range will take a while, so I wanted to see what I could get without too much precision first.

Here are the graphs I've gathered, and the raw CSV data if you want to explore it some more. I'll run it on all 180 commits later, probably tomorrow.

result.csv

(Plots: execution time, eval token time, memory usage)

ggerganov commented 1 year ago

@cyyynthia

Thank you very much for this effort and the detailed analysis.

The memory increase is expected and easy to explain - see https://github.com/ggerganov/llama.cpp/pull/473#issuecomment-1483399313

The eval time increase is surprising. I am mostly running on Apple Silicon and it looks like I have overlooked this performance degradation. Will look further for ways to resolve it

cyyynthia commented 1 year ago

I figured the memory increase was expected, but I found it quite interesting nonetheless. The load times graph is also quite interesting:

(Plot: loading times)

If anyone is interested in running the test on their machine too, here is the script I used (beware, it's written in NodeJS because I'm not very much of a Python dev and it was much faster for me to put together, and it might need slight tweaks depending on your machine 😅). Simply put the script in your llama.cpp folder (next to the Makefile) and run it.

https://gist.github.com/cyyynthia/43784451936e2a608566c42b0bacceac

note: the script now supports all 3 formats (it previously didn't work with the ggjt format); it expects the models to be placed in specific folders, check the script for more details.

KASR commented 1 year ago

@cyyynthia FYI I'm also trying to run all the commits; I've converted the code to Python and extended it a bit so that it works with Windows/CMake/VS. I run each commit 5 times to obtain average timings. I will try to post some results if it finishes without errors.

I still have to figure out something: at which commit do you switch from llama.exe to main.exe?

LLAMA_PATH = 'C:/DATA/TestLLama/CommitTest'

LLAMA_EXE_0 = os.path.join(LLAMA_PATH, 'build/Release/llama.exe')
LLAMA_EXE_1 = os.path.join(LLAMA_PATH, 'build/bin/Release/main.exe')

Or did you use another trick to avoid the folder/name change?


edit: it got stuck after 35 commits at 2456837, I will investigate later what's happening...

The results thus far are gathered in the csv --> result.csv

cyyynthia commented 1 year ago

Or did you use another trick to avoid the folder/name change?

I started collecting data from commit 2d64715ad475f192a4004a52d134c67ccb6f44ad; anything before that is skipped, and the binary was already named main at that point (at least when using make).

it got stuck after 35 commits

24568371ae0d7caf85164abe4753f36a7dba0288 and 5c19c70ba631a8f5d54feb6634e0eea178911a84 are broken commits and do not work (infinite loop; it triggered a timeout I added to my script). You'll also experience 3 compile errors, at f5a77a629bd0f37ae1696747633ab42a5530ec15, 928480ef5b7b03d7a07e98286aebe3d8b24457d9 and ae44e23ee36c02da0e37ab508a4b473ace724f8e. I updated my script locally before running it so it would be resilient to these issues (didn't reflect that on the gist).

I also adapted my script to skip commits which were not related to ggml/llama/the example, to speed up the process - something I haven't reflected yet on my gist.

Running 5 times and getting an average is a great idea. I haven't done that for my own thing because it would take forever and my poor CPU has been suffering enough 😅 That being said, I ran it twice, and I must say some results seem to indicate it'd have been much better to do 5 runs and take the average.

Here are my 2 runs, from 2d64715ad475f192a4004a52d134c67ccb6f44ad to ee0c40dd6de8c3c658ae43199939ef40bb1cf408 (excluding commits which didn't introduce any code change):

result_first_run.csv result_second_run.csv

I also updated my gists with the updates I listed above (handle llama run issues, skip irrelevant commits).

KASR commented 1 year ago

2456837 and 5c19c70 are broken commits and do not work (infinite loop; it triggered a timeout I added to my script). You'll also experience 3 compile errors, at f5a77a6, 928480e and ae44e23. I updated my script locally before running it so it would be resilient to these issues (didn't reflect that on the gist).

Good to know! I just skipped a few commits and let the loop continue at 074bea2, since this was included in your first run. As a result, I seem to have jumped over the 2 faulty commits.

The loop successfully passed the 3 commits where you had the compile errors; this might be an OS/compiler/etc. issue...

So far, the loop is at 119 / 193 commits.

I will restart the loop once it completes: I had to restart a couple of times (continuing at the commit where it failed) to account for special cases (e.g. llama.exe -> main.exe etc.). Even though I'm looping 5 times over each commit, I would like to run the loop in a single run while the PC isn't doing anything else, so that we have consistent timings.

xor2003 commented 1 year ago

system_info: n_threads = 8 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

On Ubuntu 22.04.2 with gcc 11 or clang 15: for llama.cpp with the 30B model I see very slow responses - long answers where each word takes 1-2 seconds (at the same time, the startup with mmap is extremely fast). But with alpaca.cpp and the 13B model, the answers are much faster.

KASR commented 1 year ago

@anzz1 @cyyynthia @ggerganov here are the results of the run over 191 commits; as the title indicates, I've evaluated each commit 10 times to obtain the average, max and min token evaluation times. You can find the full results here --> Benckmark_commits_results.csv

I've uploaded the script as a gist --> benchmark_commits_llama_cpp.py

For completeness: I'm using Windows 10, CMake 3.24.3 and VS22, on an Intel Xeon W-2295, but limited the threads to 6 to be more representative. The benchmark was done using Python, each time deleting the build folder, switching commits and rebuilding. 7 commits failed to run, which I excluded both from the graph and from the results in the CSV file.

I've used the following options for the inference:

-s 147852369 
-t 6 
-p 'Here is a long story about how programming came to be: '
-n 128 
-c 1024 
--top_k 40 
--top_p 0.95 
--repeat_last_n 64 
--repeat_penalty 1.1 

Below you can see the plot for the token timings:

(Plot: average token times per commit)

edit: @jart, below you can see the average loading times on my system:

(Plot: average load times per commit)

xor2003 commented 1 year ago

For my test on ubuntu:

  1. I did a bisect on llama.cpp with the 7B model, and comparing master with alpaca.cpp there was no performance degradation.
  2. I can see that 30B is much slower compared with 7B on master. Is that expected? I used the following command: ./main -t 7 -s 1 -m /llama/models/30B_/ggml-model-q4_0.bin -p "Count to 3."

Here you can find the script for git bisect: llama.cpp.zip

sw commented 1 year ago

@KASR Thanks for your effort!

Is the sum of the 4 partial times supposed to be roughly equal to the total time? Because there's some variation there in your data (see plot). The three large steps are f5a77a6, 29b7baa, 78ca983.

ggerganov commented 1 year ago

@KASR Very nice work! Your investigation ends right before the latest AVX optimizations:

I expect that after these changes the timings will be back to normal, or probably after we merge #654.

@sw After the mmap changes, the load time is incorrect:

https://github.com/ggerganov/llama.cpp/blob/6e7801d08d81c931a5427bae46f00763e993f54a/llama.cpp#L1681-L1685

Currently, the reported load time includes not only the page faults but also the prompt eval time. So effectively you get a negative number, since the prompt eval time has been accounted for twice. We have to fix this.

KASR commented 1 year ago

@sw I've uploaded the script that I used as a gist ( benchmark_commits_llama_cpp.py )

The timings were simply extracted from the llama timing printouts. But indeed, when you add up the printed timings there is still a bit missing.

e.g. below: 8922.26 - ( 5603.04 + 1169.53 + 21.99 + 1836.83) = 290.87 ms

(Screenshot of the printed llama timings)

x02Sylvie commented 1 year ago

https://github.com/ggerganov/llama.cpp/commit/f5a77a629bd0f37ae1696747633ab42a5530ec15 and https://github.com/ggerganov/llama.cpp/commit/483bab2e3d4a868fe679d8bb32827d2a4df214dc seem to be major culprits in my case, when I locally checked the milliseconds before/after the above-mentioned commits.

The slowdown may seem minimal with a small context, but as the conversation grows longer, until hitting the max (2048 context?), the slowdown introduced in the above commits also increases, in my case at least.

MillionthOdin16 commented 1 year ago

I've noticed this as well. Tokens per second definitely drops as the conversation grows.


Seltsamsel commented 1 year ago

f5a77a6 and 483bab2 seem to be major culprits in my case, when I locally checked the milliseconds before/after the above-mentioned commits.

The slowdown may seem minimal with a small context, but as the conversation grows longer, until hitting the max (2048 context?), the slowdown introduced in the above commits also increases, in my case at least.

Just to make sure, you're saying that before these commits there was no slowdown for larger contexts? Or just a less significant slowdown?

cyyynthia commented 1 year ago

Tokens per second definitely drops as the conversation grows.

I've been observing this too and I got curious, so I did some measurements to see how bad the slowdown was. I've plotted all the token times for -n 2000 (2048 ctx). To export the timings, I modified llama.cpp line 1003 and made it print the token time to stderr.

This graph is from a single run in a slightly noisy environment, so take it with a grain of salt. I'm definitely intrigued by the sudden bump in token time in the middle, and by the acceleration of the performance degradation. I'll try to run more tests on it, and try to extract values from older versions as well. (Plot: per-token times over a 2000-token generation)

edit: this test was run on 437e77855a54e69c86fe03bc501f63d9a3fddb0e
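For anyone wanting to collect the same kind of data, here is a rough standalone illustration of the approach (a sketch only, not the actual patch to llama.cpp): a dummy inner loop stands in for the real per-token eval call, and the per-token times go to stderr so they can be redirected into a CSV while the generated text stays on stdout.

```cpp
#include <chrono>
#include <cstdio>

int main() {
    const int n_tokens = 16;        // stand-in for the -n 2000 generation loop
    volatile double sink = 0.0;

    for (int i = 0; i < n_tokens; ++i) {
        const auto t0 = std::chrono::high_resolution_clock::now();

        // stand-in for evaluating one token
        for (int j = 0; j < 1000000; ++j) sink += j * 0.5;

        const auto t1 = std::chrono::high_resolution_clock::now();
        const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

        fprintf(stderr, "%d,%.3f\n", i, ms);   // token_index,eval_ms -> ./a.out 2> token_times.csv
    }
    return 0;
}
```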

Seltsamsel commented 1 year ago

I'm definitely intrigued by the sudden bump in token time at the middle

That could be context swapping, see

https://github.com/ggerganov/llama.cpp/blob/437e77855a54e69c86fe03bc501f63d9a3fddb0e/examples/main/main.cpp#L246-L265

However, if I understand it correctly, it should only happen about once every n_ctx/2 tokens and not for every token, which is what the graph implies. Anyway, for your example the location should be around 1024, which fits perfectly.

Edit: never mind, context swapping should first happen at n_ctx, not n_ctx/2, so you wouldn't see it here. My mistake.
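For reference, here is a rough paraphrase of the context-swap logic linked above (simplified; the names, types and offsets are approximations, not the exact main.cpp code): once the context is full, the first n_keep tokens are kept and roughly half of the remaining context is re-fed, and re-evaluating those tokens is what causes the visible pause at each swap.

```cpp
#include <vector>

using llama_token = int;   // stand-in for the real typedef

static void maybe_swap_context(int & n_past, int n_ctx, int n_keep,
                               std::vector<llama_token> & embd,
                               const std::vector<llama_token> & last_n_tokens) {
    if (n_past + (int) embd.size() > n_ctx) {
        const int n_left = n_past - n_keep;

        // keep the first n_keep tokens of the original prompt...
        n_past = n_keep;

        // ...and re-insert about n_left/2 of the most recent tokens so that
        // generation continues with the tail of the conversation in context
        embd.insert(embd.begin(),
                    last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(),
                    last_n_tokens.end()   - embd.size());
    }
}

int main() {
    int n_past = 126;                                   // context almost full
    std::vector<llama_token> embd = {1, 2, 3, 4};       // tokens queued for the next eval
    std::vector<llama_token> last_n_tokens(128, 0);     // ring of the last n_ctx tokens
    maybe_swap_context(n_past, /*n_ctx=*/128, /*n_keep=*/13, embd, last_n_tokens);
    return 0;
}
```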

cyyynthia commented 1 year ago

I don't think the context swap is to blame. It happens once the context is full (if I understand correctly), which in my example should never happen (since I start with 13 prompt tokens and generate 2000 tokens). Only once the first swap has been performed, since we "free" half of the context space, would your assumption hold.

Even if a swap happened, my understanding is that it would cause a freeze in generation and then go back to working as usual, with no impact on the token time (or if anything a slight decrease, based on the observation that the more the context fills, the slower it gets).


To prove this, I ran the same test with -n 1024 -c 128. I did indeed observe pauses in the generation (as expected), and here's the graph (red lines mark a context swap): (Plot: per-token times with -n 1024 -c 128)

Swaps occurred after 115, 178, 241, 304, 367, 430, 493, 556, 619, 682, 745, 808, 871, 934 and 997 tokens.

Seltsamsel commented 1 year ago

Yes, you're right; I just realised too that my comment makes no sense. But it's interesting that the context swap actually resets the increase in time per token.

cyyynthia commented 1 year ago

I've analyzed data for 2000 tokens on 6 different commits and established that 483bab2e3d4a868fe679d8bb32827d2a4df214dc is definitely a HUGE regression for large prompts. The run seg-faults mid-run, but the regression in token times for large inputs is extremely visible already and confirmed by 4870e455b3653f7d7769fa5772b2c90ffad088df, jumping from 415s runtime to 1673s, a 4x increase!

The latest versions do close to nothing to compensate for the massive performance degradation this caused: (Plot: per-token times across the 6 commits)

You can see that at the beginning the newer versions are faster (as expected with all the great work that has been going on in optimizing SIMD thingies), but they quickly get unbearably slow compared to earlier versions.

Raw CSV: token_times.csv

edit: note: for commits which didn't have --ignore-eos, I've patched the code to make it behave like current versions with --ignore-eos enabled, to make sure it generates 2000 tokens.

edit 2: My data also seems to suggest this is the only major regression; on my machine (on Linux), f5a77a629bd0f37ae1696747633ab42a5530ec15 doesn't come with a regression, at least not one as bad as the one shown above.

KASR commented 1 year ago

@cyyynthia out of curiosity, if you use the latest commit and remove ggml_cpy from the V_trans (i.e. revert the changes from #439 ) do you then retrieve the original timings?

Seltsamsel commented 1 year ago

@cyyynthia out of curiosity, if you use the latest commit and remove ggml_cpy from the V_trans (i.e. revert the changes from #439 ) do you then retrieve the original timings?

Unfortunately, that's not possible (at least not easily) because of ecbe466a364876927994e2f1ec14f4d82301d201. I assembled something similar here, which doesn't even contain most of the recent AVX optimizations and it's still way faster for larger contexts.

cyyynthia commented 1 year ago

@KASR I actually tried, but because ggml.c moved on with the change and removed some code paths that were now unused, I couldn't get it to work. I tried bringing back some code bits in a sort-of naive manner but got broken output, so I assumed I'd been doing something wrong and my results might not be valid.

If someone has a working patch to test, I'd love to give it a shot on my machine and compare the results, but otherwise I'm afraid I can't test that out. That being said, the commit right before the change (404e1da38ec8025707031a8027da14dc1590f952) is tested and on the graph.