Screenshot on the left has FMA and SSE3 enabled, is that the one that is faster? Try building both with the same flags.
No, that's what I'm saying. The left is optimized llama.cpp, the right is unoptimized gpt4all. The unoptimized gpt4all is significantly faster than the optimized llama. So there's something wrong.
Without bisecting the exact commit that introduced a performance regression it is hard to do much about it. I suspect that it happened when the code for transposed matrix multiplication was replaced with a copy.
First I'm just trying to check if someone else can reproduce the behavior. I def wanna use llama version because it has more features haha
Double check you have built llama.cpp in Release.
Personally I had a small performance regression (13B model) over the last 14 days, from 207 ms to 238 ms, but it's no biggie.
I see significant slowness on Windows when comparing the latest llama.cpp (30B) with the gpt4all LoRA.
I also notice performance drops on x86-64 Linux, and it also uses a lot more memory than before. I compiled the project following the instructions in the README.md.
I am using the same language model for both executables
@MillionthOdin16 How did you do that? I couldn't get this to work yesterday. See this issue
Did you try it in the last few hours? There were some commits a couple of hours ago that made it easy. Look at the readme section that was added for GPT4All.
It kinda drives me crazy that the one dude forked llama.cpp and then stopped maintaining it, because other repos are now forking his outdated repo.
What are gpt4all and alpaca.cpp, and what do they change relative to llama.cpp? I arrived by parachute into this conversation, so I don't really know the context; I apologize.
We should definitely look into this, as this definitely shouldn't be the case. I'm pretty confident that enabling the optimizations didn't do it, since the perf was pretty well researched when we did that in #375. If performance got lost and memory usage went up somewhere along the way, we'll need to look at where this happened. If it doesn't run well, everyone is just going to fork from an older point in time instead.
@BrunoIsaac27
alpaca.cpp is pretty much a fork of an older llama.cpp version (which is apparently faster), but nothing much is really changed except a few default variables.
gpt4all is a fork from the alpaca.cpp fork with modifications tailored specifically to the gpt4all model.
Someone on Windows should try and bisect to see where this alleged degradation happens - I don't have a Windows machine. I'm pretty sure there is no performance degradation on macOS and Linux. The only slowness introduced, as @slaren mentioned, was the removal of the transposed ggml_mul_mat path, which led to about 10% performance loss during single-token inference (i.e. after prompt ingestion). But this should have been compensated by the various updates in the SIMD code.
I guess the ggml_mul_mat change could have had a bigger impact on the prompt inference performance, but only if you are not linking to BLAS. In that case, the solution is to link to BLAS - you will gain significant speed-up during prompt ingestion.
But overall, we need some more specific info about your setup and timing numbers to be able to pinpoint the problem.
alpaca.cpp / gpt4all was forked specifically at this point: https://github.com/ggerganov/llama.cpp/commit/9b4a15b17d8395eb075379b140fcd0b0283f4ef6 while at time of writing, we are here: https://github.com/ggerganov/llama.cpp/commit/9cbc404ba6699a9ba4925ea25a60552b13491c7a
There are exactly 180 commits between then and now, obviously too many to test manually. Here is the full list of commits for reference:
I was thinking what we'd need is a script which: start -> create directory -> git clone and build commit id -> log the performance of some runs to a file -> remove directory -> loop to start
That would narrow it down to exactly where it happened.
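A rough sketch of that loop in Python (the commit list, model path, binary name and build steps here are assumptions; older commits name the binary differently and may need tweaks):

```python
import subprocess

REPO = "https://github.com/ggerganov/llama.cpp"
MODEL = "/path/to/ggml-model-q4_0.bin"        # assumed model location
COMMITS = ["9b4a15b", "074bea2", "9cbc404"]   # fill in the commits to test

def bench(commit: str) -> list[str]:
    """Clone and build one commit, run it once, return the printed timing lines."""
    subprocess.run(["git", "clone", REPO, "bench"], check=True)
    subprocess.run(["git", "checkout", commit], cwd="bench", check=True)
    subprocess.run(["make"], cwd="bench", check=True)
    out = subprocess.run(
        ["./main", "-m", MODEL, "-t", "8", "-n", "128", "-p", "Hello"],
        cwd="bench", capture_output=True, text=True)
    subprocess.run(["rm", "-rf", "bench"], check=True)
    # llama.cpp prints its timings to stderr; keep every "... time = ..." line.
    return [line for line in out.stderr.splitlines() if " time =" in line]

with open("results.log", "w") as f:
    for c in COMMITS:
        f.write(c + "\n" + "\n".join(bench(c)) + "\n\n")
```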
Unfortunately the GitHub Actions runners are very limited; otherwise it would be easy to add a performance test to the test suite, run on every pull request, to nip these issues in the bud before the situation devolves to the point where we have to go bug hunting through a list of 180 commits.
I'm thinking it would be a good thing to measure performance more accurately.
Wall-clock time is not good enough, especially if there are other things happening on the system. You have to let it run for a long time to get a good enough average.
Another pitfall may be that a test suite causes downclocking of the processor. So the first test will get a cold processor running at full speed, and the later tests will have to run on a hot, slow processor.
Maybe getrusage(RUSAGE_THREAD, ...) would be useful here to get per-thread usage information, which could then be collected by the thread pool manager? Of course you could use RUSAGE_SELF to get the data for all threads combined, but maybe we want to see if all threads get used equally. We might also look into RDTSC and whatever the equivalent of getrusage is on Windows.
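Not llama.cpp code, just a tiny illustration of the difference between the two counters using Python's resource module on Linux (the real instrumentation would presumably live in the ggml thread pool, in C): RUSAGE_THREAD reports only the calling thread, while RUSAGE_SELF aggregates the whole process.

```python
import resource
import threading
import time

def worker(seconds: float) -> None:
    """Burn CPU for a while, then report this thread's own CPU time."""
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        pass
    ru = resource.getrusage(resource.RUSAGE_THREAD)   # Linux-only
    print(f"worker thread user time: {ru.ru_utime:.2f}s")

t = threading.Thread(target=worker, args=(1.0,))
t.start()
t.join()

ru = resource.getrusage(resource.RUSAGE_SELF)         # all threads combined
print(f"whole process user time:  {ru.ru_utime:.2f}s")
```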
@sw Very good points. "Preheating" a processor before a set of runs would ensure stability across the other sets of runs. Then again, the first-run advantage quickly dissipates when doing a larger set of runs; especially on desktop computers with good cooling this isn't much of a problem unless you happened to just fire up your PC and instantly start a test. There is also the problem, especially on Windows (beginning with Windows 8, each new iteration noisier than the one before), of the OS being so noisy that it greatly affects performance, so a large set of runs is required to achieve anything resembling an accurate average.
That would be more about perf testing generally though, as in this case where the perf drop is significant enough to be able to be visually inspected, probably just a run or three would be enough to narrow down where it happened.
I think that is the most feasible way to go about this, since manually trying to figure out where the problem lies in the now much-changed codebase would be harder than just letting a script run and having the perf test tell you.
edit: the downclocking part you linked is exactly the thing I was trying to remember. I actually posted something earlier about AVX512 sometimes having worse performance for some workloads, especially on the earlier Intel processors which first introduced the instruction set, but couldn't remember the exact cause; that was definitely it. Whatever improvement AVX512 brought to the table was offset by the downclocking, so the overall performance actually decreased.
I was thinking what we'd need is a script which: start -> create directory -> git clone and build commit id -> log the performance of some runs to a file -> remove directory -> loop to start
This can be done as a binary search too, the git bisect command should help with that.
Very interesting, I actually had no idea such a command existed. Seems useful in many cases. However, in this case, since there was also the format change somewhere in the middle, it'd simply be easier to go about it sequentially until you run into the commit where the format was changed, change the model only once, and proceed. Or run two binary searches, one for the commits before and one for those after the model change; that is also an option.
To be honest, once you've already gone to the trouble of setting the script up and have some compute time set aside for it, a sequential run would also give a log of the performance deltas across all the commits, showing which ones increased performance and which ones decreased it, as it might not be any single commit that's causing this but a pile-up of smaller decreases here and there. There have obviously been steps forward but also steps back in the bunch.
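If anyone goes the bisect route, git bisect run can automate it once a test script maps the measured speed to an exit status (0 = good, 1 = bad, 125 = skip this commit). A rough Python sketch - the threshold, model path, binary name and commit range are assumptions, and you'd still have to handle the model format change somewhere in the middle:

```python
#!/usr/bin/env python3
# Assumed usage:
#   git bisect start 9cbc404 9b4a15b      (bad commit first, then good)
#   git bisect run python3 bisect_perf.py
import re
import subprocess
import sys

THRESHOLD_MS = 220.0                        # assumed acceptable ms per token
MODEL = "models/7B/ggml-model-q4_0.bin"     # assumed model path

if subprocess.run(["make"]).returncode != 0:
    sys.exit(125)                           # doesn't build: tell bisect to skip

run = subprocess.run(
    ["./main", "-m", MODEL, "-t", "8", "-n", "64", "-p", "Hello"],
    capture_output=True, text=True)
if run.returncode != 0:
    sys.exit(125)                           # broken commit: skip as well

# Pick the generation timing line (skip the sample and prompt eval lines);
# old builds print "predict time ... ms per token", new ones "eval time ... ms per run".
per_token = None
for line in run.stderr.splitlines():
    if "sample" in line or "prompt eval" in line:
        continue
    m = re.search(r"([\d.]+) ms per (?:run|token)", line)
    if m:
        per_token = float(m.group(1))
if per_token is None:
    sys.exit(125)
sys.exit(0 if per_token <= THRESHOLD_MS else 1)
```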
I've written a small Python script to benchmark the token time as a function of the number of threads. I've added the script as an attachment if anyone wants to try it (--> benchmark_threads.txt, I had to change the extension in order to upload). It could be useful for benchmarking performance across different versions. If not, just ignore this message :)
Below you can see the result from my pc. I'm using windows 10, the typical system information looks like:
system_info: n_threads = xx/ 36 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
I didn't go up to 36 threads; you can see the results below. The script runs each prompt 5 times and plots the average token times. For some reason the timings go up at around 18 threads (i.e. the number of physical cores). I will try it again later to see how robust the timings are.
Feel free to propose suggestions to make the benchmark more reliable.
Edit: I've updated the plot so that it includes the eval and prompt eval as well (don't mind the typos). It's really strange why the performance drops around 18 threads (i.e. the number of physical cores) and afterwards drops again...
@KASR at the moment it seems that you are only measuring the prompt eval time, I would recommend considering the prompt eval time and the regular (generation) eval time separately. You can safely assume that 1 run = 1 token in the eval time.
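For anyone scripting this, a small example of pulling the two numbers apart from the printed timings (the sample lines are taken from results posted further down in this thread; older builds only print a combined predict time, so this only applies to recent ones):

```python
import re

# Example stderr lines from a recent build (values from a run posted below).
log = """\
llama_print_timings: prompt eval time =  1990.05 ms /  14 tokens (  142.15 ms per token)
llama_print_timings:        eval time = 131235.49 ms / 511 runs   (  256.82 ms per run)
"""

timings = {}
for line in log.splitlines():
    m = re.search(r"(prompt eval|eval) time.*?\(\s*([\d.]+) ms per", line)
    if m:
        timings[m.group(1)] = float(m.group(2))

print(timings)  # {'prompt eval': 142.15, 'eval': 256.82}
```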
@KASR That looks good. If you were to do a similar graph where the x-axis was not threads but commits, with x = 0 being https://github.com/ggerganov/llama.cpp/commit/9b4a15b17d8395eb075379b140fcd0b0283f4ef6 and x = 180 being https://github.com/ggerganov/llama.cpp/commit/9cbc404ba6699a9ba4925ea25a60552b13491c7a, it would solve our problem right then and there of figuring out where the increases and decreases happened.
Wow, thanks for the critical thinking guys. You've mentioned some pretty interesting points. It's pretty crazy to see what affects performance, and looking at some of the discussions it seems like there are things like Intel perf cores that can have a significant impact (although not in my case).
I can definitely help with testing and performance metrics, I just need to make a script that'll get reliable builds between versions for my environment. It's pretty picky and often needs tweaks to make the build succeed.
One of the differences/struggles right now is that the current llama.cpp gives much more performance metric info than the build used in gpt4all, so it's hard to see the specific timings in the older gpt4all version. Apart from that, I'd want to make sure that the info I'm collecting while running builds for specific commits is actually the info that will help us.
But overall, we need some more specific info about your setup and timing numbers to be able to pinpoint the problem.
As for my build and build process, I have a Ryzen 3900X (12c, 24t) and use CMake and Ninja to build my executables. I've also built with BLAS linked, but haven't seen a noticeable difference while using the library vs not. Other than that I use avx avx2 maxv maxv2 f16c sss3 on Release. I usually run with -t 8, and the models I use are the 4-bit quantized 7B.
I guess the ggml_mul_mat change could have had a bigger impact on the prompt inference performance, but only if you are not linking to BLAS. In that case, the solution is to link to BLAS - you will gain significant speed-up during prompt ingestion.
Where should I expect to see the performance increases when I'm running with BLAS? Is it during larger completions after the prompt is loaded?
I could also do things in WSL2, but I'm not sure about the performance impacts, which is why I currently don't use it. If you think it would be better let me know.
Again, awesome job guys! You're having a huge impact on making these models accessible to normal people.
@MillionthOdin16 For Windows definitely the most common configuration (4 or 8 threads, AVX=yes, SSE3=yes, F16C=yes, AVX2=yes , AVX512=no , BLAS=no , WSL2=no) would be the best to base the benchmarks on. Obviously if you want and have the time for it, more is always better.
The most important thing to know would be the performance data between commits starting from https://github.com/ggerganov/llama.cpp/commit/9b4a15b17d8395eb075379b140fcd0b0283f4ef6 and ending to https://github.com/ggerganov/llama.cpp/commit/9cbc404ba6699a9ba4925ea25a60552b13491c7a .
That is the thing which will help in understanding where the decreases happened. Since there have been many commits with optimizations and performance increases, it makes no sense that gpt4all/alpaca/llama-9b4a15b is faster; it should be slower because they don't have any of the recent optimizations. That leads to only one conclusion: there must have been significant decreases at some points in the timeline. It can be something not easily seen, like the compiler dropping inlining because of some change (inline and force-inline aren't the same, and the compiler can even drop force-inline), or a mistake somewhere, we can't really know. Only data can save us. :smile:
I've been able to observe some performance degradation on Arch Linux as well. I didn't have time to look for the precise commit yet, but I found the potentially helpful information that the degradation seems to have happened after the ggml format migration, which may help simplify the exploration. I think it would be nice if someone else could confirm this and make sure this isn't something that happens for me only.
I'll keep doing some exploration, but here are the numbers I observed so far:
System info: Arch Linux - CPU: Intel Core i7 7700K. Compiled using make.
n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0
All of the runs were run with the same parameters, using the 7B model (Q4_0)
I only use 4 threads, as 8 threads tend to cause performance degradation for me.
ed3c680
llama_print_timings: load time = 3013.13 ms
llama_print_timings: sample time = 277.93 ms / 512 runs ( 0.54 ms per run)
llama_print_timings: prompt eval time = 1990.05 ms / 14 tokens ( 142.15 ms per token)
llama_print_timings: eval time = 131235.49 ms / 511 runs ( 256.82 ms per run)
llama_print_timings: total time = 137390.49 ms
074bea2 (first commit using new format)
main: load time = 1953.12 ms
main: sample time = 248.32 ms
main: predict time = 102235.23 ms / 194.73 ms per token
main: total time = 105018.24 ms
9b4a15b
main: load time = 1385.53 ms
main: sample time = 247.39 ms
main: predict time = 101518.52 ms / 193.00 ms per token
main: total time = 103729.24 ms
This might come in handy for tech-savvy lads here who need a slight performance boost: https://github.com/ggerganov/llama.cpp/pull/295. You may need to modify the code a bit to make it work on the latest commits, however.
Thank you @cyyynthia, something is definitely up here.
Interestingly, the new format made load times go up 40% but the sampling and predict times stayed the same (within margin of error)
I've only now woken up to this, since you don't tend to notice marginal changes (in general, in anything) when you've always been on the latest version; the performance degraded gradually without me noticing. But obviously now everything is much slower - loading, sampling and prompt evaluation - and this is a high priority issue. Anyone trying out gpt4all/alpaca.cpp vs current-gen llama.cpp will find it painfully obvious, while for someone developing incrementally this has gone by unnoticed.
After a bit more digging, #439 seems to be a very clear culprit of performance drops: eval time goes from 175ms (better than before!) @ 404e1da to 244ms @ 483bab2.
It seems other timings do fluctuate in some intriguing ways, with increased loading times and sample times. I'll try to put together a test script and plot the evolution of these values over time on my machine.
I've done a first test, and I've already gathered some interesting data. I ran my script on 18 commits from the range cited earlier, skipping 10 commits every time. Doing it on the full range will take a while, so I wanted to see what I could get without too much precision.
Here are the graphs I've gathered, and the raw csv data if you want to explore it some more. I'll run it on all 180 commits later, probably tomorrow.
@cyyynthia
Thank you very much for this effort and the detailed analysis.
The memory increase is expected and easy to explain - see https://github.com/ggerganov/llama.cpp/pull/473#issuecomment-1483399313
The eval time increase is surprising. I am mostly running on Apple Silicon and it looks like I have overlooked this performance degradation. Will look further for ways to resolve it
I figured the memory increase was expected, but I found it quite interesting nonetheless. The load times graph is also quite interesting:
If anyone is interested in running the test on their machine too, here is the script I used (beware, it's written in NodeJS because I'm not very much of a Python dev and it was much faster for me to put together, and it might need slight tweaks depending on your machine). Simply put the script in your llama.cpp folder (next to the Makefile) and run it.
https://gist.github.com/cyyynthia/43784451936e2a608566c42b0bacceac
note: the script now supports all 3 formats (it previously didn't work with the ggjt format). It expects them to be placed in specific folders; check the script for more details.
@cyyynthia fyi I'm also trying to run all the commits, I've converted the code to python and extended it a bit so that it works with windows/cmake/vs. I run each commit 5 times to obtain average timings. I will try to post some results if it finishes without errors.
I still have to figure out something: at which commit do you switch from llama.exe to main.exe?
LLAMA_PATH = 'C:/DATA/TestLLama/CommitTest'
LLAMA_EXE_0 = os.path.join(LLAMA_PATH, 'build/Release/llama.exe')
LLAMA_EXE_1 = os.path.join(LLAMA_PATH, 'build/bin/Release/main.exe')
Or did you use another trick to avoid the folder/name change?
edit: it got stuck after 35 commits at 2456837, I will investigate later what's happening...
The results thus far are gathered in the csv --> result.csv
Or did you use another trick to avoid the folder/name change?
I started collecting data from commit 2d64715ad475f192a4004a52d134c67ccb6f44ad; anything before is skipped, and it was already named main at that point (at least, using make).
it got stuck after 35 commits
24568371ae0d7caf85164abe4753f36a7dba0288 and 5c19c70ba631a8f5d54feb6634e0eea178911a84 are broken commits and do not work (infinite loop, which triggered a timeout I added to my script). You'll also run into 3 compile errors, at f5a77a629bd0f37ae1696747633ab42a5530ec15, 928480ef5b7b03d7a07e98286aebe3d8b24457d9 and ae44e23ee36c02da0e37ab508a4b473ace724f8e. I updated my script locally before running it so it would be resilient to these issues (didn't reflect that on the gist).
I also adapted my script to skip commits which were not related to ggml/llama/the example, to speed up the process - something I haven't reflected yet on my gist.
Running 5 times and taking an average is a great idea. I haven't done that for my own runs because it would take forever and my poor CPU has been suffering enough. That being said, I ran it twice, and I must say some results seem to indicate it'd have been much better to do 5 runs and take the average.
Here's my 2 runs, from 2d64715ad475f192a4004a52d134c67ccb6f44ad to ee0c40dd6de8c3c658ae43199939ef40bb1cf408 (excluding commits which didn't introduce any code change):
result_first_run.csv result_second_run.csv
I also updated my gists with the updates I listed above (handle llama run issues, skip irrelevant commits).
2456837 and 5c19c70 are broken commits and do not work (infinite loop, which triggered a timeout I added to my script). You'll also run into 3 compile errors, at f5a77a6, 928480e and ae44e23. I updated my script locally before running it so it would be resilient to these issues (didn't reflect that on the gist).
Good to know! I just skipped a few commits and let the loop continue at 074bea2 since this was included in your first run. As a result, I seem to have jumped over the 2 faulty commits.
The loop successfully passed the 3 commits where you had the compile errors; this might be an OS/compiler/etc. issue...
So far, the loop is at 119 / 193 commits.
I will restart the loop once it completes: I had to restart a couple of times (continuing at the commit where it failed) to account for special cases (e.g. llama.exe -> main.exe etc.). Even though I'm looping 5 times over each commit, I would like to run the loop in a single pass while the PC isn't doing anything else, so that we have consistent timings.
system_info: n_threads = 8 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
On Ubuntu 22.04.2 with gcc 11 or clang 15: for llama.cpp + llama 30B I see very slow responses - long answers, and each word takes 1-2 seconds. (At the same time, the startup with mmap is extremely fast.) But for alpaca.cpp 13B the answers are much faster.
@anzz1 @cyyynthia @ggerganov here are the results of the run over 191 commits; as the title indicates, I've evaluated each commit 10 times to obtain the average, max and min token evaluation time. You can find the full results here --> Benckmark_commits_results.csv
I've uploaded the script as a gist -->benchmark_commits_llama_cpp.py
For completeness, I'm using Windows 10, CMake 3.24.3, VS22. I'm using an Intel Xeon W-2295, but limited the threads to 6 to be more representative. The benchmark was done using Python, each time deleting the build folder, switching commit, and rebuilding. 7 commits failed to run, which I excluded both from the graph and from the results in the csv file.
I've used the following options for the inference:
-s 147852369
-t 6
-p 'Here is a long story about how programming came to be: '
-n 128
-c 1024
--top_k 40
--top_p 0.95
--repeat_last_n 64
--repeat_penalty 1.1
Below you can see the plot for the token timings:
edit @jart below you can see the average loading times on my system:
For my test on Ubuntu:
Here you can find a script for git bisect: llama.cpp.zip
@KASR Thanks for your effort!
Is the sum of the 4 partial times supposed to be roughly equal to the total time? Because there's some variation there in your data. The three large steps are f5a77a6, 29b7baa, 78ca983.
@KASR Very nice work! Your investigation ends right before the latest AVX optimizations:
I expect the timings to be back to normal after these changes. Or probably after we merge #654
@sw After the mmap changes, the load time is incorrect: currently, the reported load time includes not only the page faults, but also the prompt eval time. So effectively, you get a negative number since the prompt eval time has been accounted for twice. We have to fix this.
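As a made-up illustration: if the real page-fault work takes 1 s, prompt ingestion 2 s, generation 5 s and sampling 0.1 s (8.1 s total), the log would report load time = 3 s (page faults plus prompt eval) and prompt eval time = 2 s, so summing the printed components gives 10.1 s and the difference from the total comes out at about -2 s.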
@sw I've uploaded the script that i used as a gist ( benchmark_commits_llama_cpp.py )
The timings were simply extracted from the llama printings. But indeed, when you add up the printed timings there is still a bit missing.
e.g. below: 8922.26 - ( 5603.04 + 1169.53 + 21.99 + 1836.83) = 290.87 ms
https://github.com/ggerganov/llama.cpp/commit/f5a77a629bd0f37ae1696747633ab42a5530ec15 and https://github.com/ggerganov/llama.cpp/commit/483bab2e3d4a868fe679d8bb32827d2a4df214dc seem to be the major culprits in my case, based on locally checking the millisecond timings before/after the above-mentioned commits.
The slowdown may seem minimal with a small context, but as the conversation grows longer, until hitting the max (2048 context?), the slowdown introduced in the above commits also increases, in my case at least.
I've noticed this as well. Tokens per second definitely drops as the conversation grows.
f5a77a6 and 483bab2 seem to be the major culprits in my case, based on locally checking the millisecond timings before/after the above-mentioned commits.
The slowdown may seem minimal with a small context, but as the conversation grows longer, until hitting the max (2048 context?), the slowdown introduced in the above commits also increases, in my case at least.
Just to make sure, you're saying that before these commits there was no slowdown for larger contexts? Or just a less significant slowdown?
Tokens per second definitely drops as the conversation grows.
I've been observing this too and I got curious, so I did some measurements to see how bad the slowdown was. I've plotted all the token times for -n 2000 (2048 ctx). To export the timings, I modified llama.cpp line 1003 and made it print the token time to stderr.
This graph is a single run in a slightly noisy environment, so take it with a grain of salt. I'm definitely intrigued by the sudden bump in token time in the middle, and by the acceleration of the performance degradation. I'll try to run more tests about it, and try to extract values from older versions as well.
edit: this test was run on 437e77855a54e69c86fe03bc501f63d9a3fddb0e
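A rough sketch of how such a dump could be plotted (assuming one time in milliseconds per line, e.g. the patched binary's stderr redirected to a file, and matplotlib installed; not the exact script used here):

```python
import matplotlib.pyplot as plt

# Assumed input: one per-token time (ms) per line, e.g. ./main ... 2> token_times.txt
with open("token_times.txt") as f:
    times = [float(line) for line in f if line.strip()]

# Raw per-token times plus a simple moving average to smooth out system noise.
window = 25
avg = [sum(times[max(0, i - window + 1):i + 1]) / len(times[max(0, i - window + 1):i + 1])
       for i in range(len(times))]

plt.plot(times, alpha=0.3, label="per-token time (ms)")
plt.plot(avg, label=f"moving average (window={window})")
plt.xlabel("token index")
plt.ylabel("ms per token")
plt.legend()
plt.show()
```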
I'm definitely intrigued by the sudden bump in token time at the middle
That could be context swapping, see:
However, if I understand it correctly, it should only happen about once every n_ctx/2 tokens and not for every token, which is what the graph implies. Anyways, for your example the location should be around 1024, which fits perfectly.
Edit: nvm, context swapping should happen first at n_ctx, not n_ctx/2, so you wouldn't see it here. My mistake.
I don't think the context swap is to blame. It happens once the context is full (if I understand correctly), which in my example should never happen (since I start at 13 tokens of prompt and generate 2000 tokens). Only once the first swap has been performed, since we "free" half of the context space, would your assumption hold.
Even if a swap happened, my understanding is that it would cause a freeze in generation and then go back to working as usual, with no impact on the token time (or if anything a slight decrease, based on the observation that the more the context fills, the slower it gets).
To prove this, I ran the same test with -n 1024 -c 128. I indeed did observe pauses in the generation (as expected), and here's the graph (red lines mean a context swap):
Swaps occurred after 115, 178, 241, 304, 367, 430, 493, 556, 619, 682, 745, 808, 871, 934 and 997 tokens.
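Those positions roughly match a toy model of the swap: the first one lands when the 128-token context fills (13 prompt + 115 generated), and later ones come about n_ctx/2 tokens apart, since roughly half the context survives each swap. A quick sketch of that model (the exact reset point is an assumption, hence the slight drift versus the observed 63-token spacing):

```python
n_ctx, n_prompt, n_gen = 128, 13, 1024

swaps, n_past = [], n_prompt
for tok in range(1, n_gen + 1):
    n_past += 1
    if n_past >= n_ctx:       # context full: a swap happens here
        swaps.append(tok)
        n_past = n_ctx // 2   # assumption: about half the context is kept

print(swaps[:5])  # [115, 179, 243, 307, 371] vs. observed 115, 178, 241, 304, 367
```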
Yes, you're right, I just realised too that my comment makes no sense. But it's interesting that context swap actually resets the increase of time per token.
I've analyzed data for 2000 tokens on 6 different commits and established that 483bab2e3d4a868fe679d8bb32827d2a4df214dc is definitely a HUGE regression for large prompts. The run seg-faults mid-run, but the regression in token times for large inputs is extremely visible already and confirmed by 4870e455b3653f7d7769fa5772b2c90ffad088df, jumping from 415s runtime to 1673s, a 4x increase!
The latest versions do close to nothing to compensate for the massive performance degradation caused:
You can see at the beginning, newer versions are faster (as expected with all the great work that has been going on in optimizing SIMD thingies), but quickly get unbearably slow compared to earlier versions.
Raw CSV: token_times.csv
edit: note: for commits which didn't have --ignore-eos, I've patched the code to make it behave like current versions with --ignore-eos enabled, to make sure it generates 2000 tokens.
edit 2: My data also seem to suggest this is the only major regression, and on my machine and on Linux f5a77a629bd0f37ae1696747633ab42a5530ec15 doesn't come with a regression, at least not as bad as the one shown above.
@cyyynthia out of curiosity, if you use the latest commit and remove ggml_cpy from the V_trans (i.e. revert the changes from #439 ) do you then retrieve the original timings?
Unfortunately, that's not possible (at least not easily) because of ecbe466a364876927994e2f1ec14f4d82301d201. I assembled something similar here, which doesn't even contain most of the recent AVX optimizations and it's still way faster for larger contexts.
@KASR I actually tried, but because ggml.c moved on with the change and removed some code paths that were now unused, I couldn't get it to work. I tried bringing back some code bits in a sort-of naive manner but I got broken output, so I assumed I've been doing something wrong and my results might not be valid.
If someone has a working patch to test, I'd love to give it a shot on my machine and compare the results, but otherwise I'm afraid I can't test that out. That being said, the commit right before the change (404e1da38ec8025707031a8027da14dc1590f952) is tested and on the graph.
Expected Behavior
I am comparing the performance of two executables: llama.cpp (current version) and the default gpt4all executable (which uses a previous version of llama.cpp). I am using the same language model for both executables, and I expect the current version of llama.cpp (which is built specifically for the hardware) to perform at least as fast as the default gpt4all executable.
Current Behavior
The default gpt4all executable, which uses a previous version of llama.cpp, performs significantly faster than the current version of llama.cpp. Despite building the current version of llama.cpp with hardware-specific compiler flags, it consistently performs significantly slower when using the same model as the default gpt4all executable.
Environment and Context
I am running the comparison on a Windows platform, using the default gpt4all executable and the current version of llama.cpp included in the gpt4all project. The version of llama.cpp is the latest available (after compatibility with the gpt4all model was added).
Steps to Reproduce
Here's some context/config when I'm doing the runs:
(left panel is latest llama.cpp, right panel is gpt4all build)
This is the older version that gpt4all uses (with some tweaks): https://github.com/zanussbaum/gpt4all.cpp
*To quickly test the difference yourself you can use the gpt4all default binaries here: https://github.com/nomic-ai/gpt4all/tree/main/chat