ggerganov / llama.cpp

Bug: Speed regression from early this year #8945

Open IndustrialOne opened 1 month ago

IndustrialOne commented 1 month ago

What happened?

See https://github.com/nomic-ai/gpt4all/issues/2204. Since I upgraded to gpt4all 2.6.2 (which updated llama.cpp), my speed dropped from 3-4 t/s to 1 t/s. I am getting a third of the speed across all models. Why? https://github.com/ggerganov/llama.cpp/compare/6b0a7420d...fbf1ddec6

This is a big problem that makes LLMs unusable now. I just got a new computer and 3-4 t/s was already slow but manageable. 1 t/s is just unusable. Is there a good reason why the speed regressed so much?

Name and Version

https://github.com/ggerganov/llama.cpp/compare/6b0a7420d...fbf1ddec6

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

JohannesGaessler commented 1 month ago

If you want to have any chance at all of getting this addressed, you will need to post which hardware you are using and do a git bisect to identify the exact commit that caused the performance regression. If you do not do both of these things, this issue will go nowhere, because it is otherwise not feasible for developers to work on it.

IndustrialOne commented 1 month ago

My hardware is in the link. Using an i5 10400, Windows 10 in a VM. I don't know how to do a git bisect. If it's to do with comparing code, I wouldn't understand diddly shit anyway.

Why isn't it feasible for the developers to figure out why their product is suddenly 3x slower? It's a pretty serious thing to ignore.

Are you telling me I'm the only one who experienced this speed regression?

JohannesGaessler commented 1 month ago

I don't know how to do a git bisect. If it's to do with comparing code, I wouldn't understand diddly shit anyway.

You don't have to understand any code; all you have to do is run a binary search over the llama.cpp commit history to nail down the exact commit that is causing issues for you. The basic workflow is to compile llama.cpp at the commit git checks out, test it, tell git bisect whether the result is good or bad, and repeat until the first bad commit is identified. A sketch of the commands is below.
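A minimal sketch of that loop, assuming a CMake build and using the two commits from the compare link above as endpoints (older commits may need a plain `make` instead):

```sh
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Mark the known-bad (slow) and known-good (fast) endpoints;
# git then checks out a commit roughly halfway between them.
git bisect start
git bisect bad  fbf1ddec6
git bisect good 6b0a7420d

# At each step: build, run your usual prompt, and note the t/s...
cmake -B build
cmake --build build --config Release

# ...then report the result and let git pick the next commit:
git bisect good    # or: git bisect bad

# Repeat until git prints the first bad commit, then clean up:
git bisect reset
```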

Why isn't it feasible for the developers to figure out why their product is suddenly 3x slower? It's a pretty serious thing to ignore.

Are you telling me I'm the only one who experienced this speed regression?

I'm telling you that among the llama.cpp developers, no one has experienced this same performance regression, or it would have already been fixed. If the problem is indeed caused by llama.cpp, we therefore need someone who does experience this performance regression to pin down when it started happening.

IndustrialOne commented 1 month ago

I'm telling you that among the llama.cpp developers, no one has experienced this same performance regression, or it would have already been fixed.

Interesting.

I am using GPT4ALL, which uses llama.cpp, to interact with my LLMs, and all I know is that the dramatic slowdown happened when upgrading from GPT4ALL 2.6.1 to 2.6.2. The author blamed llama.cpp, which is why I came here to report it. I have no clue how to compile this into the app to test it.

You say it would've been fixed immediately, but that's not obvious to me. Maybe the devs have a justifiable reason for the slowdown? 3x slower because it's 3x improved? Is it because they optimized the code for GPU at the expense of CPU (I'm only using CPU)? I have no clue.

JohannesGaessler commented 1 month ago

I have no clue how to compile this into the app to test it.

Then I guess you'll have to wait for someone else to provide the necessary debugging information.

IndustrialOne commented 1 month ago

Hold up, I tested LM Studio, which I assume uses the latest llama.cpp, and it's a lot faster than any GPT4ALL version. I guess the problem wasn't with llama.cpp after all.

jeroen-mostert commented 1 month ago

Or the problem was with llama.cpp, but has long since been fixed. Many third-party projects use outdated versions of llama.cpp (or "curated" ones, to be more charitable) with their own selection of build options, not to mention the way they invoke it at runtime. As llama.cpp also builds its own binaries, it's always a good idea to test those directly, even if you don't get the nicest experience that way. A problem that can be demonstrated with the latest version also has a much greater chance of getting reproduced and fixed.
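For example, a minimal sketch of such a direct test, assuming a release build from https://github.com/ggerganov/llama.cpp/releases and a placeholder model path (the thread count matches the i5 10400's 6 physical cores):

```sh
# Benchmark raw CPU throughput with the bundled llama-bench tool.
#   -m    path to a GGUF model (placeholder here)
#   -t    number of CPU threads
#   -ngl  0 keeps every layer on the CPU
./llama-bench -m /path/to/model.gguf -t 6 -ngl 0
```

If the t/s reported here matches GPT4ALL's slow numbers, the regression is in llama.cpp itself; if not, it's in how the frontend builds or invokes it.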

IndustrialOne commented 1 month ago

GPT4ALL's token speed regressed permanently, and they did not hold onto the same llama.cpp; they've continually updated it since, while the regression remained. The author blamed llama.cpp, so I of course assumed llama.cpp kept the code responsible for the slowdown, but GPT4ALL's staff aren't responding to my request for clarification, so to hell with it, I'm switching to LM Studio. Apologies for the bother.