ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Confusion about the model versioning #647

Closed · anzz1 closed this 1 year ago

anzz1 commented 1 year ago

So back when the project started, we had the first "unversioned" model format without the embedded tokens, with the magic 0x67676d6c (ggml).

The problem with that was that it had no versioning support, so newer/older versions would just conclude "I don't know what this is, this is not a model file".

Then in this commit https://github.com/ggerganov/llama.cpp/commit/074bea2eb1f1349a0118239c4152914aecaa1be4, which added the embedded tokens, we got a new model format with the magic 0x67676d66 (ggmf) along with versioning, so it could now say "this is definitely a model file, but a wrong version", as shown here: https://github.com/ggerganov/llama.cpp/blob/3bcc129ba881c99795e850b0a23707a4dfdabe9d/llama.h#L22

That was definitely a good move towards future-proofing. Any breaking change could simply bump that version, and all would be fine and dandy for the next 4294967295 versions of the model format.
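
To make the argument concrete, here is a minimal sketch of what that scheme buys you. This is not the actual llama.cpp loader, just an assumed reader built around the two llama.h constants, showing the two distinct failure modes the versioning was designed to distinguish:

#include <cstdint>
#include <cstdio>

#define LLAMA_FILE_VERSION 1
#define LLAMA_FILE_MAGIC   0x67676d66 // 'ggmf' in hex

// Hypothetical header check, not the real llama.cpp code: read the leading
// magic and version fields and report what the scheme can tell us.
bool check_model_header(FILE * f) {
    uint32_t magic = 0, version = 0;
    if (fread(&magic,   sizeof(magic),   1, f) != 1 ||
        fread(&version, sizeof(version), 1, f) != 1) {
        fprintf(stderr, "file too short to be a model\n");
        return false;
    }
    if (magic != LLAMA_FILE_MAGIC) {
        // unknown magic: all we can say is "this is not a model file we know"
        fprintf(stderr, "bad magic %08x: not a (known) model file\n", (unsigned) magic);
        return false;
    }
    if (version != LLAMA_FILE_VERSION) {
        // the whole point of the version field: a clear, actionable error
        fprintf(stderr, "model file version %u, expected %u: please re-convert the model\n",
                (unsigned) version, (unsigned) LLAMA_FILE_VERSION);
        return false;
    }
    return true;
}

Bumping LLAMA_FILE_VERSION keeps older binaries in the second branch ("wrong version, please re-convert"), while changing the magic instead pushes them into the first one ("not a model file"), which is exactly the complaint below.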

But then came this commit: https://github.com/ggerganov/llama.cpp/commit/78ca9838ee36660a776e97e3391b6fb5dcaacf7f, which for absolutely no good reason changed the magic to 0x67676a74 (ggjt) while keeping the version at 1, completely breaking the whole versioning system and making it worthless.

Now we're back to a system where different versions of llama.cpp can't tell that "yes, these are indeed models, just older/newer versions". We already fixed this problem, so why the absolute f was there a need to break something that was already perfectly fine?

I just cannot understand the reasoning behind this, except maybe vanity, I guess (the new magic is the initials of the person who made the commit)? Absolutely ridiculous to break a perfectly functional system. Or is there actually some proper reason for this that I'm completely missing?

It is already a struggle since various older forks like alpaca.cpp / gpt4all use the unversioned format. The move to the versioned format already fractured the community a bit, but it was a good and necessary change overall and fixed the version-confusion problem for future versions. But now we have a third format change, made intentionally worse by changing the magic instead of doing it properly through the versioning system that was put in place back then. This causes even more confusion: every commit since https://github.com/ggerganov/llama.cpp/commit/074bea2eb1f1349a0118239c4152914aecaa1be4, where this whole problem was already fixed, is now broken again, and those versions will say "I do not know what this is, it is not a model file" about the new format. WHY?

Again, the proper way of updating the model format, as envisioned by the versioning system, is:

-#define LLAMA_FILE_VERSION 1
+#define LLAMA_FILE_VERSION 2
#define LLAMA_FILE_MAGIC 0x67676d66 // 'ggmf' in hex

and not

#define LLAMA_FILE_VERSION 1
-#define LLAMA_FILE_MAGIC 0x67676d66 // 'ggmf' in hex
+#define LLAMA_FILE_MAGIC 0x67676a74 // 'ggjt' in hex

as was actually committed in https://github.com/ggerganov/llama.cpp/commit/78ca9838ee36660a776e97e3391b6fb5dcaacf7f.

What is the actual line of thinking here? Are we just going to keep the version at 1, completely abandon the versioning system, and keep changing the magic to the initials of whoever makes the change? How the everliving F does that make any sense?!

If this was actually done by accident, from not understanding the versioning system rather than by intention, then sorry for my scathing remarks. If it was intentional, breaking a perfectly functional system for vanity's sake, then all the scathing is well deserved.

Pulling dumb shit like this is a good way to make a fantastic open-source project fall apart quickly.

PriNova commented 1 year ago

I vote for the versioning system.

Another way to use magics (if they need to be used at all) along with versioning is to use file extensions: .pth (pytorch), .llm (old llama), .alp (alpaca), .gpt4 (gpt4all), etc.

For the user it is more explicitly visible what kind of model it is, and checking for file extensions in code is easy.
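
For what it's worth, a rough sketch of how such an extension check could look. The extensions and the mapping here are only the hypothetical ones proposed above; llama.cpp does not actually use any of them:

#include <string>

// Hypothetical format tags for the proposed extensions.
enum class model_kind { pytorch, llama_legacy, alpaca, gpt4all, unknown };

model_kind guess_model_kind(const std::string & path) {
    auto ends_with = [&](const std::string & suffix) {
        return path.size() >= suffix.size() &&
               path.compare(path.size() - suffix.size(), suffix.size(), suffix) == 0;
    };
    if (ends_with(".pth"))  return model_kind::pytorch;      // PyTorch checkpoint
    if (ends_with(".llm"))  return model_kind::llama_legacy; // old llama format
    if (ends_with(".alp"))  return model_kind::alpaca;       // alpaca
    if (ends_with(".gpt4")) return model_kind::gpt4all;      // gpt4all
    return model_kind::unknown;
}

Of course an extension is only a hint; files get renamed, so the loader would still want to validate the header.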

LostRuins commented 1 year ago

Problem is that the commit with the changed magic already went live, so now if you revert to a version numbering system you actually end up with 4 versions.

P.S. I plan to forever maintain backwards compatibility with every single model format version in my own fork. I have kept the legacy tokenizer alpaca.cpp uses from the very first version. I don't know why that is not a consideration.

Edit: And it's done. It transparently supports the ggml (v1, alpaca), ggmf (v2) and ggjt (v3) magics. Let's see how many days we can last until we get another new magic! /s

anzz1 commented 1 year ago

Yeah, I didn't see a vote anywhere about changing the model magic to something else and destroying the versioning; it was just sneakily pushed in without asking anyone whether "engrave your initials, destroy the versioning" is the style the community wants. If it is, then fine, the project can be forked anyway, but the problem I have here is the way it was done: sneakily, without a poll or any discussion about it.

There wasn't even a poll about whether a new format version should be introduced at all, let alone about destroying the versioning. The arrogance of forcing "your way" / "your opinion" onto others without asking the community first rubs me in a very, very wrong way. And the vanity of carving your initials into something that isn't yours but a collaborative effort is so pompous that, in combination, the Finn in me just can't stand that level of pompousness and arrogance.

Problem is that the commit with the changed magic already went live, so now if you revert to a version numbering system you actually end up with 4 versions.

You are correct about that, and that's why reverting it quickly is such a priority. I'm sure I'm not the only one who disagrees with this, so there will be even more formats once forks appear that reject this latest development.

P.S. I plan to forever maintain backwards compatibility with every single model format version in my own fork. I have kept the legacy tokenizer alpaca.cpp uses from the very first version. I don't know why that is not a consideration.

Tbh at this point I'm considering working on a fork instead too.

Note that I don't have skin in the game; the versioning format wasn't implemented by me. So before anyone accuses me of just defending my own work from being destroyed: I'm defending against destroying other people's work for vanity's sake and forcing an opinion through without a poll.

slaren commented 1 year ago

I am very confused about this and a few other things related to that PR as well, but ultimately I am not sure that it is worth making a big deal out of it.

linouxis9 commented 1 year ago

I completely agree. At the very least we could add a check for the old magic numbers, to tell the user that we recognize the model and that it needs to be converted. Tbh, I would have been fine with the change if that had been done, as a new LLAMA_FILE_VERSION would have required such a message and a conversion anyway. Then again, we could have kept backward compatibility (even with a new magic number, just without the ability to mmap old models), and I'm not sure we always want to mmap the file anyway...

InconsolableCellist commented 1 year ago

I'm a bit confused: it seems that both text-generation-webui and llama.cpp used to be able to use the same model files. But now text-generation-webui recommends .safetensors, and llama.cpp doesn't understand that format. How does one convert from safetensors to the latest ggml with the apparently blessed magic number?

Regarding versioning, bumping version numbers is a much better idea than changing magic numbers.

Ronsor commented 1 year ago

Problem is that the commit with the changed magic already went live, so now if you revert to a version numbering system you actually end up with 4 versions.

Support both

sw commented 1 year ago

Yes, this was a bit unfortunate. @ggerganov supported the change by @jart in order to honor her work on mmap. That certainly deserves praise, but breaking the promise that the existing versioning scheme made is problematic (the promise being that a change in format would keep the magic number but increment the version).

InconsolableCellist commented 1 year ago

Well, a bit more than that:

Regarding the version comment - yes, the plan was to bump versions and not the magic. But I'm ok to change the magic to commemorate the significance of this update. In fact, maybe we can make this a thing and everybody who makes a significant contribution to the project will get their initials appended to the version. What do you think? 😄

Versioning was removed for posterity, it would seem. Making this a habit seems like a remarkably bad idea to me. That's what the commit history and credits are for.

prusnak commented 1 year ago

Although I wasn't thrilled to see the change either, we must respect Georgi's wishes, as he is still the main maintainer of the project. You have every right to respond by forking the project, which is the beauty of open source; however, I am fairly certain that such an effort would be futile, because maintaining a project is a very time-consuming task. If this incident is the sole reason for the fork, it will likely be in vain, and it would be better to simply swallow the anger and continue contributing to the main project.

All of the above said, I do believe it would be highly beneficial for the project if this were the last "magic" change, and any further versioning of the format took place through the version field.

ggerganov commented 1 year ago

Sorry about this. I like that you care about the project so much, and I agree that this probably wasn't the brightest idea. But I think the changes from the PR handle the model migration pretty well, so we should be good and there is no reason for big drama.

The most important goal for me for this project is to have fun and experiment. I am not doing it to create stable, production-ready software, and I don't really care about people who don't know how to migrate their models. I care about the people who are interested in helping to improve the software and unlock new, interesting and useful applications. And I'm pretty sure all of you fall into the second category.

There will probably be other stuff that we'll mess up in the future, and that is fine. No need to sweat about it. Moreover, with how fast things are going lately, chances are that in a couple of months llama.cpp will be long forgotten in the past of exponential AI development, so .. 😄

prusnak commented 1 year ago

Thanks, @ggerganov for your comment. I feel it resolves "the confusion" and we can close this issue.

blackhole89 commented 1 year ago

Per the conclusion of the unfortunate discussion in #711, where Georgi also suggested that he is on board with a review of what is going on with the format, it seems that this issue might be relevant again. I am reopening it for now with a particular eye to the comment that @LostRuins made above (since before I noticed it I was about to try and reinvent that wheel myself):

And it's done. It transparently supports the ggml (v1, alpaca), ggmf (v2) and ggjt (v3) magics.

@LostRuins - Would you mind making a PR with that patch? I think that, for now, introducing something like it would be the best way to go, since it would minimize disruption both to end users and to other existing development efforts, which after all do not all revolve around the file format. We could then compare the situation before/after 78ca9838ee36660a776e97e3391b6fb5dcaacf7f and before any mmap-related changes, and thereby determine more effectively whether any performance issues are intrinsic to mmap or to the change in memory layout.

comex commented 1 year ago

You may also want to support the slightly different ggml format used by GPT4All, although doing so requires a hack since there is no difference in magic number. This is how I did it in my currently-PRed conversion script:

https://github.com/ggerganov/llama.cpp/blob/b9c372dc7b571120e50cb1085fa05e0b209eeef5/convert.py#L703

LostRuins commented 1 year ago

@blackhole89 hmm a direct PR might be difficult for a few reasons:

  1. My fork (https://github.com/LostRuins/koboldcpp) is under the AGPL license due to the inclusion of the embedded Kobold Lite, whereas the parent repo is under the MIT license. If code were taken from it, the parent repo would have to become AGPL-licensed too, something I'm not sure everyone would want.
  2. There are a lot of functions required to support the older formats, as it's not just the ggml magic that changes but also the tokenizer and how the vocab scores are stored. I actually keep a duplicated, separate load function which I use for the legacy formats, so I can continue pulling updates from upstream. It's a bit hacky and I'm not sure it would be a good idea to merge it all in.

but my changes are open source, so feel free to reference them if anyone decides to officially port backwards compatibility in.
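
In case it helps anyone attempting such a port, the general shape of that approach is roughly the following. This is a condensed sketch under my own assumptions, not code from koboldcpp: read the magic once, then branch to a format-specific loader.

#include <cstdint>
#include <cstdio>

// The three formats discussed in this thread.
enum class model_format { ggml_v1, ggmf_v2, ggjt_v3, unknown };

model_format detect_model_format(FILE * f) {
    uint32_t magic = 0;
    if (fread(&magic, sizeof(magic), 1, f) != 1) return model_format::unknown;
    switch (magic) {
        case 0x67676d6c: return model_format::ggml_v1; // 'ggml': unversioned, legacy tokenizer
        case 0x67676d66: return model_format::ggmf_v2; // 'ggmf': versioned, vocab scores added
        case 0x67676a74: return model_format::ggjt_v3; // 'ggjt': current, mmap-friendly layout
        default:         return model_format::unknown;
    }
}

A loader can then branch once on the detected format and hand the file to a format-specific load routine (including the old tokenizer and vocab-score handling for the legacy paths), which is roughly the shape of the duplicated load function described above.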

KASR commented 1 year ago

And it's done. It transparently supports the ggml (v1, alpaca), ggmf (v2) and ggjt (v3) magics.

@LostRuins - Would you mind making a PR with that patch? I think that, for now, introducing something like it would be the best way to go, since it would minimize disruption both to end users and to other existing development efforts, which after all do not all revolve around the file format. We could then compare the situation before/after 78ca983 and before any mmap-related changes, and thereby determine more effectively whether any performance issues are intrinsic to mmap or to the change in memory layout.

@blackhole89 I've tried to perform such a benchmark for #603

You can find the full results here --> Benckmark_commits_results.csv. The Python code used is here --> benchmark_commits_llama_cpp.py.

It's important to note that the timings were obtained via the text output from llama after the evaluation (as can be seen in the gist). The results are averaged over 10 evaluations, since I noticed variations in the timings. I've used n=128; however, the latest comments in #603 indicate that performance drops if this is increased. And of course the results will depend on the system, so they might not be representative at all...

[Charts: average total times and average token times across the benchmarked commits]

LostRuins commented 1 year ago

Actually why are people so concerned about load times? Inference times are way more important - after all you should only load the model once per session.

danielzgtg commented 1 year ago

Did the load time PR impair inference times at all? There's no way it could have. If my mouse and keyboard are both broken, it's nonsense to say "why do people care about the mouse? the keyboard is way more important". That's especially true when someone knows how to fix the mouse (load times) but nobody knows how to improve the keyboard (inference times).

KASR commented 1 year ago

Actually why are people so concerned about load times? Inference times are way more important - after all you should only load the model once per session.

Indeed, that's why I posted the total and load times on a single chart. And this is 'only' for n=128; as the number of tokens grows, or during chat mode, the load time probably becomes a negligible part of the total time.

Did the load time PR impair inference times at all? There's no way it could have. If my mouse and keyboard are both broken, it's nonsense to say "why do people care about the mouse? the keyboard is way more important". That's especially true when someone knows how to fix the mouse (load times) but nobody knows how to improve the keyboard (inference times).

You are correct: the impact on the token eval timings is not measurable in my test. You can see that over the last commits the eval time per token remains approximately the same. (For completeness, after reading the other issues/commits I should add that I'm using an SSD, so this says nothing about spinning HDDs.)

blackhole89 commented 1 year ago

@KASR Thanks for collecting the data! This is very helpful. It does indeed look like whatever memory layout changes happened with the ggjt change had barely any, if any, impact on inference times (at least on your machine). There are however some pretty dramatic leaps in older history, which we ought to review.

@danielzgtg There could theoretically have been something as a consequence of changed memory layout (caching is subtle and very relevant here), but it doesn't seem like this happened, at least on the machine the test was conducted on.

@LostRuins I see. Are some of the relevant patches not authored by you, or do they have a nontrivial dependency on code that is not authored by you? Because otherwise, it's not as if you lack the right to offer them to this repository under a different license. AGPL-licensing your own code doesn't preclude you from also releasing it under any other license.

LostRuins commented 1 year ago

Did the load time PR impair inference times at all? There's no way it could have. If my mouse and keyboard are both broken, it's nonsense to say "why do people care about the mouse? the keyboard is way more important". That's especially true when someone knows how to fix the mouse (load times) but nobody knows how to improve the keyboard (inference times).

More like, your mouse is slightly wonky and your keyboard is missing three keys. If you only use the mouse once a week and the keyboard the rest of the time, it would be a waste of time to focus on fixing that slightly wonky mouse rather than the keyboard. I'm not saying it's pointless; having both fixed would be great, but time and resources are better spent where they are most useful.

@LostRuins I see. Are some of the relevant patches not authored by you, or do they have a nontrivial dependency on code that is not authored by you? Because otherwise, it's not as if you lack the right to offer them to this repository under a different license. AGPL-licensing your own code doesn't preclude you from also releasing it under any other license.

@blackhole89 tbh I'm not too proud of my fix; the relevant patches are basically me duct-taping different versions of the model loader function together and switching between them based on certain values read from the file. Regarding the AGPL license: yes, I could technically separate out some code and relicense that part separately. However, if others see PRs from my repo getting merged in, they might assume that other code from that repo is okay to use, and that's specifically not true for anything related to the Kobold API and the Kobold Lite stuff included in my repo, of which I am the author and which I do not want inside future closed-source projects.