Open mofosyne opened 1 month ago
I think it's not clear enough what exactly general.source.* is, and how different it is from the keys with same names which are directly in general.*.
I think general.doi and general.source.doi are redundant.
And repo_url vs url. There can be homepages, paper, code, and weights. Which goes where? What if they are in different places?
Also, general.license is specifically a license identifier from https://spdx.org/licenses/ (so MIT is correct), but not all licenses are there (e.g. the custom Llama license), so general.license.name and general.license.link make sense to exist.
I think it's not clear enough what exactly general.source.* is, and how different it is from the keys with same names which are directly in general.*.
Hopefully this explains my mental model:
flowchart LR
B1[Base Model 0] -- merged or finetuned --> source
B2[Base Model 1] -- merged or finetuned --> source
source[Source] --converted to gguf--> model[Main Model]
flowchart LR
B1[general.base_model.0.*] -- merged or finetuned --> source
B2[general.base_model.1.*] -- merged or finetuned --> source
source[general.source.*] --converted to gguf--> model[general.*]
So basically general.source.* applies when there is no 'training' or 'fine-tuning' involved, only a format conversion, quantization, etc.
And if not converted but directly generated into a gguf file:
flowchart LR
B1[Base Model 0] -- merged or finetuned --> model
B2[Base Model 1] -- merged or finetuned --> model
model[Main Model]
Or if it is just a straight-up new base model (safetensors etc.) that was converted to gguf:
flowchart LR
source[Source] --converted to gguf--> model[Main Model]
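The cases in the flowcharts above could be sketched as a key layout like the one below. This is only an illustration of the mental model: the `build_metadata` helper and the `.name` sub-keys under `general.base_model.N` and `general.source` are assumptions made for the example, not an agreed naming.

```python
def build_metadata(model_name, base_models=(), source_name=None):
    """Lay out provenance keys following the flowcharts (illustrative only)."""
    kv = {"general.name": model_name}
    # Each merged or fine-tuned parent gets its own indexed namespace.
    for i, base in enumerate(base_models):
        kv[f"general.base_model.{i}.name"] = base
    # general.source.* is only filled in when the GGUF file was converted
    # from another artifact rather than generated directly.
    if source_name is not None:
        kv["general.source.name"] = source_name
    return kv


# Merged or fine-tuned, then converted to gguf.
converted = build_metadata(
    "Main Model", ["Base Model 0", "Base Model 1"], source_name="Source"
)
# Merged or fine-tuned and written directly as gguf: no general.source.* keys.
direct = build_metadata("Main Model", ["Base Model 0", "Base Model 1"])
```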
I think general.doi and general.source.doi are redundant.
You think so? I think that while they share the same weights, it's a different digital object (due to possible loss of accuracy during the quant process).
And repo_url vs url. There can be homepages, paper, code, and weights. Which goes where? What if they are in different places?
Yeah, my intent is that url refers to "homepages, paper" while repo_url is primarily for the "code, weights".
Also, general.license is specifically a license identifier from https://spdx.org/licenses/ (so MIT is correct), but not all licenses are there (e.g. the custom Llama license), so general.license.name and general.license.link make sense to exist.
Yeah, you got the idea why I added the two extra fields: multiple people were using 'other' in place of the license and putting the actual license in the two separate fields.
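A minimal sketch of how the three license keys could be filled in, assuming the 'other' convention described above. The `license_kv` helper, the tiny `KNOWN_SPDX_IDS` subset, and the example.com link are placeholders for illustration; a real tool would check against the full list at https://spdx.org/licenses/.

```python
# Tiny illustrative subset of SPDX identifiers (assumption: a real tool
# would load the complete list instead of hardcoding three entries).
KNOWN_SPDX_IDS = {"MIT", "Apache-2.0", "GPL-3.0-only"}


def license_kv(identifier, name=None, link=None):
    """Build the general.license* keys for a model (hypothetical helper)."""
    kv = {}
    # general.license only ever holds an SPDX identifier or 'other'.
    kv["general.license"] = identifier if identifier in KNOWN_SPDX_IDS else "other"
    # Custom licenses are described through the two extra fields.
    if name:
        kv["general.license.name"] = name
    if link:
        kv["general.license.link"] = link
    return kv


spdx_case = license_kv("MIT")
custom_case = license_kv(
    "custom-llama-license",
    name="custom Llama license",
    link="https://example.com/LICENSE",  # placeholder URL
)
```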
Should we bake in support for model cards to be copied verbatim into the model? The issue would be the tendency of model card writers to put in external image links etc., so copying the model card markdown content might not be a good idea. But if we do, then these are the proposed additions.
The difference between our KV GGUF storage and the model card format is that while the KV key-value store is quite rigid, the model card metadata is quite freeform, which has its own advantages (but clashes with the needs of the KV store). So it might be worth copying the model card content in but keeping it somewhat separate.
Below are some of the possible fields we may want to include under general.model_card.*; we might not want to use all of these ideas. Happy to hear your thoughts, at least on which ones are a good idea to use.
general.model_card.format.uuid: UUID representing the 'model card' hosting context, recorded in gguf if common. This is important in my opinion for decoupling the model_card format from Hugging Face in case future providers come into the fray, e.g. maybe GitHub wants to get in on the fun?
general.model_card.format.name: Human readable name of the model card hosting context (e.g. huggingface)
general.model_card.front_matter: JSON encoded representation of the YAML front matter of the model card
general.model_card.content: Markdown encoded model card content
general.model_card.thumbnail: Binary encoded image banner? I see some model cards try to have a fancy image banner to act as a visual representation of the model brand.
general.model_card.url: Maybe we want an external link to the repo model card in case the author changes the model card content?
How this would work with, say, a 'model browser' is that it would allow the user to read the model card offline, e.g. a file browser would show the model card to the side when the model file is selected.
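As a rough sketch of the front_matter and content fields proposed above, the snippet below splits a model card into its front matter and Markdown body and JSON-encodes the former. It assumes a flat `key: value` front matter and parses it by hand; real model cards use full YAML (lists, nesting), which would need a proper YAML parser. The `split_model_card` helper is hypothetical.

```python
import json


def split_model_card(text):
    """Split a model card into (front_matter_dict, markdown_body)."""
    lines = text.splitlines()
    # No leading '---' fence means there is no front matter at all.
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = lines.index("---", 1)  # closing fence of the front matter
    except ValueError:
        return {}, text
    front = {}
    for line in lines[1:end]:
        # Simplification: only flat 'key: value' pairs are understood.
        key, sep, value = line.partition(":")
        if sep:
            front[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return front, body


card = "---\nlicense: mit\nlanguage: en\n---\n# My Model\n\nSome description.\n"
front, body = split_model_card(card)
kv = {
    "general.model_card.front_matter": json.dumps(front),
    "general.model_card.content": body,
}
```

Keeping the front matter as one JSON string leaves the freeform part of the card separate from the rigid KV keys.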
I'd like to propose including the git commit that was used to convert the model, as well as the conversion script used. I think it'd be extremely useful for debugging issues with converted models that are shared via gguf on the hub. Additionally, I think including some sort of versioning could be useful in case we need to break backwards compatibility. We could have a place in the code where we check if model version > n, and if not we display a suitable warning in case there's something like the tokenizer change. This could also be used by huggingface to display some sort of badge for deprecated models, or to delete them.
git commit
@Galunid interesting. Could you also suggest a general.X key name similar to my table above, so I at least know how to name it if it makes sense to include? This is my guess:
general.source.git.hash
general.source.conversion_script - Not sure how this would look... in terms of content. Got a conversion script?
Also, I thought the gguf file format structure already has versioning? (Image: gguf file structure)
I'd propose
general.llama_cpp.git_hash
general.llama_cpp.conversion_script
or something like that. I'd put them under the llama_cpp key (?), rather than source, since we are setting properties regarding llama.cpp, rather than the source model's. To be clear, by git commit I meant llama.cpp's HEAD pointer in the llama.cpp directory. One problem is that some users may not use the git version, but one they downloaded or installed using brew, so I think it'd be good to have a fallback to something else. As for the conversion script, I meant something like convert-hf-to-gguf.py or convert-legacy-llama.py, with the possibility of different values should other scripts appear. This could be hardcoded directly in the scripts, so it should be simple to implement.
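A minimal sketch of reading the HEAD commit with a fallback for non-git installs, under the key names proposed in this comment. The `llama_cpp_git_hash` helper and the "unknown" fallback value are assumptions for illustration; what the fallback should actually be is still open.

```python
import subprocess


def llama_cpp_git_hash(repo_dir=".", fallback="unknown"):
    """Return the HEAD commit hash of repo_dir, or fallback if unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, the directory does not exist, or it is not
        # a git checkout (e.g. a tarball download or a brew install).
        return fallback
    return result.stdout.strip() or fallback


kv = {
    "general.llama_cpp.git_hash": llama_cpp_git_hash(),
    # Hardcoded per script, as suggested above.
    "general.llama_cpp.conversion_script": "convert-hf-to-gguf.py",
}
```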
To be clear these are just some ideas that I feel make sense, but I'm very much open to critique.
As for version, it's true that gguf has it, but I considered it more of a GGUF Standard version (major). The one I propose is different in the sense that it's more of a minor revision used mainly for llama.cpp. For example, with tokenizer changes where we add the tokenizer.pre meta-field, we don't break the GGUF Standard (major) version, so we shouldn't change it. I propose a minor version we could update when we break backwards compatibility without changing the standard.
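The minor-version check described here might look roughly like the sketch below. The key name `general.llama_cpp.minor_version`, the threshold value, and the `check_minor_version` helper are all hypothetical; nothing in this thread has settled on them.

```python
import warnings

# Hypothetical: the lowest minor revision the loader fully supports.
MIN_SUPPORTED_MINOR_VERSION = 2


def check_minor_version(kv, minimum=MIN_SUPPORTED_MINOR_VERSION):
    """Warn when a model predates a backwards-incompatible change."""
    # Models written before the key existed are treated as revision 0.
    version = kv.get("general.llama_cpp.minor_version", 0)
    if version < minimum:
        warnings.warn(
            f"model minor version {version} is older than {minimum}; "
            "it may need to be reconverted (e.g. after a tokenizer change)"
        )
        return False
    return True


up_to_date = check_minor_version({"general.llama_cpp.minor_version": 3})
```

The GGUF Standard (major) version stays untouched in this scheme; only the extra key changes.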
I'd propose
general.llama_cpp.git_hash
An unfortunate effect of including the git commit hash of llama.cpp would be to make it impossible to compare the checksums of converted models across commits. And this is very useful to test for regressions when refactoring.
An unfortunate effect of including the git commit hash of llama.cpp would be to make it impossible to compare the checksums of converted models across commits. And this is very useful to test for regressions when refactoring.
I was thinking of adding a flag that could prevent that, or writing a script like gguf-diff that could compare models in a way similar to other diff tools.
or writing script like gguf-diff that could compare models in a way similar to other diff tools.
I'd be very interested in this. Both non-interactively and interactively, to investigate differences. At first for metadata, with a yes/no answer for tensor data. But later, comparisons would be useful for tensor data too, like when comparing quantization (re)implementations. It would be nice to output hex samples of differing data at the granularity of the block size of the quant formats. I'm not sure about cross-quantization diffing, though.
But no need for all of that at first (since it would add quite a bit of complexity).
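The "metadata first, yes/no for tensor data" stage could be sketched as below. It works on plain dicts rather than real GGUF files, so reading the files is left out; `diff_metadata`, `tensors_equal`, and the `ignore` parameter are hypothetical names, and ignoring the git hash key is one possible way to keep regression comparisons working across commits.

```python
def diff_metadata(a, b, ignore=()):
    """Compare two metadata dicts, skipping keys listed in ignore."""
    keys = (set(a) | set(b)) - set(ignore)
    return {
        "added": sorted(k for k in keys if k not in a),
        "removed": sorted(k for k in keys if k not in b),
        "changed": sorted(k for k in keys if k in a and k in b and a[k] != b[k]),
    }


def tensors_equal(a, b):
    """Yes/no answer for tensor data: same names and identical bytes."""
    return a == b


old = {
    "general.name": "model",
    "general.description": "v1",
    "general.llama_cpp.git_hash": "aaa",
}
new = {
    "general.name": "model",
    "general.description": "v2",
    "general.llama_cpp.git_hash": "bbb",
    "general.license": "MIT",
}
result = diff_metadata(old, new, ignore=["general.llama_cpp.git_hash"])
same_tensors = tensors_equal({"w": b"\x01\x02"}, {"w": b"\x01\x02"})
```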
I was thinking of adding a flag that could prevent that
Both being able to set it to a known commit hash and, at least at first, being able to prevent adding the commit hash would be necessary.
I've been thinking... If metadata is something that should be easily changeable in GGUF models without rewriting all the tensor data, would it be more convenient if the GGUF metadata KV were at the end of the files instead?
For backward compatibility, instead of changing the GGUF format too deeply, there could even be a normal GGUF metadata key at the beginning (along with some others) which would store an offset for the extended metadata key-value pairs.
Or that offset could be calculated from the tensor info section.
Might not be worth it, but this would allow easier metadata additions even to existing GGUF models.
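To illustrate the idea of an offset pointing at trailing metadata, here is a toy container. It is deliberately not the GGUF layout: the `TOY1` magic, the single 8-byte offset field, and the JSON-encoded metadata are all made up for the example. The point is only that the metadata can be replaced without touching the tensor bytes in front of it.

```python
import json
import struct

MAGIC = b"TOY1"  # made-up magic for this toy format, not GGUF's
HEADER_SIZE = len(MAGIC) + 8  # magic + little-endian u64 metadata offset


def write_file(tensor_data, metadata):
    """Layout: magic | metadata offset | tensor data | metadata (JSON)."""
    meta = json.dumps(metadata).encode("utf-8")
    meta_offset = HEADER_SIZE + len(tensor_data)
    return MAGIC + struct.pack("<Q", meta_offset) + tensor_data + meta


def read_metadata(blob):
    if blob[:len(MAGIC)] != MAGIC:
        raise ValueError("not a toy container")
    (meta_offset,) = struct.unpack_from("<Q", blob, len(MAGIC))
    return json.loads(blob[meta_offset:].decode("utf-8"))


def replace_metadata(blob, metadata):
    """Swap the trailing metadata; header and tensor bytes are kept as-is."""
    (meta_offset,) = struct.unpack_from("<Q", blob, len(MAGIC))
    return blob[:meta_offset] + json.dumps(metadata).encode("utf-8")


tensors = bytes(range(16))
blob = write_file(tensors, {"general.name": "a"})
updated = replace_metadata(blob, {"general.name": "b", "general.license": "MIT"})
```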
@compilade you might be interested in https://github.com/ggerganov/llama.cpp/discussions/7839, where I took a stab at hashing the weights of the models to generate a UUID. Obviously my approach was an attempt at having a hash that would ignore quantisation, but anyway I hope that sparks some ideas.
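For readers unfamiliar with the idea, a naive version of deriving a UUID from the weights could look like this. It is not the approach from the linked discussion: it hashes raw tensor bytes, so unlike that attempt it does not ignore quantisation. The `model_uuid` helper and the choice of SHA-256 plus `uuid.uuid5` are assumptions for illustration.

```python
import hashlib
import uuid


def model_uuid(tensors):
    """Derive a deterministic UUID from tensor names and raw bytes."""
    digest = hashlib.sha256()
    # Sort by name so the result does not depend on insertion order.
    for name in sorted(tensors):
        digest.update(name.encode("utf-8"))
        digest.update(tensors[name])
    # uuid5 maps the hex digest into UUID space deterministically.
    return uuid.uuid5(uuid.NAMESPACE_URL, digest.hexdigest())


weights = {"blk.0.weight": b"\x00\x01\x02", "output.weight": b"\x03\x04"}
first = model_uuid(weights)
second = model_uuid(dict(reversed(list(weights.items()))))
changed = model_uuid({**weights, "output.weight": b"\x03\x05"})
```

Because metadata is not part of the hash, two files with the same weights but different KV entries get the same UUID.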
What are you trying to compare when diffing models? Difference in metadata? Difference in layers? (E.g. mismatch in layer X)?
Background Description
During https://github.com/ggerganov/llama.cpp/pull/7499 it turns out that the KV store metadata needs further development.
We need an outline of the KV store, and a consensus on that outline, such that it is not too closely coupled with HuggingFace and is independent enough to service GGUF use cases. Ideally we should be able to remotely fetch all the details as needed, or use the model card as a fallback.
Below is my stab at listing the keys I thought about, as well as the possible Hugging Face model card keys I could use.