ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Refactor: Formalise Keys.General GGUF KV Store #7836

Open mofosyne opened 1 month ago

mofosyne commented 1 month ago

Background Description

During https://github.com/ggerganov/llama.cpp/pull/7499 it became clear that the KV store metadata needs further development.

We need an outline, and a consensus on that outline, for the KV store such that it is not too closely coupled with HuggingFace and is independent enough to serve GGUF use cases. Ideally we should be able to fetch all the details remotely as needed, or use the model card as a fallback.

Below is my first stab at listing the keys I thought about, along with the possible Hugging Face model card keys I could map them from.

| GGUF Key (Authorship Metadata Only) | Hugging Face Model Card Key | Example Value | Semantic Description |
|---|---|---|---|
| `general.name` | `model_name` | `"GPT-3"` | Name or title of the model |
| `general.author` | `model_creator` | `"TheBloke"` | Name(s) of the author(s) |
| `general.version` | `model_version` | `"v3.0"` | Version number of the model |
| `general.organization` | `model_organization` | `"OpenAI"` | Organization or institution associated with the model |
| `general.finetune` | `model_finetune` | `"Instruct"` | Finetune portion of the model filename |
| `general.basename` | `model_basename` | `"gpt3-base"` | Basename of the model filename |
| `general.description` | `model_description` | `"Large-scale language model."` | Brief description of the model and its use cases |
| `general.quantized_by` | `quantized_by` | `"OpenAI"` | Entity responsible for quantizing the model to GGUF |
| `general.parameter_class_attribute` | `model_parameter_class_attribute` | `"8x3B"` | Parameter weight class attribute of model parameters |
| `general.license` | `license` | `"MIT"` | License type under which the model is released |
| `general.license.name` | `license_name` | `"MIT License"` | Name of the license |
| `general.license.link` | `license_link` | `"https://opensource.org/licenses/MIT"` | Link to the full text of the license |
| `general.url` | - | `"https://openai.com/gpt-3"` | URL to the model website or paper |
| `general.doi` | - | `"10.1234/5678"` | Digital Object Identifier (DOI) of the model |
| `general.uuid` | - | `"123e4567-e89b-12d3-a456-426614174000"` | Universally Unique Identifier (UUID) of the model |
| `general.repo_url` | - | `"https://github.com/openai/gpt-3"` | URL to the model source repository |
| `general.source.url` | - | `"https://arxiv.org/abs/2005.14165"` | URL to the source website or paper |
| `general.source.doi` | - | `"10.1234/5678"` | DOI of the source |
| `general.source.uuid` | - | `"123e4567-e89b-12d3-a456-426614174000"` | UUID of the source |
| `general.source.repo_url` | - | `"https://github.com/openai/gpt-3"` | URL to the source repository |
| `general.base_model.count` | `base_model` (derived from id) | `2` | Number of base models used to create the model |
| `general.base_model.{id}.name` | `base_model` (derived from id) | `"BERT"` | Name or title of the base model |
| `general.base_model.{id}.author` | - | `"Google"` | Name(s) of the author(s) of the base model |
| `general.base_model.{id}.version` | `base_model` (derived from id) | `"3.0"` | Version number of the base model |
| `general.base_model.{id}.organization` | `base_model` (derived from id) | `"Google"` | Organization or institution associated with the base model |
| `general.base_model.{id}.url` | - | `"https://arxiv.org/abs/1810.04805"` | URL to the base model website or paper |
| `general.base_model.{id}.doi` | - | `"10.1234/5678"` | DOI of the base model |
| `general.base_model.{id}.uuid` | - | `"123e4567-e89b-12d3-a456-426614174000"` | UUID of the base model |
| `general.base_model.{id}.repo_url` | `base_model` (derived from id) | `"https://github.com/google/bert"` | URL to the base model source repository |
| `general.tags` | `tags` + `pipeline_tag` | `["NLP", "Language Modeling"]` | Tags associated with the model (e.g., categories) |
| `general.languages` | `language` | `["English", "Spanish"]` | Languages supported by the model |
| `general.datasets` | `datasets` | `["Wikipedia", "Common Crawl"]` | Datasets used to train the model |
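To make the table concrete, here is a minimal sketch of how structured metadata could be flattened into these `general.*` keys, including the indexed `general.base_model.{id}.*` namespace with its `count` key. The helper name `flatten_metadata` is hypothetical and not part of llama.cpp or the gguf package.

```python
def flatten_metadata(meta: dict) -> dict:
    """Flatten nested authorship metadata into general.* GGUF KV pairs."""
    kv = {}
    for key, value in meta.items():
        if key == "base_model":
            # Repeated entries use an indexed namespace plus a count key.
            kv["general.base_model.count"] = len(value)
            for i, base in enumerate(value):
                for field, v in base.items():
                    kv[f"general.base_model.{i}.{field}"] = v
        else:
            kv[f"general.{key}"] = value
    return kv

meta = {
    "name": "GPT-3",
    "license": "MIT",
    "base_model": [
        {"name": "BERT", "organization": "Google"},
        {"name": "RoBERTa", "organization": "Meta"},
    ],
}
kv = flatten_metadata(meta)
print(kv["general.base_model.count"])   # 2
print(kv["general.base_model.1.name"])  # RoBERTa
```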

Possible Refactor Approaches

No response

compilade commented 1 month ago

I think it's not clear enough what exactly general.source.* is, and how it differs from the keys with the same names directly under general.*.

I think general.doi and general.source.doi are redundant.

And repo_url vs url. There can be homepages, paper, code, and weights. Which goes where? What if they are in different places?

Also, general.license is specifically a license identifier from https://spdx.org/licenses/ (so MIT is correct), but not all licenses are there (e.g. the custom Llama license), so general.license.name and general.license.link make sense to exist.
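The SPDX-versus-custom-license split above could be handled with a small fallback rule: emit `general.license` only for known SPDX identifiers, otherwise use the free-form name/link keys. This is a sketch; the SPDX set here is a tiny illustrative subset of the real registry, and the `"other"` sentinel is an assumption based on the convention mentioned later in this thread.

```python
# Tiny subset of https://spdx.org/licenses/ for illustration only.
SPDX_IDS = {"MIT", "Apache-2.0", "GPL-3.0-only", "BSD-3-Clause"}

def license_keys(license_id, name=None, link=None):
    """Choose between an SPDX identifier and free-form license fields."""
    kv = {}
    if license_id in SPDX_IDS:
        kv["general.license"] = license_id
    else:
        # Custom licenses (e.g. the Llama license) are not on the SPDX
        # list, so store a free-form name and link instead.
        kv["general.license"] = "other"
        if name:
            kv["general.license.name"] = name
        if link:
            kv["general.license.link"] = link
    return kv
```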

mofosyne commented 1 month ago

> I think it's not clear enough what exactly general.source.* is, and how different it is from the keys with same names which are directly in general.*.

Hopefully this explains my mental model:

```mermaid
flowchart LR
    B1[Base Model 0] -- merged or finetuned --> source
    B2[Base Model 1] -- merged or finetuned --> source
    source[Source] --converted to gguf--> model[Main Model]
```

```mermaid
flowchart LR
    B1[general.base_model.0.*] -- merged or finetuned --> source
    B2[general.base_model.1.*] -- merged or finetuned --> source
    source[general.source.*] --converted to gguf--> model[general.*]
```

So basically, general.source.* applies when there is no 'training' or 'fine-tuning' but simply a format conversion, quantization, etc...

And if not converted but directly generated into a gguf file:

```mermaid
flowchart LR
    B1[Base Model 0] -- merged or finetuned --> model
    B2[Base Model 1] -- merged or finetuned --> model
    model[Main Model]
```

Or if it's just a straight-up new base model (safetensors etc...) converted to GGUF:

```mermaid
flowchart LR
    source[Source] --converted to gguf--> model[Main Model]
```

> I think general.doi and general.source.doi are redundant.

You think so? While they share the same weights, it's a different digital object (due to possible loss of accuracy during the quantization process).

> And repo_url vs url. There can be homepages, paper, code, and weights. Which goes where? What if they are in different places?

Yeah, my intent is that url refers to the "homepage, paper" while repo_url is primarily for the "code, weights".

> Also, general.license is specifically a license identifier from https://spdx.org/licenses/ (so MIT is correct), but not all licenses are there (e.g. the custom Llama license), so general.license.name and general.license.link make sense to exist.

Yeah, you got the idea why I added the extra two fields: multiple people were using 'other' in place of the license and putting the actual license in the two extra separate fields.


### Add support for an integrated model card?

Should we bake in support for model cards to be copied verbatim into the model? The issue is the tendency of model card writers to include external image links etc., so copying the model card markdown content might not be a good idea. But if we do, then these are the proposed additions.

The difference between our GGUF KV storage and the model card format is that while the KV key-value store is quite rigid, the model card metadata is quite freeform, which has its own advantages (but clashes with the needs of the KV store). So it might be worth copying the model card content in but keeping it somewhat separate.

Below are some of the possible fields we may want to include under general.model_card.*; we might not want to use all of these ideas. Happy to hear your thoughts on which ones are worth using.

The way this would work with, say, a 'model browser' is that it would allow the user to read the model card offline, e.g. a file browser could show the model card in a side panel when the model file is selected.
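A minimal sketch of the verbatim-embedding idea, addressing the external-image concern by replacing remote images with their alt text before storing the card. The regex and the key name `general.model_card.markdown` are assumptions for illustration, not a settled proposal.

```python
import re

# Matches markdown images pointing at remote URLs, capturing the alt text.
IMG_PATTERN = re.compile(r"!\[([^\]]*)\]\(https?://[^)]+\)")

def embed_model_card(markdown: str) -> dict:
    # Replace remote images with their alt text so the embedded card
    # stays readable offline.
    cleaned = IMG_PATTERN.sub(lambda m: m.group(1), markdown)
    return {"general.model_card.markdown": cleaned}

card = "# My Model\n![benchmark badge](https://img.example/badge.svg)\nDetails here."
print(embed_model_card(card)["general.model_card.markdown"])
```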

Galunid commented 1 month ago

I'd like to propose including the git commit that was used to convert the model, as well as the conversion script used. I think it'd be extremely useful for debugging issues with converted models that are shared via GGUF on the hub. Additionally, I think including some sort of versioning could be useful in case we need to break backwards compatibility. We could have a place in the code where we check whether model version > n and, if not, display a suitable warning in case there's something like the tokenizer change. This could also be used by Hugging Face to display some sort of badge for deprecated models, or to delete them.

TLDR:

mofosyne commented 1 month ago

@Galunid interesting. Could you also suggest a general.X key name, similar to my table above, so I at least know how to name it if it makes sense to include? This is my guess:

Also, I thought the GGUF file format already has versioning? [image: GGUF file structure]

Galunid commented 1 month ago

I'd propose

or something like that. I'd put them under a llama_cpp key (?) rather than source, since we are setting properties of llama.cpp rather than the source model's. To be clear, by git commit I meant llama.cpp's HEAD pointer in the llama.cpp directory. One problem is that some users may not use a git checkout, but a downloaded release or one installed via brew, so I think it'd be good to have a fallback to something else. As for the conversion script, I meant something like convert-hf-to-gguf.py or convert-legacy-llama.py, with the possibility of different values should other scripts appear. This could be hardcoded directly in the scripts, so it should be simple to implement.
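The HEAD-with-fallback idea could look something like this. It's a sketch, not llama.cpp code: the key names follow the proposal above, and the `"unknown"` fallback string is an assumption for installs without a git checkout.

```python
import subprocess

def conversion_metadata(script_name, repo_dir="."):
    """Record the converter's git HEAD and script name as llama_cpp.* keys."""
    try:
        git_hash = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, stderr=subprocess.DEVNULL, text=True,
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        # e.g. installed via brew or a release tarball, no .git directory.
        git_hash = "unknown"
    return {
        "general.llama_cpp.git_hash": git_hash,
        "general.llama_cpp.conversion_script": script_name,
    }
```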

To be clear these are just some ideas that I feel make sense, but I'm very much open to critique.

Galunid commented 1 month ago

As for the version, it's true that GGUF has one, but I consider it more of a GGUF standard (major) version. The one I propose is different in the sense that it's more of a minor revision used mainly by llama.cpp. For example, with the tokenizer changes where we added the tokenizer.pre meta-field, we didn't break the GGUF standard (major) version, so we shouldn't change it. I propose a minor version we could bump whenever we break backwards compatibility without changing the standard.

compilade commented 1 month ago

> I'd propose
>
> - general.llama_cpp.git_hash

An unfortunate effect of including the git commit hash of llama.cpp would be to make it impossible to compare the checksums of converted models across commits. And this is very useful to test for regressions when refactoring.

Galunid commented 1 month ago

> An unfortunate effect of including the git commit hash of llama.cpp would be to make it impossible to compare the checksums of converted models across commits. And this is very useful to test for regressions when refactoring.

I was thinking of adding a flag that could prevent that, or writing script like gguf-diff that could compare models in a way similar to other diff tools.

compilade commented 1 month ago

> or writing script like gguf-diff that could compare models in a way similar to other diff tools.

I'd be very interested in this, both non-interactively and interactively, to investigate differences. At first for metadata only, with a yes/no answer for tensor data. But then comparisons would be useful for tensor data too, like when comparing quantization (re)implementations. It would be nice to output hex samples of differing data, granulated on the block size of the quant formats. I'm not sure about cross-quantization diffing, though.

But no need for all of that at first (since it would add quite a bit of complexity).
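The metadata side of a hypothetical gguf-diff could start as simple as this sketch: compare two KV stores and report added, removed, and changed keys in a diff-like format. The function name and output format are assumptions; tensor-data diffing is not covered.

```python
def diff_metadata(a: dict, b: dict) -> list[str]:
    """Report KV differences between two models' metadata stores."""
    lines = []
    for key in sorted(set(a) | set(b)):
        if key not in b:
            lines.append(f"- {key} = {a[key]!r}")
        elif key not in a:
            lines.append(f"+ {key} = {b[key]!r}")
        elif a[key] != b[key]:
            lines.append(f"~ {key}: {a[key]!r} -> {b[key]!r}")
    return lines

old = {"general.name": "GPT-3", "general.license": "MIT"}
new = {"general.name": "GPT-3", "general.license": "Apache-2.0"}
print("\n".join(diff_metadata(old, new)))
```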

> I was thinking of adding a flag that could prevent that

Both being able to set it to a known commit hash and, at least at first, being able to prevent adding the commit hash would be necessary.


I've been thinking... if metadata is something that should be easily changeable in GGUF models without rewriting all the tensor data, would it be more convenient if the GGUF metadata KV were at the end of the file instead?

For backward compatibility, instead of changing the GGUF format too deeply, there could even be a normal GGUF metadata key at the beginning (along with some others) which would store an offset to the extended metadata key-value pairs.

Or that offset could be calculated from the tensor info section.

Might not be worth it, but this would allow easier metadata additions even to existing GGUF models.
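The offset idea could be sketched roughly as follows: reserve an offset field early in the file, then patch it once the tensor data length is known, so readers can seek straight to the appended metadata. The byte layout here (field positions, a 16-byte header) is purely illustrative and is not the GGUF spec; only the little-endian integer encoding matches GGUF.

```python
import struct

def read_extended_offset(blob: bytes, pos: int) -> int:
    # Offset stored as a little-endian uint64, matching GGUF's integer layout.
    (offset,) = struct.unpack_from("<Q", blob, pos)
    return offset

# Writing side: reserve 8 bytes in the (illustrative) 16-byte header,
# then patch them once the tensor data length is known.
header = bytearray(16)
tensor_data = b"\x00" * 128
struct.pack_into("<Q", header, 8, len(header) + len(tensor_data))
blob = bytes(header) + tensor_data + b"extended-kv..."

print(read_extended_offset(blob, 8))  # 144: extended KV starts after 16 + 128 bytes
```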

mofosyne commented 1 month ago

@compilade you might be interested in https://github.com/ggerganov/llama.cpp/discussions/7839, where I took a stab at hashing the weights of models to generate a UUID. Obviously my approach was an attempt at a hash that would ignore quantisation, but anyway, I hope that sparks some ideas.
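The quantisation-ignoring hash idea can be illustrated with a toy sketch: round weights coarsely before hashing so small quantisation error does not change the digest. The rounding granularity is an arbitrary assumption here, and this is not the approach used in the linked discussion or in llama.cpp.

```python
import hashlib
import struct

def weight_hash(weights, decimals=1):
    """Digest that is stable under small per-weight quantisation error."""
    h = hashlib.sha256()
    for w in weights:
        # Round before hashing so e.g. 0.5001 and 0.4999 hash identically.
        h.update(struct.pack("<d", round(w, decimals)))
    return h.hexdigest()

original  = [0.5001, -1.2498, 3.1401]
quantized = [0.4999, -1.2401, 3.1399]
print(weight_hash(original) == weight_hash(quantized))  # True
```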

What are you trying to compare when diffing models? Differences in metadata? Differences in layers (e.g. a mismatch in layer X)?