ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Refactor: Formalise Keys.General GGUF KV Store #7836

Open mofosyne opened 1 month ago

mofosyne commented 1 month ago

Background Description

During https://github.com/ggerganov/llama.cpp/pull/7499 it became clear that the KV store metadata needs further development.

We need an outline, and a consensus on that outline, for the KV store such that it is not too closely coupled with HuggingFace and is independent enough to serve GGUF use cases. Ideally we should be able to fetch all the details remotely as needed, or use the model card as a fallback.

Below is my first stab at listing the keys I thought about, along with the possible Hugging Face model card keys I could map them from.

| GGUF Key (Authorship Metadata Only) | Hugging Face Model Card Key | Example Value | Semantic Description |
|---|---|---|---|
| `general.name` | `model_name` | `"GPT-3"` | Name or title of the model |
| `general.author` | `model_creator` | `"TheBloke"` | Name(s) of the author(s) |
| `general.version` | `model_version` | `"v3.0"` | Version number of the model |
| `general.organization` | `model_organization` | `"OpenAI"` | Organization or institution associated with the model |
| `general.finetune` | `model_finetune` | `"Instruct"` | Finetune portion of the model filename |
| `general.basename` | `model_basename` | `"gpt3-base"` | Basename of the model filename |
| `general.description` | `model_description` | `"Large-scale language model."` | Brief description of the model and its use cases |
| `general.quantized_by` | `quantized_by` | `"OpenAI"` | Entity responsible for quantizing the model to GGUF |
| `general.parameter_class_attribute` | `model_parameter_class_attribute` | `"8x3B"` | Parameter weight class attribute of model parameters |
| `general.license` | `license` | `"MIT"` | License type under which the model is released |
| `general.license.name` | `license_name` | `"MIT License"` | Name of the license |
| `general.license.link` | `license_link` | `"https://opensource.org/licenses/MIT"` | Link to the full text of the license |
| `general.url` | - | `"https://openai.com/gpt-3"` | URL to the model website or paper |
| `general.doi` | - | `"10.1234/5678"` | Digital Object Identifier (DOI) of the model |
| `general.uuid` | - | `"123e4567-e89b-12d3-a456-426614174000"` | Universally Unique Identifier (UUID) of the model |
| `general.repo_url` | - | `"https://github.com/openai/gpt-3"` | URL to the model source repository |
| `general.source.url` | - | `"https://arxiv.org/abs/2005.14165"` | URL to the source website or paper |
| `general.source.doi` | - | `"10.1234/5678"` | DOI of the source |
| `general.source.uuid` | - | `"123e4567-e89b-12d3-a456-426614174000"` | UUID of the source |
| `general.source.repo_url` | - | `"https://github.com/openai/gpt-3"` | URL to the source repository |
| `general.base_model.count` | `base_model` (derived from id) | `2` | Number of base models used to create the model |
| `general.base_model.{id}.name` | `base_model` (derived from id) | `"BERT"` | Name or title of the base model |
| `general.base_model.{id}.author` | - | `"Google"` | Name(s) of the author(s) of the base model |
| `general.base_model.{id}.version` | `base_model` (derived from id) | `"3.0"` | Version number of the base model |
| `general.base_model.{id}.organization` | `base_model` (derived from id) | `"Google"` | Organization or institution associated with the base model |
| `general.base_model.{id}.url` | - | `"https://arxiv.org/abs/1810.04805"` | URL to the base model website or paper |
| `general.base_model.{id}.doi` | - | `"10.1234/5678"` | DOI of the base model |
| `general.base_model.{id}.uuid` | - | `"123e4567-e89b-12d3-a456-426614174000"` | UUID of the base model |
| `general.base_model.{id}.repo_url` | `base_model` (derived from id) | `"https://github.com/google/bert"` | URL to the base model source repository |
| `general.tags` | `tags` + `pipeline_tag` | `["NLP", "Language Modeling"]` | Tags associated with the model (e.g., categories) |
| `general.languages` | `language` | `["English", "Spanish"]` | Languages supported by the model |
| `general.datasets` | `datasets` | `["Wikipedia", "Common Crawl"]` | Datasets used to train the model |
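To make the table concrete, here is a minimal sketch of how structured metadata could be flattened into these `general.*` keys, including the indexed `general.base_model.{id}.*` namespace with its `count` key. The helper name `flatten_metadata` is hypothetical and not part of llama.cpp or the gguf package.

```python
def flatten_metadata(meta: dict) -> dict:
    """Flatten nested authorship metadata into general.* GGUF KV pairs."""
    kv = {}
    for key, value in meta.items():
        if key == "base_model":
            # Repeated entries use an indexed namespace plus a count key.
            kv["general.base_model.count"] = len(value)
            for i, base in enumerate(value):
                for field, v in base.items():
                    kv[f"general.base_model.{i}.{field}"] = v
        else:
            kv[f"general.{key}"] = value
    return kv

meta = {
    "name": "GPT-3",
    "license": "MIT",
    "base_model": [
        {"name": "BERT", "organization": "Google"},
        {"name": "RoBERTa", "organization": "Meta"},
    ],
}
kv = flatten_metadata(meta)
print(kv["general.base_model.count"])   # 2
print(kv["general.base_model.1.name"])  # RoBERTa
```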

Possible Refactor Approaches

No response

compilade commented 1 month ago

I think it's not clear enough what exactly general.source.* is, and how it differs from the keys with the same names directly under general.*.

I think general.doi and general.source.doi are redundant.

And repo_url vs url. There can be homepages, paper, code, and weights. Which goes where? What if they are in different places?

Also, general.license is specifically a license identifier from https://spdx.org/licenses/ (so MIT is correct), but not all licenses are there (e.g. the custom Llama license), so general.license.name and general.license.link make sense to exist.
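The SPDX-versus-custom-license split above could be handled with a small fallback rule: emit `general.license` only for known SPDX identifiers, otherwise use the free-form name/link keys. This is a sketch; the SPDX set here is a tiny illustrative subset of the real registry, and the `"other"` sentinel is an assumption based on the convention mentioned later in this thread.

```python
# Tiny subset of https://spdx.org/licenses/ for illustration only.
SPDX_IDS = {"MIT", "Apache-2.0", "GPL-3.0-only", "BSD-3-Clause"}

def license_keys(license_id, name=None, link=None):
    """Choose between an SPDX identifier and free-form license fields."""
    kv = {}
    if license_id in SPDX_IDS:
        kv["general.license"] = license_id
    else:
        # Custom licenses (e.g. the Llama license) are not on the SPDX
        # list, so store a free-form name and link instead.
        kv["general.license"] = "other"
        if name:
            kv["general.license.name"] = name
        if link:
            kv["general.license.link"] = link
    return kv
```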

mofosyne commented 1 month ago

> I think it's not clear enough what exactly general.source.* is, and how different it is from the keys with same names which are directly in general.*.

Hopefully this explains my mental model:

```mermaid
flowchart LR
    B1[Base Model 0] -- merged or finetuned --> source
    B2[Base Model 1] -- merged or finetuned --> source
    source[Source] --converted to gguf--> model[Main Model]
```

```mermaid
flowchart LR
    B1[general.base_model.0.*] -- merged or finetuned --> source
    B2[general.base_model.1.*] -- merged or finetuned --> source
    source[general.source.*] --converted to gguf--> model[general.*]
```

So basically, general.source.* applies when there is no 'training' or 'fine-tuning' but simply a format conversion, quantization, etc...

And if not converted but directly generated into a gguf file:

```mermaid
flowchart LR
    B1[Base Model 0] -- merged or finetuned --> model
    B2[Base Model 1] -- merged or finetuned --> model
    model[Main Model]
```

Or if it's just a straight-up new base model (safetensors etc...) converted to GGUF:

```mermaid
flowchart LR
    source[Source] --converted to gguf--> model[Main Model]
```

> I think general.doi and general.source.doi are redundant.

You think so? While they share the same weights, it's a different digital object (due to possible loss of accuracy during the quantization process).

> And repo_url vs url. There can be homepages, paper, code, and weights. Which goes where? What if they are in different places?

Yeah, my intent is that url refers to the "homepage, paper" while repo_url is primarily for the "code, weights".

> Also, general.license is specifically a license identifier from https://spdx.org/licenses/ (so MIT is correct), but not all licenses are there (e.g. the custom Llama license), so general.license.name and general.license.link make sense to exist.

Yeah, you got the idea why I added the extra two fields: multiple people were using 'other' in place of the license and putting the actual license in the two extra separate fields.


### Add support for an integrated model card?

Should we bake in support for model cards to be copied verbatim into the model? The issue is the tendency of model card writers to include external image links etc., so copying the model card markdown content might not be a good idea. But if we do, then these are the proposed additions.

The difference between our GGUF KV storage and the model card format is that while the KV key-value store is quite rigid, the model card metadata is quite freeform, which has its own advantages (but clashes with the needs of the KV store). So it might be worth copying the model card content in but keeping it somewhat separate.

Below are some of the possible fields we may want to include under general.model_card.*; we might not want to use all of these ideas. Happy to hear your thoughts on which ones are worth using.

The way this would work with, say, a 'model browser' is that it would allow the user to read the model card offline, e.g. a file browser could show the model card in a side panel when the model file is selected.
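A minimal sketch of the verbatim-embedding idea, addressing the external-image concern by replacing remote images with their alt text before storing the card. The regex and the key name `general.model_card.markdown` are assumptions for illustration, not a settled proposal.

```python
import re

# Matches markdown images pointing at remote URLs, capturing the alt text.
IMG_PATTERN = re.compile(r"!\[([^\]]*)\]\(https?://[^)]+\)")

def embed_model_card(markdown: str) -> dict:
    # Replace remote images with their alt text so the embedded card
    # stays readable offline.
    cleaned = IMG_PATTERN.sub(lambda m: m.group(1), markdown)
    return {"general.model_card.markdown": cleaned}

card = "# My Model\n![benchmark badge](https://img.example/badge.svg)\nDetails here."
print(embed_model_card(card)["general.model_card.markdown"])
```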

Galunid commented 1 month ago

I'd like to propose including the git commit that was used to convert the model, as well as the conversion script used. I think it'd be extremely useful for debugging issues with converted models that are shared via GGUF on the hub. Additionally, I think including some sort of versioning could be useful in case we need to break backwards compatibility. We could have a place in the code where we check whether model version > n and, if not, display a suitable warning in case there's something like the tokenizer change. This could also be used by Hugging Face to display some sort of badge for deprecated models, or to delete them.

TLDR:

mofosyne commented 1 month ago

@Galunid interesting. Could you also suggest a general.X key name, similar to my table above, so I at least know how to name it if it makes sense to include? This is my guess:

Also, I thought the GGUF file format already has versioning? [image: GGUF file structure]

Galunid commented 1 month ago

I'd propose

or something like that. I'd put them under a llama_cpp key (?) rather than source, since we are setting properties of llama.cpp rather than the source model's. To be clear, by git commit I meant llama.cpp's HEAD pointer in the llama.cpp directory. One problem is that some users may not use a git checkout, but a downloaded release or one installed via brew, so I think it'd be good to have a fallback to something else. As for the conversion script, I meant something like convert-hf-to-gguf.py or convert-legacy-llama.py, with the possibility of different values should other scripts appear. This could be hardcoded directly in the scripts, so it should be simple to implement.
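The HEAD-with-fallback idea could look something like this. It's a sketch, not llama.cpp code: the key names follow the proposal above, and the `"unknown"` fallback string is an assumption for installs without a git checkout.

```python
import subprocess

def conversion_metadata(script_name, repo_dir="."):
    """Record the converter's git HEAD and script name as llama_cpp.* keys."""
    try:
        git_hash = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, stderr=subprocess.DEVNULL, text=True,
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        # e.g. installed via brew or a release tarball, no .git directory.
        git_hash = "unknown"
    return {
        "general.llama_cpp.git_hash": git_hash,
        "general.llama_cpp.conversion_script": script_name,
    }
```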

To be clear these are just some ideas that I feel make sense, but I'm very much open to critique.

Galunid commented 1 month ago

As for the version, it's true that GGUF has one, but I consider it more of a GGUF standard (major) version. The one I propose is different in the sense that it's more of a minor revision used mainly by llama.cpp. For example, with the tokenizer changes where we added the tokenizer.pre meta-field, we didn't break the GGUF standard (major) version, so we shouldn't change it. I propose a minor version we could bump whenever we break backwards compatibility without changing the standard.

compilade commented 1 month ago

> I'd propose
>
> - general.llama_cpp.git_hash

An unfortunate effect of including the git commit hash of llama.cpp would be to make it impossible to compare the checksums of converted models across commits. And this is very useful to test for regressions when refactoring.

Galunid commented 1 month ago

> An unfortunate effect of including the git commit hash of llama.cpp would be to make it impossible to compare the checksums of converted models across commits. And this is very useful to test for regressions when refactoring.

I was thinking of adding a flag that could prevent that, or writing script like gguf-diff that could compare models in a way similar to other diff tools.

compilade commented 1 month ago

> or writing script like gguf-diff that could compare models in a way similar to other diff tools.

I'd be very interested in this, both non-interactively and interactively, to investigate differences. At first for metadata only, with a yes/no answer for tensor data. But then comparisons would be useful for tensor data too, like when comparing quantization (re)implementations. It would be nice to output hex samples of differing data, granulated on the block size of the quant formats. I'm not sure about cross-quantization diffing, though.

But no need for all of that at first (since it would add quite a bit of complexity).
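The metadata side of a hypothetical gguf-diff could start as simple as this sketch: compare two KV stores and report added, removed, and changed keys in a diff-like format. The function name and output format are assumptions; tensor-data diffing is not covered.

```python
def diff_metadata(a: dict, b: dict) -> list[str]:
    """Report KV differences between two models' metadata stores."""
    lines = []
    for key in sorted(set(a) | set(b)):
        if key not in b:
            lines.append(f"- {key} = {a[key]!r}")
        elif key not in a:
            lines.append(f"+ {key} = {b[key]!r}")
        elif a[key] != b[key]:
            lines.append(f"~ {key}: {a[key]!r} -> {b[key]!r}")
    return lines

old = {"general.name": "GPT-3", "general.license": "MIT"}
new = {"general.name": "GPT-3", "general.license": "Apache-2.0"}
print("\n".join(diff_metadata(old, new)))
```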

> I was thinking of adding a flag that could prevent that

Both being able to set it to a known commit hash and, at least at first, being able to prevent adding the commit hash would be necessary.


I've been thinking... if metadata is something that should be easily changeable in GGUF models without rewriting all the tensor data, would it be more convenient if the GGUF metadata KV were at the end of the file instead?

For backward compatibility, instead of changing the GGUF format too deeply, there could even be a normal GGUF metadata key at the beginning (along with some others) which would store an offset to the extended metadata key-value pairs.

Or that offset could be calculated from the tensor info section.

Might not be worth it, but this would allow easier metadata additions even to existing GGUF models.
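The offset idea could be sketched roughly as follows: reserve an offset field early in the file, then patch it once the tensor data length is known, so readers can seek straight to the appended metadata. The byte layout here (field positions, a 16-byte header) is purely illustrative and is not the GGUF spec; only the little-endian integer encoding matches GGUF.

```python
import struct

def read_extended_offset(blob: bytes, pos: int) -> int:
    # Offset stored as a little-endian uint64, matching GGUF's integer layout.
    (offset,) = struct.unpack_from("<Q", blob, pos)
    return offset

# Writing side: reserve 8 bytes in the (illustrative) 16-byte header,
# then patch them once the tensor data length is known.
header = bytearray(16)
tensor_data = b"\x00" * 128
struct.pack_into("<Q", header, 8, len(header) + len(tensor_data))
blob = bytes(header) + tensor_data + b"extended-kv..."

print(read_extended_offset(blob, 8))  # 144: extended KV starts after 16 + 128 bytes
```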

mofosyne commented 1 month ago

@compilade you might be interested in https://github.com/ggerganov/llama.cpp/discussions/7839, where I took a stab at hashing the weights of models to generate a UUID. Obviously my approach was an attempt at a hash that would ignore quantisation, but anyway, I hope that sparks some ideas.
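The quantisation-ignoring hash idea can be illustrated with a toy sketch: round weights coarsely before hashing so small quantisation error does not change the digest. The rounding granularity is an arbitrary assumption here, and this is not the approach used in the linked discussion or in llama.cpp.

```python
import hashlib
import struct

def weight_hash(weights, decimals=1):
    """Digest that is stable under small per-weight quantisation error."""
    h = hashlib.sha256()
    for w in weights:
        # Round before hashing so e.g. 0.5001 and 0.4999 hash identically.
        h.update(struct.pack("<d", round(w, decimals)))
    return h.hexdigest()

original  = [0.5001, -1.2498, 3.1401]
quantized = [0.4999, -1.2401, 3.1399]
print(weight_hash(original) == weight_hash(quantized))  # True
```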

What are you trying to compare when diffing models? Differences in metadata? Differences in layers (e.g. a mismatch in layer X)?