ggerganov / ggml

Tensor library for machine learning

Standardized prompting metadata #774

Open MoonRide303 opened 5 months ago

MoonRide303 commented 5 months ago

It would be nice to have standardized prompting metadata defined within GGUF files.

Currently, when importing a GGUF model into tools like Ollama, it's necessary to explicitly provide prompting metadata - like the template and stopping sequences specific to a given model, and sometimes also a default system message. It would be useful to include the most commonly used prompting parameters in the GGUF specification, so they could be read and used by applications like llama.cpp or Ollama.

My proposal, based on the Ollama Modelfile definition:

ggerganov commented 5 months ago

There is already tokenizer.chat_template : string that should contain all the necessary info

MoonRide303 commented 5 months ago

@ggerganov Those seem to cover about the same area, but don't Jinja templates require pre-processing? Can those be used directly as input to -p in llama.cpp, for example? I didn't notice them anywhere in Ollama Modelfiles, either.
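For context on what that pre-processing would look like, rendering a Jinja chat template against a list of messages might be done roughly like this (a minimal sketch using the jinja2 package; the template string is a simplified example, not taken from any particular model):

```python
from jinja2 import Template

# Simplified template in the spirit of tokenizer.chat_template (example only).
chat_template = (
    "{% for message in messages %}"
    "<|{{ message['role'] }}|>\n{{ message['content'] }}\n"
    "{% endfor %}"
    "<|assistant|>\n"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# Render the template into the flat prompt string that would actually be
# passed to the model (e.g. via -p).
prompt = Template(chat_template).render(messages=messages)
print(prompt)
```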

I am not sure if / how multiple types of templates should be supported. Maybe something like

Or, as a more flexible approach, just allow any set of types and templates via a prompting.templates map.
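One way such a map could look (a rough sketch; apart from prompting.templates, the key names and template strings here are illustrative assumptions, not part of the GGUF spec):

```python
# Illustrative only: hypothetical prompting metadata expressed as Python data.
prompting_metadata = {
    "prompting.system": "You are a helpful assistant.",  # default system message
    "prompting.stop": ["<|im_end|>", "</s>"],            # stopping sequences
    "prompting.templates": {                              # one template per prompt type
        "chat": "<|im_start|>{{ role }}\n{{ content }}<|im_end|>\n",
        "instruct": "[INST] {{ content }} [/INST]",
    },
}
```

Since GGUF metadata is a flat list of typed key-value pairs, a nested map like this would have to be flattened into individual keys (e.g. prompting.templates.chat) in an actual file.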

Just an idea, but I feel like standardizing this could make experimenting with different types of models a bit easier.

teleprint-me commented 5 months ago

@MoonRide303

The model files are converted from torch, which uses a zip format based on the Python pickle library. Most models are created using transformers, which is maintained by HuggingFace. The Python scripts used to convert the models to gguf embed this data into the model files. There are docs, references, and posts that detail this already, e.g. PR #765 and #302.

Because HuggingFace uses Jinja, which is an HTML templating system, the "chat" template is usually in the automatically generated tokenizer_config.json file. It's surprisingly non-trivial attempting to extract an instruct/chat model template from these files, as the information is usually inconsistent and unreliable. The reason for this is that the model creator decides how to configure all of these parameters before training begins; there is a deeper rationale for this, but it requires a more in-depth explanation that feels out of scope here. There are plenty of materials, including on HuggingFace, that break this down already.
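For illustration, the template (when it exists) is typically pulled straight out of that file - a minimal sketch, assuming a standard tokenizer_config.json with a chat_template key:

```python
import json

# Read the Jinja chat template from a Hugging Face tokenizer_config.json.
with open("tokenizer_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Not every model ships this key, and some store several named templates,
# which is part of why extraction is inconsistent and unreliable.
chat_template = config.get("chat_template")
print(chat_template)
```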

When the models are converted from torch to gguf, the relevant metadata is read into memory and then written to the converted model file. You can use the GGUFReader class to inspect that metadata; there is already an example file showing how to do this.
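Something along these lines (a minimal sketch assuming the gguf-py GGUFReader API; the exact field accessors may differ between gguf-py versions):

```python
from gguf import GGUFReader  # shipped in llama.cpp/gguf-py

reader = GGUFReader("model.gguf")

# Every metadata key written during conversion shows up in reader.fields;
# string values are stored as byte arrays inside the field's parts.
field = reader.fields.get("tokenizer.chat_template")
if field is not None:
    template = bytes(field.parts[field.data[0]]).decode("utf-8")
    print(template)
```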

You can get a feel for how a gguf model file is structured in more detail by referencing the gguf.constants module in llama.cpp/gguf-py.
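The metadata key names themselves are defined there as constants - for example (assuming a gguf-py version where the tokenizer keys are grouped under Keys.Tokenizer):

```python
from gguf.constants import Keys

# The canonical key under which the chat template is stored.
print(Keys.Tokenizer.CHAT_TEMPLATE)  # -> "tokenizer.chat_template"
```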

I hope this provides some clarity into the current implementation.

As an aside, I would've personally used a Mapping, not Jinja2. I have no idea what the rationale behind this was, but the consequence is that we're all stuck dealing with it now.