ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Quantization does not write the quantization version to `ftype` #1590

Closed · philpax closed this issue 1 year ago

philpax commented 1 year ago

Expected Behavior

When quantizing with llama.cpp, the quantization version should be written to the ftype in the hyperparameters.

Current Behavior

An ftype is produced by llama_model_quantize_internal and passed through as-is to llama_file_saver, which writes it to disk without encoding it with GGML_QNT_VERSION:

https://github.com/ggerganov/llama.cpp/blob/ac7876ac20124a15a44fd6317721ff1aa2538806/llama.cpp#L2052-L2068

https://github.com/ggerganov/llama.cpp/blob/ac7876ac20124a15a44fd6317721ff1aa2538806/llama.cpp#L557

Loaders that expect the quantization version, such as llm, detect a quantization version of 0:

     Running `target/release/llm llama info -m models/llama/7B/koala-7B.ggmlv3.q5_1.bin`
[2023-05-25T00:10:05Z INFO  llm] Container type: Ggjt(3)
[2023-05-25T00:10:05Z INFO  llm] Hyperparameters: Hyperparameters { n_vocab: 32000, n_embd: 4096, n_mult: 256, n_head: 32, n_layer: 32, n_rot: 128, file_type: FileType { format: MostlyQ5_1, quantization_version: 0 } }
[2023-05-25T00:10:05Z INFO  llm] Vocabulary size: 32000

Environment and Context

This was reproduced on https://github.com/ggerganov/llama.cpp/commit/ac7876ac20124a15a44fd6317721ff1aa2538806. I initially detected this when testing with one of the models on HuggingFace, then re-quantized a model locally to test it for myself.

Steps to Reproduce

  1. make
  2. ./quantize ggml-model-f16.bin ggml-model-f16-q4_0.bin q4_0
  3. Check the ftype in the written hyperparameters (see the header-inspection sketch below).
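
A minimal sketch of step 3, for anyone reproducing this: it reads the header fields directly, assuming the GGJT layout used by the llama.cpp loader linked above (magic, version, then seven u32 hyperparameters ending with ftype) and GGML_QNT_VERSION_FACTOR = 1000 from ggml.h. Illustrative only, not code from the repo.

```cpp
// check_ftype.cpp: inspect the ftype field written by ./quantize (sketch only)
#include <cstdint>
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.bin\n", argv[0]);
        return 1;
    }
    FILE * f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }

    uint32_t magic = 0, version = 0;
    uint32_t hparams[7] = {0}; // n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, ftype (assumed order)
    if (fread(&magic,   sizeof(magic),    1, f) != 1 ||
        fread(&version, sizeof(version),  1, f) != 1 ||
        fread(hparams,  sizeof(uint32_t), 7, f) != 7) {
        fprintf(stderr, "failed to read header\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    const uint32_t ftype = hparams[6];
    // If the quantization version were encoded, it would be ftype / 1000
    // (GGML_QNT_VERSION_FACTOR); as reported in this issue, it comes out as 0.
    printf("magic=0x%08x version=%u ftype=%u qnt_version=%u\n",
           magic, version, ftype, ftype / 1000);
    return 0;
}
```
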
mgroeber9110 commented 1 year ago

Looking at llm it seems that the intended format for ftype is

ftype = new_type + GGML_QNT_VERSION*GGML_QNT_VERSION_FACTOR

As far as I can see, this also requires updating the loader in llama.cpp (read_hparams), as this currently does not strip the quantization version from the ftype:

hparams.ftype = (enum llama_ftype) file.read_u32();

So, I think this would need to be treated as a breaking change, as the code currently neither writes nor expects to read GGML_QNT_VERSION - right now, GGML_QNT_VERSION_FACTOR is completely ignored.

Should read_hparams throw an exception if the quantization_version in the file being loaded does not match GGML_QNT_VERSION of the current code?
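
For illustration, a sketch of the encode/decode pair described above, assuming only GGML_QNT_VERSION and GGML_QNT_VERSION_FACTOR from ggml.h (the helper names are hypothetical, not existing llama.cpp API):

```cpp
#include <cstdint>

#include "ggml.h" // GGML_QNT_VERSION, GGML_QNT_VERSION_FACTOR

// Write side: what llama_file_saver would store instead of the raw ftype.
static uint32_t encode_ftype(uint32_t new_type) {
    return new_type + GGML_QNT_VERSION * GGML_QNT_VERSION_FACTOR;
}

// Read side: what read_hparams would do to strip the version back out.
static void decode_ftype(uint32_t raw, uint32_t & ftype, uint32_t & qnt_version) {
    ftype       = raw % GGML_QNT_VERSION_FACTOR;
    qnt_version = raw / GGML_QNT_VERSION_FACTOR;
}
```

Comparing the decoded qnt_version against GGML_QNT_VERSION in read_hparams would then be the natural place to raise the error asked about above.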

philpax commented 1 year ago

What we did was to treat GGJTv2 as QNT1 and GGJTv3 as QNT2; https://github.com/rustformers/llm/blob/main/crates/llm-base/src/loader.rs#L386-L394

The llama.cpp loader can make the same assumption and behave correctly from now on (including raising an error if the versions don't match). The only point of concern is that other llama.cpp-based loaders aren't going to modulo the ftype until they update, so they'll read an invalid ftype.

In practice, I'm not sure if this would be an issue; iirc the ftype isn't actually used much, as the tensors carry their own type and ftype isn't used to decide anything important. (I could be wrong! I haven't looked into this recently)
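
For reference, that assumption as a sketch (not actual loader code), keyed off the raw container version u32 from the ggjt header:

```cpp
#include <cstdint>

// Map the GGJT container version to a quantization version, mirroring llm's
// behaviour described above. Sketch only; llama.cpp's loader uses its own
// internal file-version enum rather than the raw u32.
static uint32_t qnt_version_from_ggjt_version(uint32_t ggjt_version) {
    switch (ggjt_version) {
        case 2:  return 1; // GGJTv2 -> QNT1
        case 3:  return 2; // GGJTv3 -> QNT2
        default: return 0; // GGJTv1 and older containers: treat as unversioned
    }
}
```
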

@LostRuins would koboldcpp have any issues with this? Does it attach any existing semantic meaning to the ftype?

LostRuins commented 1 year ago

@philpax No, I've had no issues determining the ftype in the file so far.

The modulo on ftype is only required for ggml magic files (0x67676d6c), not ggjt (0x67676a74); for ggjt I just read the magic, ftype, and file version directly.

This is because ggjt stores the magic, the version (LLAMA_FILE_VERSION), and the ftype as separate fields (https://github.com/ggerganov/llama.cpp/blob/master/llama.cpp#L545), whereas ggml multiplexes the ftype and version into the same field. Ref: https://github.com/ggerganov/ggml/issues/150#issuecomment-1546625668

So you should not be using GGML_QNT_VERSION_FACTOR at all for ggjt; it is meant for ggml.

My implementation: https://github.com/LostRuins/koboldcpp/blob/concedo/model_adapter.cpp#L220
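
Condensed, that distinction reads roughly like this (a sketch assuming GGML_QNT_VERSION_FACTOR == 1000; the helper is hypothetical, not koboldcpp's or llama.cpp's actual code):

```cpp
#include <cstdint>

// How a loader can interpret the raw ftype field depending on the file magic.
static void interpret_ftype(uint32_t magic, uint32_t raw_ftype,
                            uint32_t & ftype, uint32_t & qnt_version) {
    if (magic == 0x67676d6c) {            // 'ggml': version is multiplexed into ftype
        ftype       = raw_ftype % 1000;   // 1000 == GGML_QNT_VERSION_FACTOR
        qnt_version = raw_ftype / 1000;
    } else {                              // 'ggjt' (0x67676a74): ftype is stored as-is
        ftype       = raw_ftype;
        qnt_version = 0;                  // derive it from the file's own version field instead
    }
}
```
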

philpax commented 1 year ago

I suspect that wasn't an intentional design choice, as GGML_QNT_VERSION_FACTOR is used for all of the other model implementations and there's nothing stopping the use of GGJTv2/3 with those models (llm can load both GGML and GGJTv3). @ggerganov can you confirm what the ideal behaviour should be here?

(If this leads to a format break, which I hope it doesn't, or at least that it's seamless if it does, I'd strongly suggest splitting the ftype and quantization version into separate fields; there's no reason to multiplex them if strict compatibility isn't necessary.)

ggerganov commented 1 year ago

The explanation by @LostRuins is correct: for the ggjt format used by LLaMA derivatives, the version is stored separately. The ggml format had no version field, so adding one would have caused a breaking change. Therefore, we decided to encode GGML_QNT_VERSION into the ftype field by multiplying it with GGML_QNT_VERSION_FACTOR.

We can do the same encoding for the ftype field in ggjt in order to be consistent in the way we store ftype. This will not be a breaking change.

But adding a version field to ggml would be a breaking change. I'm OK with that, but I think it's better to avoid it.

philpax commented 1 year ago

Ah okay, that makes sense. Well, given that, it'd be nice to write the quantization version in the ftype, but it doesn't matter too much as we can determine it from the format version.

With that being said, if https://github.com/ggerganov/llama.cpp/issues/1575#issuecomment-1566196582 / https://github.com/ggerganov/ggml/issues/147 gets implemented, I'd suggest moving the file format loader/saver to ggml, and sharing the format code, including Python conversion code, between all models including llama.cpp.

This would have a few benefits:

LostRuins commented 1 year ago

@philpax I'm personally more in the camp of "if it's not broken, don't fix it": given that the version+ftype multiplexing has already been added to the existing ggml format and is currently working, I'm not really in favor of changing it.

I'm wondering what you mean by "tell the user it's unsupported instead of guessing": if everything uses the same ggjt magic, you still won't be able to tell a NeoX model from a LLaMA ggjt model except maybe by binning the vocab ranges.

Edit:

That said, switching from a packed struct to the key-value metadata you suggested in https://github.com/ggerganov/llama.cpp/issues/1575 seems like a good idea if it can be implemented robustly. A few fields should be enforced for all models for a model to be considered compliant. It would certainly be good to definitively tell apart the different architectures, such as GPT-2 and GPT-J.

A final magic change should be considered if this ends up being implemented.

philpax commented 1 year ago

Yup, you got it - if we have kv metadata, we can store the architecture in there too, and that should finally solve the issues we've been having around identifying models.

LostRuins commented 1 year ago

But it has to be consistent. Leaving the standard freely extensible but undefined can rapidly lead to fracturing formats as each implementation adds its own keys. That's how you get monstrosities in modern JavaScript such as

var event = event || window.event; var charCode = event.keyCode || event.which;

when every browser adds its own approach.

The last thing we need is for some models to use "ftype" and others to use "filetype", "quanttype", "qType", or "quantTypeFormat".

Random ideas: should the keys be stored as ASCII strings (fixed length? nul-terminated?)? Will it handle Unicode? (JSON keys permit that.) What's your opinion on case-insensitive keys?

philpax commented 1 year ago

Hm yeah, good points. Okay, how about snake_case ASCII keys with no nul termination (the length comes before the string, which I think is consistent with how tensors are stored)? If there's demand for Unicode keys, that can be added later, but I haven't seen much of that in the HF configs. The values can store as much UTF-8 as they want, but the keys are constrained to snake case.

I don't want to be too prescriptive on key length, but 256 characters max should be more than sufficient. You could maaaybe exceed that if you were specifying parameters for individual tensors and needed to include their full name, but I think that's a problem we'll deal with when we get to it.
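
For concreteness, a straw-man sketch of what such a length-prefixed key/value record could look like under those constraints; none of this is a finalized format:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// On disk (straw-man): [u32 key_len][key bytes, snake_case ASCII]
//                      [u32 val_len][value bytes, arbitrary UTF-8 or typed payload]
struct kv_record {
    std::string          key;   // <= 256 chars, no nul terminator
    std::vector<uint8_t> value;
};

// Validate a key against the proposed constraints: lowercase snake_case ASCII.
static bool is_valid_key(const std::string & key) {
    if (key.empty() || key.size() > 256) {
        return false;
    }
    for (const char c : key) {
        const bool ok = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c == '_';
        if (!ok) {
            return false;
        }
    }
    return true;
}
```
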

By the way, we're having this discussion across several issues. @ggerganov do you mind if I create one issue on ggml for this proposed new model format, and then we can close this / #1575 / https://github.com/ggerganov/ggml/issues/147 ?

ggerganov commented 1 year ago

@philpax Yes, let's consolidate the discussion into a new ggml issue. Start with the idea from here: https://github.com/ggerganov/llama.cpp/issues/1575#issuecomment-1566196582

ekdnam commented 1 year ago

I would like to solve this. Can anyone please guide me?

ggerganov commented 1 year ago

https://github.com/ggerganov/llama.cpp/pull/2398