maruel opened this issue 3 months ago
I agree that this function needs to be refactored; it is essentially unmaintainable as it is. I would prefer a data-driven implementation, something similar to #6844, in which the default quantization schemes are defined in the same format and can be used as a base to create other quantization schemes. In the long term, we should look into creating a tool to automatically find the best quantization schemes for a given model, but doing that in a reasonable amount of time may be challenging.

Regarding the ftype names, I don't really see a reason to change them; they can remain as names for the default quantization schemes. Custom quantization schemes could include the effective BPW in their name.
> Some parts are dense (what is `use_more_bits()`'s purpose exactly?) and take a while to truly understand.

This is a really weird bit of code and I don't really see how every 3rd layer would benefit from the extra bump in quant for the `down_proj`?
I did notice a couple of other things with this function that need looking at:

- The tests are hard-coded for `n_experts == 8`, and some of the newer MoE models with many more experts are falling through.
- `deepseek-v2` models are falling through too.

> In the long term, we should look into creating a tool to automatically find the best quantization schemes for a given model, but doing that in a reasonable amount of time may be challenging.
I forgot all about this, but I did think of a way that might not be as computationally intensive:
If this is still not feasible, then the next best thing would be to look at the existing `llama_tensor_get_type()` and manually extract the features it encoded (e.g. layer index, layer number, n_experts, chosen quant based on ftype, and so on), take lots of random samples of parameter sets, measure the perplexity of these, and then try to create a regression model from this.

It wouldn't be perfect, but it would be a lot better than the current `llama_tensor_get_type()` function and likely not that hard to generate.
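As a very rough illustration of the sampling half of that idea, a candidate quantization scheme could be encoded as a small vector of per-tensor-class choices, with random schemes drawn for quantization and perplexity measurement. This is hypothetical scaffolding only (none of these names exist in llama.cpp), and the perplexity run and the regression fit are left out:

```cpp
#include <random>
#include <vector>

// Hypothetical feature/choice vector: one quant-level index per tensor class,
// mirroring the kinds of features llama_tensor_get_type() currently keys on.
struct quant_scheme {
    int attn_kv;      // choice for attn_k / attn_v
    int attn_qo;      // choice for attn_q / attn_output
    int ffn_down;     // choice for ffn_down
    int ffn_gate_up;  // choice for ffn_gate / ffn_up
};

// Draw random candidate schemes; each one would then be used to quantize the
// model, and the measured perplexity becomes the regression target.
static std::vector<quant_scheme> sample_schemes(int n, int n_levels, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick(0, n_levels - 1);
    std::vector<quant_scheme> out;
    out.reserve(n);
    for (int i = 0; i < n; ++i) {
        out.push_back({pick(rng), pick(rng), pick(rng), pick(rng)});
    }
    return out;
}
```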
One other benefit of the regression tree method would be the possibility of bridging the large gap between all 5-bit and 4-bit quants (and to a lesser extent the gaps between other quants).
So after reading this: https://old.reddit.com/r/LocalLLaMA/comments/1efg2wv/llama_31_405b_exl2_quant_results/ and semi-inspired by the guy on huggingface who keeps aggressively posting "his" "fp16 embed quants" everywhere... and the llama-3.1 paper that mentions not quantizing the first and last layers... I did a few experiments with multi-POV story-writing using this hacked version of `llama_tensor_get_type()`:
```cpp
// ### JUK ###
if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q6_K) {
    if (name == tn(LLM_TENSOR_OUTPUT, "weight") || name == tn(LLM_TENSOR_TOKEN_EMBD, "weight")) {
        // output and token-embedding tensors: always Q8_0
        new_type = GGML_TYPE_Q8_0;
    }
    else if (name.find("attn_k.weight") != std::string::npos || name.find("attn_v.weight") != std::string::npos) {
        // attention K/V projections: always Q8_0
        new_type = GGML_TYPE_Q8_0;
    }
    else if (name.find("attn_q.weight") != std::string::npos || name.find("attn_output.weight") != std::string::npos) {
        // attention Q/output projections: always Q8_0
        new_type = GGML_TYPE_Q8_0;
    }
    else if (name.find("ffn_down") != std::string::npos) {
        // FFN down projection: Q6_K for all three ftypes
        new_type = GGML_TYPE_Q6_K;
    }
    else if (name.find("ffn_gate") != std::string::npos || name.find("ffn_up") != std::string::npos) {
        // FFN gate/up projections: the only tensors kept at the base ftype's level
        if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
            new_type = GGML_TYPE_Q5_K;
        }
        else {
            new_type = GGML_TYPE_Q6_K;
        }
    }
    else {
        throw std::runtime_error(format("Unhandled tensor type: %s", name.c_str()));
    }
}
else
// ### JUK ###
```
(added at the top of the function to basically override all `Q4_K_M`, `Q5_K_M` and `Q6_K` quants)
I came up with this after spending a couple of hours trying to ascertain the original design choices of the function (mainly to adapt for newer MoEs with > 8 experts to start with).
For `command-r-plus:104b`, `wizard-lm:8x22b` and `mistral-large:123b`, I've found that around 8-10k tokens in (8-10 user/bot interactions) with 4-5 distinct character POVs, the `Q4_K_M`, `Q5_K_M` and even `Q6_K` quants all start to get confused and mix up the different characters' POVs. The names will start to get mixed up and the stories all start to blend together. (`Q6_K` --> `Q5_hacked`, etc.)

I also used an imatrix for these and IIRC the imatrix code for the `Q6_K` function is commented out and the `Q8_0` function doesn't use the imatrix anyway, so in effect the only place the imatrix is being used for my hacked version is for the `ffn_up` and `ffn_gate` tensors.
I can't really test the smaller models as they tend to get confused with multi-POV story-writing even at fp16
, but I'm sure there must be some similar test that could be used to see if they too are suffering from the same problem...
NOTE: This is not a very scientific test, but I do think this function desperately needs bringing up to date... I suspect a lot of the wisdom that went into it was from a time when models were much smaller and had way lower effective context lengths compared to what we have now (and only 8-expert MoEs due to the hard-coded tests). I also think that any perplexity test, or otherwise, may need to look much further into the context to see the effects of different quantisation strategies in the future.
Interesting article related to this linked on reddit today:
https://fireworks.ai/blog/fireworks-quantization
The more I think about this the more I think single-token metrics are deeply flawed:
In terms of single-token metrics, both of these will likely just show up as a small mistake in a single token's distribution, but the effects are way more severe due to the auto-regressive generation... Two mistakes with the same KL-divergence can have massively different effects on the future tokens.
I am quite keen to see if we can automate `llama_tensor_get_type()` and I'm pretty sure it is quite doable (at least as a proof of concept on small ~7b models), but without some good metric to work against I think it's just going to end up optimising the wrong thing... :/
@jukofyork Thanks for the reference. It was a great read. Still it's funny that they use fp16 (and not bf16) and fp8, which isn't necessarily best.
I'd advocate to separate the task in two:
1. My initial request is about refactoring the current code for maintainability and readability **without changing its behavior**.
2. The following discussion is about improving the algorithms.

I'd advocate to have the second discussion as a separate issue. As such, my preference would be to focus on how to make sure that we can refactor the code to be more readable and scalable while ensuring the quantization selected for each tensor doesn't change for any model. wdyt?
Yeah, I agree, but the hard-coded tests for `n_experts == 8` or `n_experts <= 8` should probably be fixed - the newer MoE models are currently falling through the tests that bump the `v_proj` matrix, which is likely hurting them quite significantly for a minuscule change in the final model size (which was the whole point of the bumps, from reading the source).

Also, a couple of the tensor name sub-strings should probably be changed so as to catch the low-rank attention matrices used by the recent `deepseek-v2` models too.
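For illustration only, a broadened substring check might look like the sketch below; the `attn_kv_a` / `attn_kv_b` substrings are placeholders I have not verified against the actual deepseek-v2 tensor names.

```cpp
#include <string>

// Illustrative only: give the low-rank attention projections the same bump the
// regular K/V projections get. The substrings are placeholders, not verified
// deepseek-v2 tensor names.
static bool is_low_rank_attn_tensor(const std::string & name) {
    return name.find("attn_kv_a") != std::string::npos ||
           name.find("attn_kv_b") != std::string::npos;
}
```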
> Would it be valuable to be able to calculate the bpw instead of hardcoding?

The exact BPW for each model and quantization is already provided in the logs:

```
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name = llama2
```
Not sure if we could do better since the mixture of quantization types can be different for different model architectures, even if the ftype is the same.
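For what it's worth, the 4.54 BPW in that log line is just the total tensor bytes expressed in bits divided by the parameter count, so an effective BPW could in principle be computed rather than hard-coded in `llama_model_ftype_name()`. A minimal standalone sketch of the arithmetic (not llama.cpp API):

```cpp
#include <cstdint>
#include <cstdio>

// Effective bits-per-weight: total bytes of tensor data, in bits, divided by
// the total number of parameters.
static double effective_bpw(uint64_t total_tensor_bytes, uint64_t n_params) {
    return 8.0 * (double) total_tensor_bytes / (double) n_params;
}

int main() {
    // Numbers from the log excerpt above: 3.56 GiB and 6.74 B parameters.
    const uint64_t bytes  = (uint64_t) (3.56 * 1024.0 * 1024.0 * 1024.0);
    const uint64_t params = (uint64_t) 6.74e9;
    printf("%.2f BPW\n", effective_bpw(bytes, params)); // prints ~4.54 BPW
    return 0;
}
```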
> Is the entropy accumulation and risk of unintended consequence in `llama_model_ftype_name()` considered a problem?

> Do we want to provide flexibility for each tensor quantization level without having to recompile? e.g. https://github.com/ggerganov/llama.cpp/pull/6844
Yes, we can improve this. I also like the approach in #6844, though the PR got stale.
> Extract the code into a standalone (its own .cpp file) reusable function so it can be built as a stand-alone executable to be called by tools.

We can do that.

> Create a table-driven unit test for all existing known combinations to ensure new modifications do not have unintended side effects.
> - Automate the generation of the table-driven unit test for ease of development. (I like this one)

Sounds good too.

> Convert the function into a formal finite-state machine with explicit internal states.
> Convert the function into a table-driven machine, i.e. the source of truth becomes pure data, where conditions are encoded as data. (I like this one) This is a fully generalized form instead of a one-off as done in https://github.com/ggerganov/llama.cpp/pull/6844.

I'm not sure I understand these - hopefully not over-engineering.

> Generate the function via another language. Python or Starlark would be potential choices.

Most likely no.
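To make the table-driven option above a bit more concrete, here is a minimal, purely illustrative sketch of encoding per-tensor overrides as data instead of nested conditionals. None of these type or field names exist in llama.cpp, and plain `int`s stand in for the real `llama_ftype` / `ggml_type` enums:

```cpp
#include <string>
#include <vector>

// Hypothetical rule: if the tensor name contains name_substr (and the rule's
// ftype matches, when it is not the wildcard -1), use target_type.
struct tensor_quant_rule {
    int         ftype;        // e.g. LLAMA_FTYPE_MOSTLY_Q4_K_M, or -1 for "any"
    std::string name_substr;  // e.g. "ffn_down"
    int         target_type;  // e.g. GGML_TYPE_Q6_K
};

// First matching rule wins; otherwise fall back to the ftype's default type.
static int pick_tensor_type(const std::vector<tensor_quant_rule> & rules,
                            int ftype, const std::string & name, int default_type) {
    for (const auto & r : rules) {
        if (r.ftype != -1 && r.ftype != ftype) {
            continue;
        }
        if (name.find(r.name_substr) != std::string::npos) {
            return r.target_type;
        }
    }
    return default_type;
}
```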
Thanks for the feedback @ggerganov ! If someone wants to sign up right away, please express your interest. I'll do the same if my schedule opens up.
I did a quick check for the task "move quantization into its own source file" as I believe it's a good first step to get things in order. It seems like the cleanest way to splice (IMHO) is to move `llama_model_quantize()` into a new file named `src/llama-quantize.cpp`. The rationale is that quantization is a stand-alone operation, while dequantization is part of normal serving. Moving this function requires a ton of support code and structs to become available outside of `src/llama.cpp`. If it were the route to go, here's a subset of the steps I identified:

- `enum llm_arch` should be moved into a stand-alone header file, e.g. `src/gguf.h` (since it's internal), as it could become generated by gguf-py, with `MODEL_ARCH` in `gguf-py/gguf/constants.py` becoming the source of truth.
- `struct llama_model_loader` and `struct LLM_KV` have to become available in `llama-impl.h` so `llama-quantize.cpp` can use them, along with the declaration of `llm_load_arch()`. There's more than that but I didn't go farther.
- `format()` will have to be declared in `llama-impl.h`.

Before doing any PR, @ggerganov does that sound like a good direction? My gut feeling is that the `enum llm_arch` migration should be done as a stand-alone PR first. The rest though could create a fair amount of churn, so I don't know if you think it's worth the price. wdyt?
> Then struct llama_model_loader and struct LLM_KV have to become available in llama-impl.h so llama-quantize.cpp can use it, along the declaration of llm_load_arch(). There's more than that but I didn't go farther.

Moving `llama_model_loader` would drag a lot of stuff with it. It might be better to start separating just the quantization logic that is applied per tensor and keep the model loading in `llama.cpp` for now.
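A rough sketch of what that narrower split could look like: only the per-tensor type decision is exported from a new translation unit, while model loading stays where it is. The header name and function signature below are hypothetical, not the current llama.cpp API.

```cpp
// llama-quantize-rules.h -- hypothetical header name, not part of llama.cpp today.
#pragma once

#include <string>

#include "ggml.h"   // enum ggml_type
#include "llama.h"  // enum llama_ftype

// Hypothetical signature: decide the quantization type for a single tensor from
// the requested ftype, the tensor name, and the metadata the current rules use.
enum ggml_type llama_tensor_get_type_for(
        enum llama_ftype    ftype,
        const std::string & tensor_name,
        int                 n_layer,
        int                 i_layer,
        int                 n_expert);
```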
This issue was closed because it has been inactive for 14 days since being marked as stale.
Dang, I still planned to work on it.
Background Description
`llama_tensor_get_type()` in src/llama.cpp is nearly 300 lines of conditions and has a bit of inconsistent logic (and formatting). Some parts are dense (what is `use_more_bits()`'s purpose exactly?) and take a while to truly understand. The chances of unintended side effects when changing the code are high.

There's no way to get the effective bpw for variable quantization forms Qn_K, so it is hardcoded in `llama_model_ftype_name()` in src/llama.cpp and examples/quantize/quantize.cpp instead of being calculated. While for Qn_0 and Qn_1 flavors it's not a problem, it is a problem with Qn_K flavors where there's a judgement call on how deep each tensor is quantized. Furthermore, the Qn_K flavors are underspecified and the function is the only source of truth of what the end result (e.g. Q5_K_M vs Q5_K_L) should look like.
As quantization techniques continue to evolve, this discrepancy will become more difficult to manage; it's essentially an entropy accumulator. An example of desire for flexibility is https://github.com/ggerganov/llama.cpp/pull/6844.
Questions:

- Would it be valuable to be able to calculate the bpw instead of hardcoding?
- Is the entropy accumulation and risk of unintended consequence in `llama_model_ftype_name()` considered a problem?
- Do we want to provide flexibility for each tensor quantization level without having to recompile? e.g. https://github.com/ggerganov/llama.cpp/pull/6844

If all are "no" for maintainers, then we can close this request.
Possible Refactor Approaches
I don't have a strong personal opinion on what the end state of `llama_tensor_get_type()` should look like. Here are possible options (some far-fetched), and maybe a combination of these would be a path forward:

- Extract the code into a standalone (its own .cpp file) reusable function so it can be built as a stand-alone executable to be called by tools.
- Create a table-driven unit test for all existing known combinations to ensure new modifications do not have unintended side effects.
- Automate the generation of the table-driven unit test for ease of development. (I like this one)
- Convert the function into a formal finite-state machine with explicit internal states.
- Convert the function into a table-driven machine, i.e. the source of truth becomes pure data, where conditions are encoded as data. (I like this one) This is a fully generalized form instead of a one-off as done in https://github.com/ggerganov/llama.cpp/pull/6844.
- Generate the function via another language. Python or Starlark would be potential choices.

I do not believe this is a performance-critical function? Its run time only has to be negligible, which could color the option(s) chosen.