maruel opened this issue 3 months ago
I agree that this function needs to be refactored; it is essentially unmaintainable as it is. I would prefer a data-driven implementation, something similar to #6844, in which the default quantization schemes are defined in the same format and can be used as a base to create other quantization schemes. In the long term, we should look into creating a tool to automatically find the best quantization schemes for a given model, but doing that in a reasonable amount of time may be challenging.

Regarding the ftype names, I don't really see a reason to change them; they can remain as names for the default quantization schemes. Custom quantization schemes could include the effective BPW in their name.
> Some parts are dense (what is `use_more_bits()`'s purpose exactly?) and take a while to truly understand.

This is a really weird bit of code and I don't really see how every 3rd layer would benefit from the extra bump in quant for the `down_proj`?
I did notice a couple of other things with this function that need looking at:

- The tests are hard-coded for `n_experts == 8`, and some of the newer MoE models with many more experts are falling through.
- `deepseek-v2` models are falling through too.

> In the long term, we should look into creating a tool to automatically find the best quantization schemes for a given model, but doing that in a reasonable amount of time may be challenging.
I forgot all about this, but I did think of a way that might not be as computationally intensive:
If this is still not feasible, then the next best thing would be to look at the existing `llama_tensor_get_type()` and manually extract the features it encoded (e.g. layer index, layer number, n_experts, chosen quant based on ftype, and so on), take lots of random samples of parameter sets, measure the perplexity of these, and then try to create a regression model from this.

It wouldn't be perfect, but it would be a lot better than the current `llama_tensor_get_type()` function and likely not that hard to generate.
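As a very rough illustration of the sampling half of that idea, a candidate quantization scheme could be encoded as a small vector of per-tensor-class choices, with random schemes drawn for quantization and perplexity measurement. This is hypothetical scaffolding only (none of these names exist in llama.cpp), and the perplexity run and the regression fit are left out:

```cpp
#include <random>
#include <vector>

// Hypothetical feature/choice vector: one quant-level index per tensor class,
// mirroring the kinds of features llama_tensor_get_type() currently keys on.
struct quant_scheme {
    int attn_kv;      // choice for attn_k / attn_v
    int attn_qo;      // choice for attn_q / attn_output
    int ffn_down;     // choice for ffn_down
    int ffn_gate_up;  // choice for ffn_gate / ffn_up
};

// Draw random candidate schemes; each one would then be used to quantize the
// model, and the measured perplexity becomes the regression target.
static std::vector<quant_scheme> sample_schemes(int n, int n_levels, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick(0, n_levels - 1);
    std::vector<quant_scheme> out;
    out.reserve(n);
    for (int i = 0; i < n; ++i) {
        out.push_back({pick(rng), pick(rng), pick(rng), pick(rng)});
    }
    return out;
}
```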
One other benefit of the regression tree method would be the possibility of bridging the large gap between all 5-bit and 4-bit quants (and to a lesser extent the gaps between other quants).
So after reading this: https://old.reddit.com/r/LocalLLaMA/comments/1efg2wv/llama_31_405b_exl2_quant_results/ and semi-inspired by the guy on huggingface who keeps aggressively posting "his" "fp16 embed quants" everywhere... and the llama-3.1 paper that mentions not quantizing the first and last layers... I did a few experiments with multi-POV story-writing using this hacked version of `llama_tensor_get_type()`:
```cpp
// ### JUK ###
if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q6_K) {
    if (name == tn(LLM_TENSOR_OUTPUT, "weight") || name == tn(LLM_TENSOR_TOKEN_EMBD, "weight")) {
        // output and token-embedding tensors: always Q8_0
        new_type = GGML_TYPE_Q8_0;
    }
    else if (name.find("attn_k.weight") != std::string::npos || name.find("attn_v.weight") != std::string::npos) {
        // attention K/V projections: always Q8_0
        new_type = GGML_TYPE_Q8_0;
    }
    else if (name.find("attn_q.weight") != std::string::npos || name.find("attn_output.weight") != std::string::npos) {
        // attention Q/output projections: always Q8_0
        new_type = GGML_TYPE_Q8_0;
    }
    else if (name.find("ffn_down") != std::string::npos) {
        // FFN down projection: Q6_K for all three ftypes
        new_type = GGML_TYPE_Q6_K;
    }
    else if (name.find("ffn_gate") != std::string::npos || name.find("ffn_up") != std::string::npos) {
        // FFN gate/up projections: the only tensors kept at the base ftype's level
        if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
            new_type = GGML_TYPE_Q5_K;
        }
        else {
            new_type = GGML_TYPE_Q6_K;
        }
    }
    else {
        throw std::runtime_error(format("Unhandled tensor type: %s", name.c_str()));
    }
}
else
// ### JUK ###
```
(added at the top of the function to basically override all `Q4_K_M`, `Q5_K_M` and `Q6_K` quants)
I came up with this after spending a couple of hours trying to ascertain the original design choices of the function (mainly to adapt for newer MoEs with > 8 experts to start with).
For `command-r-plus:104b`, `wizard-lm:8x22b` and `mistral-large:123b`, I've found that around 8-10k tokens in (8-10 user/bot interactions) with 4-5 distinct character POVs, the `Q4_K_M`, `Q5_K_M` and even `Q6_K` quants all start to get confused and mix up the different characters' POVs. The names will start to get mixed up and the stories all start to blend together. (`Q6_K` --> `Q5_hacked`, etc.)

I also used an imatrix for these and IIRC the imatrix code for the `Q6_K` function is commented out and the `Q8_0` function doesn't use the imatrix anyway, so in effect the only place the imatrix is being used for my hacked version is for the `ffn_up` and `ffn_gate` tensors.
I can't really test the smaller models as they tend to get confused with multi-POV story-writing even at fp16
, but I'm sure there must be some similar test that could be used to see if they too are suffering from the same problem...
NOTE: This is not a very scientific test, but I do think this function desperately needs bringing up to date... I suspect a lot of the wisdom that went into it was from a time when models were much smaller and had way lower effective context lengths compared to what we have now (and only 8-expert MoEs due to the hard-coded tests). I also think that any perplexity test, or otherwise, may need to look much further into the context to see the effects of different quantisation strategies in the future.
Interesting article related to this linked on reddit today:
https://fireworks.ai/blog/fireworks-quantization
The more I think about this the more I think single-token metrics are deeply flawed:
In terms of single-token metrics, both of these will likely just show up as a small mistake in a single token's distribution, but the effects are way more severe due to the auto-regressive generation... Two mistakes with the same KL-divergence can have massively different effects on the future tokens.
I am quite keen to see if we can automate `llama_tensor_get_type()` and I'm pretty sure it is quite doable (at least as a proof of concept on small ~7b models), but without some good metric to work against I think it's just going to end up optimising the wrong thing... :/
@jukofyork Thanks for the reference. It was a great read. Still it's funny that they use fp16 (and not bf16) and fp8, which isn't necessarily best.
I'd advocate to separate the task in two:
1. My initial request is about refactoring the current code for maintainability and readability **without changing its behavior**.
2. The following discussion is about improving the algorithms.

I'd advocate to have the second discussion as a separate issue. As such, my preference would be to focus on how to make sure that we can refactor the code to be more readable and scalable while ensuring the quantization selected for each tensor doesn't change for any model. wdyt?
Yeah, I agree, but the hard-coded tests for `n_experts == 8` or `n_experts <= 8` should probably be fixed - the newer MoE models are currently falling through the tests that bump the `v_proj` matrix, which is likely hurting them quite significantly for a minuscule change in the final model size (which was the whole point of the bumps, from reading the source).

Also, a couple of the tensor name sub-strings should probably be changed so as to catch the low-rank attention matrices used by the recent `deepseek-v2` models too.
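For illustration only, a broadened substring check might look like the sketch below; the `attn_kv_a` / `attn_kv_b` substrings are placeholders I have not verified against the actual deepseek-v2 tensor names.

```cpp
#include <string>

// Illustrative only: give the low-rank attention projections the same bump the
// regular K/V projections get. The substrings are placeholders, not verified
// deepseek-v2 tensor names.
static bool is_low_rank_attn_tensor(const std::string & name) {
    return name.find("attn_kv_a") != std::string::npos ||
           name.find("attn_kv_b") != std::string::npos;
}
```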
> Would it be valuable to be able to calculate the bpw instead of hardcoding?

The exact BPW for each model and quantization is already provided in the logs:

```
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name = llama2
```
Not sure if we could do better since the mixture of quantization types can be different for different model architectures, even if the ftype is the same.
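For what it's worth, the 4.54 BPW in that log line is just the total tensor bytes expressed in bits divided by the parameter count, so an effective BPW could in principle be computed rather than hard-coded in `llama_model_ftype_name()`. A minimal standalone sketch of the arithmetic (not llama.cpp API):

```cpp
#include <cstdint>
#include <cstdio>

// Effective bits-per-weight: total bytes of tensor data, in bits, divided by
// the total number of parameters.
static double effective_bpw(uint64_t total_tensor_bytes, uint64_t n_params) {
    return 8.0 * (double) total_tensor_bytes / (double) n_params;
}

int main() {
    // Numbers from the log excerpt above: 3.56 GiB and 6.74 B parameters.
    const uint64_t bytes  = (uint64_t) (3.56 * 1024.0 * 1024.0 * 1024.0);
    const uint64_t params = (uint64_t) 6.74e9;
    printf("%.2f BPW\n", effective_bpw(bytes, params)); // prints ~4.54 BPW
    return 0;
}
```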
> Is the entropy accumulation and risk of unintended consequence in `llama_model_ftype_name()` considered a problem?

> Do we want to provide flexibility for each tensor quantization level without having to recompile? e.g. https://github.com/ggerganov/llama.cpp/pull/6844
Yes, we can improve this. I also like the approach in #6844, though the PR got stale.
> Extract the code into a standalone (its own .cpp file) reusable function so it can be built as a stand-alone executable to be called by tools.

We can do that.

> Create a table-driven unit test for all existing known combinations to ensure new modifications do not have unintended side effects.
> - Automate the generation of the table-driven unit test for ease of development. (I like this one)

Sounds good too.

> Convert the function into a formal finite-state machine with explicit internal states.
> Convert the function into a table-driven machine, i.e. the source of truth becomes pure data, where conditions are encoded as data. (I like this one) This is a fully generalized form instead of a one-off as done in https://github.com/ggerganov/llama.cpp/pull/6844.

I'm not sure I understand these - hopefully not over-engineering.

> Generate the function via another language. Python or Starlark would be potential choices.

Most likely no.
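To make the table-driven option above a bit more concrete, here is a minimal, purely illustrative sketch of encoding per-tensor overrides as data instead of nested conditionals. None of these type or field names exist in llama.cpp, and plain `int`s stand in for the real `llama_ftype` / `ggml_type` enums:

```cpp
#include <string>
#include <vector>

// Hypothetical rule: if the tensor name contains name_substr (and the rule's
// ftype matches, when it is not the wildcard -1), use target_type.
struct tensor_quant_rule {
    int         ftype;        // e.g. LLAMA_FTYPE_MOSTLY_Q4_K_M, or -1 for "any"
    std::string name_substr;  // e.g. "ffn_down"
    int         target_type;  // e.g. GGML_TYPE_Q6_K
};

// First matching rule wins; otherwise fall back to the ftype's default type.
static int pick_tensor_type(const std::vector<tensor_quant_rule> & rules,
                            int ftype, const std::string & name, int default_type) {
    for (const auto & r : rules) {
        if (r.ftype != -1 && r.ftype != ftype) {
            continue;
        }
        if (name.find(r.name_substr) != std::string::npos) {
            return r.target_type;
        }
    }
    return default_type;
}
```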
Thanks for the feedback @ggerganov ! If someone wants to sign up right away, please express your interest. I'll do the same if my schedule opens up.
I did a quick check for the task "move quantization into its own source file" as I believe it's a good first step to get things in order. It seems like the cleanest way to splice (IMHO) is to move `llama_model_quantize()` into a new file named `src/llama-quantize.cpp`. The rationale is that quantization is a stand-alone operation, while dequantization is part of normal serving. Moving this function requires a ton of support code and structs to become available outside of `src/llama.cpp`. If it were the route to go, here's a subset of the steps I identified:

- `enum llm_arch` should be moved into a stand-alone header file, e.g. `src/gguf.h` (since it's internal), as it could become generated by gguf-py, with `MODEL_ARCH` in `gguf-py/gguf/constants.py` becoming the source of truth.
- `struct llama_model_loader` and `struct LLM_KV` have to become available in `llama-impl.h` so `llama-quantize.cpp` can use them, along with the declaration of `llm_load_arch()`. There's more than that but I didn't go farther.
- `format()` will have to be declared in `llama-impl.h`.

Before doing any PR, @ggerganov does that sound like a good direction? My gut feeling is that the `enum llm_arch` migration should be done as a stand-alone PR first. The rest though could create a fair amount of churn, so I don't know if you think it's worth the price. wdyt?
> Then struct llama_model_loader and struct LLM_KV have to become available in llama-impl.h so llama-quantize.cpp can use it, along the declaration of llm_load_arch(). There's more than that but I didn't go farther.

Moving `llama_model_loader` would drag a lot of stuff with it. It might be better to start separating just the quantization logic that is applied per tensor and keep the model loading in `llama.cpp` for now.
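A rough sketch of what that narrower split could look like: only the per-tensor type decision is exported from a new translation unit, while model loading stays where it is. The header name and function signature below are hypothetical, not the current llama.cpp API.

```cpp
// llama-quantize-rules.h -- hypothetical header name, not part of llama.cpp today.
#pragma once

#include <string>

#include "ggml.h"   // enum ggml_type
#include "llama.h"  // enum llama_ftype

// Hypothetical signature: decide the quantization type for a single tensor from
// the requested ftype, the tensor name, and the metadata the current rules use.
enum ggml_type llama_tensor_get_type_for(
        enum llama_ftype    ftype,
        const std::string & tensor_name,
        int                 n_layer,
        int                 i_layer,
        int                 n_expert);
```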
This issue was closed because it has been inactive for 14 days since being marked as stale.
Dang, I still planned to work on it.
Background Description
`llama_tensor_get_type()` in src/llama.cpp is nearly 300 lines of conditions and has a bit of inconsistent logic (and formatting). Some parts are dense (what is `use_more_bits()`'s purpose exactly?) and take a while to truly understand. The chances of unintended side effects when changing the code are high.

There's no way to get the effective bpw for variable quantization forms Qn_K, so it is hardcoded in `llama_model_ftype_name()` in src/llama.cpp and examples/quantize/quantize.cpp instead of being calculated. While for Qn_0 and Qn_1 flavors it's not a problem, it is a problem with Qn_K flavors where there's a judgement call on how deep each tensor is quantized. Furthermore, the Qn_K flavors are underspecified and the function is the only source of truth of what the end result (e.g. Q5_K_M vs Q5_K_L) should look like.
As quantization techniques continue to evolve, this discrepancy will become more difficult to manage; it's essentially an entropy accumulator. An example of desire for flexibility is https://github.com/ggerganov/llama.cpp/pull/6844.
Questions:

- Would it be valuable to be able to calculate the bpw instead of hardcoding?
- Is the entropy accumulation and risk of unintended consequence in `llama_model_ftype_name()` considered a problem?
- Do we want to provide flexibility for each tensor quantization level without having to recompile? e.g. https://github.com/ggerganov/llama.cpp/pull/6844

If all are "no" for maintainers, then we can close this request.
Possible Refactor Approaches
I don't have a strong personal opinion on what the end state of `llama_tensor_get_type()` should look like. Here are possible options (some far-fetched), and maybe a combination of these would be a path forward:

- Extract the code into a standalone (its own .cpp file) reusable function so it can be built as a stand-alone executable to be called by tools.
- Create a table-driven unit test for all existing known combinations to ensure new modifications do not have unintended side effects.
- Automate the generation of the table-driven unit test for ease of development. (I like this one)
- Convert the function into a formal finite-state machine with explicit internal states.
- Convert the function into a table-driven machine, i.e. the source of truth becomes pure data, where conditions are encoded as data. (I like this one) This is a fully generalized form instead of a one-off as done in https://github.com/ggerganov/llama.cpp/pull/6844.
- Generate the function via another language. Python or Starlark would be potential choices.

I do not believe this is a performance-critical function? Its run time only has to be negligible, which could color the option(s) chosen.