Somehow, we ended up with a variety of quantization mixtures:
Old model (GGUF V1):
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q4_0: 1 tensors
llama_model_loader: - type q3_K: 128 tensors
llama_model_loader: - type q4_K: 92 tensors
llama_model_loader: - type q5_K: 4 tensors
llm_load_print_meta: format = GGUF V1 (support until nov 2023)

New model (GGUF V2):
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q3_K: 129 tensors
llama_model_loader: - type q4_K: 92 tensors
llama_model_loader: - type q5_K: 4 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_print_meta: format = GGUF V2 (latest)
Though, it's unclear why the old model produces different results between master and 45855b3.
We should look into this.
I'm not at my computer currently, but I recall from debugging that the old model exhibited similar symptoms once before, after a change to how default rope values were handled, and that it was fixed soon after.
I believe the similar behavior appeared after a5661d7e71d15b8dfc81bc0510ba912ebe85dfa3, and that it was resolved by 51a7cf5c6e490b2f51c82daa76c4ca4f8d845826.
Perhaps rope is the source of this issue?
> Perhaps rope is the source of this issue?

It should be, since Code Llama's rope base differs from the fallback value.
It's possible, although all logs indicate that the correct rope freq base value of 1e6 is being used.
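For context, a rough sketch of how a rope freq base fallback works at load time; the key name and gguf calls below follow the GGUF metadata conventions as I understand them, so treat this as an illustration rather than the exact llama.cpp loader code:

// Illustrative sketch, not the actual llama.cpp loader: if the GGUF
// metadata key for the rope frequency base were missing or ignored,
// a generic fallback of 10000.0f would be used instead of Code
// Llama's 1e6, noticeably changing the positional encoding.
float rope_freq_base = 10000.0f; // generic LLaMA fallback
const int kid = gguf_find_key(ctx, "llama.rope.freq_base"); // assumed key name
if (kid >= 0) {
    rope_freq_base = gguf_get_val_f32(ctx, kid); // Code Llama models store 1e6
}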
I did a git bisect on the custom-attention-mask branch and found that the commit that caused the regression in the old model was e1067efbfa0895115cc639ead8b22cdceef4eca1. The specific line is:
https://github.com/ggerganov/llama.cpp/blob/e1067efbfa0895115cc639ead8b22cdceef4eca1/llama.cpp#L4106
On master, reverting the corresponding line:
https://github.com/ggerganov/llama.cpp/blob/f5ef5cfb18148131fcf45bdd2331f0db5ab7c3d0/llama.cpp#L4072
To:
kv_self.n = llama_kv_cache_cell_max(kv_self);
Fixes the regression in the old model (willing to bet this breaks other things, though). The newer model still behaves the same, but that gives a clue as to why the new model is broken.
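For reference, the corresponding line on master pads the KV cache cell count; the contrast with the reverted form is roughly as follows (an approximate paraphrase of the linked code, not a verbatim copy):

// Before #3228 (the reverted form): use the exact cell count.
kv_self.n = llama_kv_cache_cell_max(kv_self);

// On master (paraphrased): round the cell count up to a multiple of 32,
// with a minimum of 32, so the attention operations always see
// SIMD-friendly sizes.
kv_self.n = std::min(
        (int32_t) cparams.n_ctx,
        std::max(32, GGML_PAD(llama_kv_cache_cell_max(kv_self), 32)));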
I looked into this more deeply and I actually don't think this is a regression. When generating the source code, one of the first tokens is on the line after the opening "{". In one case we sample "long" and in the other case we sample "double".
Looking at the logits for that token with and without the proposed patch, we see that they are very close to each other:
# without patch
max logit       = 16.960428
"long" logit    = 16.960428
"double" logit  = 16.878107
sampled token: long

# with patch
max logit       = 16.868914
"long" logit    = 16.844076
"double" logit  = 16.868914
sampled token: double
The version on master forces the computation of some operations to always involve SIMD (due to having 32 elements), while before #3228 some of these operations were performed without SIMD. This can lead to slight numerical differences and hence influence the results. Keep in mind these are auto-regressive models, so small changes at the start can lead to large changes in the end.
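To make the mechanism concrete, here is a toy illustration (not llama.cpp code) of how summing the same values in a different order, as SIMD lanes effectively do, can change the low-order bits of a result:

#include <cstdio>
#include <vector>

// Toy demonstration: float addition is not associative, so a scalar
// running sum and a SIMD-style reduction over 8 partial "lanes" can
// disagree in the low-order bits, even over identical inputs.
int main() {
    std::vector<float> v(1000);
    for (int i = 0; i < 1000; ++i) {
        v[i] = 0.1f * (i % 7) + 1e-4f * i;
    }

    // scalar order: one running accumulator
    float scalar = 0.0f;
    for (float x : v) scalar += x;

    // SIMD-like order: 8 independent partial sums, reduced at the end
    float lanes[8] = {0};
    for (size_t i = 0; i < v.size(); ++i) lanes[i % 8] += v[i];
    float simd = 0.0f;
    for (float l : lanes) simd += l;

    // the two results typically differ by a few ULPs
    printf("scalar = %.9f\nsimd   = %.9f\n", scalar, simd);
}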
Technically, the "incorrect" answer is still correct; it just does some extra iterations to approximate the inverse square root (see the fast inverse square root article on Wikipedia).
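For reference, the refinement the generated code is repeating is the standard Newton-Raphson update for 1/sqrt(x); each extra iteration only tightens the estimate, which is why the additional iterations are redundant rather than wrong (a generic sketch, not the model's exact output):

// Newton-Raphson refinement for 1/sqrt(x): given an initial estimate y,
// each iteration of y = y * (1.5 - 0.5 * x * y * y) improves the
// approximation, so extra iterations stay correct.
double refine_inv_sqrt(double x, double y, int n_iters) {
    for (int i = 0; i < n_iters; ++i) {
        y = y * (1.5 - 0.5 * x * y * y);
    }
    return y;
}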
So I think everything works normally here, and the difference when using the old model on the CPU comes down to rounding differences depending on whether SIMD is used for some operations.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Using this Code Llama 7B Q3_K_M model, uploaded by @TheBloke on August 24th, with llama.cpp versions from before #3228 was merged produced the following output:
Current Behavior
Running any moderately recent version of llama.cpp with the newest Code Llama 7B Q3_K_M uploaded by TheBloke here, or running the older version of the model with llama.cpp's current master, produces the following output:
Both models produce the same output on master, whereas the old model produced the correct output up until #3228.
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
$ lscpu
$ uname -a
Linux pop-os 6.4.6-76060406-generic #202307241739~1692717645~22.04~5597803 SMP PREEMPT_DYNAMIC Tue A x86_64 x86_64 x86_64 GNU/Linux
Failure Information (for bugs)
Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
make
./main -t 4 -m ./models/codellama-7b.Q3_K_M.gguf --color -c 512 --temp 0.0 --repeat_penalty 1.0 -n 128 -p "double fast_inverse_square_root(double x"
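Note that --temp 0.0 makes generation effectively greedy: each step takes the argmax over the logits, which is why the tiny logit differences shown earlier are enough to flip the sampled token. A minimal sketch of greedy selection (not llama.cpp's actual sampler):

// Greedy (temperature 0) token selection, sketched: simply take the
// argmax over the vocabulary logits.
int greedy_pick(const float * logits, int n_vocab) {
    int best = 0;
    for (int i = 1; i < n_vocab; ++i) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return best;
}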
Failure Logs
Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.
Logs for all 4 tested cases are attached; GitHub wouldn't let me paste them in here.
old-model-commit-45855b3.log new-model-commit-45855b3.log old-model-master.log new-model-master.log