ggerganov / llama.cpp

LLM inference in C/C++
MIT License

[User] Regression with CodeLlama 7B #3384

Closed AutonomicPerfectionist closed 6 months ago

AutonomicPerfectionist commented 1 year ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

Running this CodeLlama 7B Q3_K_M model, uploaded by @TheBloke on August 24th, with llama.cpp versions from before #3228 was merged produced the following output:

$ ./main -t 4 -m ./models/codellama-7b.Q3_K_M.gguf.old --color -c 512 --temp 0.0 --repeat_penalty 1.0 -n 128 -p "double fast_inverse_square_root(double x"

 double fast_inverse_square_root(double x)
{
    double xhalf = 0.5 * x;
    int64_t i = *(int64_t*)&x;
    i = 0x5fe6ec85e7de30da - (i >> 1);
    x = *(double*)&i;
    x = x * (1.5 - xhalf * x * x);
    return x;
}

double fast_inverse_square_root_2(double x)
{
    double xhalf = 0.5 *
llama_print_timings:        load time =   399.81 ms
llama_print_timings:      sample time =     4.18 ms /   128 runs   (    0.03 ms per token, 30600.05 tokens per second)
llama_print_timings: prompt eval time =  1082.34 ms /    13 tokens (   83.26 ms per token,    12.01 tokens per second)
llama_print_timings:        eval time = 16587.27 ms /   127 runs   (  130.61 ms per token,     7.66 tokens per second)
llama_print_timings:       total time = 17758.83 ms
Log end

Current Behavior

Running any moderately recent version of llama.cpp with the newest CodeLlama 7B Q3_K_M uploaded by TheBloke here, or running the older version of the model with llama.cpp's current master, produces the following output:

$ ./main -t 4 -m ./models/codellama-7b.Q3_K_M.gguf.old --color -c 512 --temp 0.0 --repeat_penalty 1.0 -n 128 -p "double fast_inverse_square_root(double x"

 double fast_inverse_square_root(double x)
{
    long i;
    double x2, y;
    const double threehalfs = 1.5;

    x2 = x * 0.5;
    y  = x;
    i  = * ( long * ) &y;
    i  = 0x5f3759df - ( i >> 1 );
    y  = * ( double * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );
    y  = y * ( threehalf
llama_print_timings:        load time =  1603.99 ms
llama_print_timings:      sample time =     4.17 ms /   128 runs   (    0.03 ms per token, 30732.29 tokens per second)
llama_print_timings: prompt eval time =  1096.09 ms /    13 tokens (   84.31 ms per token,    11.86 tokens per second)
llama_print_timings:        eval time = 16623.97 ms /   127 runs   (  130.90 ms per token,     7.64 tokens per second)
llama_print_timings:       total time = 17809.38 ms
Log end

Both models produce the same output on master, whereas the old model produced the expected output up until #3228 was merged.

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

$ lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            10
    CPU max MHz:         4100.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts 
                         rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer
                          aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2
                          erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    8 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:         
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; IBRS
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Not affected

$ uname -a Linux pop-os 6.4.6-76060406-generic #202307241739~1692717645~22.04~5597803 SMP PREEMPT_DYNAMIC Tue A x86_64 x86_64 x86_64 GNU/Linux

$ python3 --version
Python 3.10.12
$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Failure Information (for bugs)

Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Download model from here
  2. Clone llama.cpp and build with make
  3. Run ./main -t 4 -m ./models/codellama-7b.Q3_K_M.gguf --color -c 512 --temp 0.0 --repeat_penalty 1.0 -n 128 -p "double fast_inverse_square_root(double x"

Failure Logs

Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.

Logs for all 4 tested cases are attached; GitHub wouldn't let me paste them inline.

old-model-commit-45855b3.log new-model-commit-45855b3.log old-model-master.log new-model-master.log

ggerganov commented 1 year ago

Somehow, we ended up with a variety of quantum mixtures.

Though, it's unclear why the old model produces different results between master and 45855b3. We should look into this.

AutonomicPerfectionist commented 1 year ago

I'm not at my computer right now, but while debugging I recall the old model exhibited similar symptoms once before, after a change to how default RoPE values were handled; it was fixed soon after.

I believe it was after a5661d7e71d15b8dfc81bc0510ba912ebe85dfa3 that a similar behavior was exhibited, but it was resolved by 51a7cf5c6e490b2f51c82daa76c4ca4f8d845826

Perhaps rope is the source of this issue?

Green-Sky commented 1 year ago

Perhaps rope is the source of this issue?

It should be, since CodeLlama's RoPE base differs from the fallback value.

ggerganov commented 1 year ago

It's possible, although all logs indicate the correct 1e6 rope freq base value is being used.

AutonomicPerfectionist commented 1 year ago

I did a git bisect on the custom-attention-mask and found the commit that caused the regression in the old model was e1067efbfa0895115cc639ead8b22cdceef4eca1. The specific line is: https://github.com/ggerganov/llama.cpp/blob/e1067efbfa0895115cc639ead8b22cdceef4eca1/llama.cpp#L4106

On master, reverting the corresponding line:

https://github.com/ggerganov/llama.cpp/blob/f5ef5cfb18148131fcf45bdd2331f0db5ab7c3d0/llama.cpp#L4072

To:

kv_self.n = llama_kv_cache_cell_max(kv_self);

Fixes the regression in the old model (though I'm willing to bet it breaks other things). The newer model still behaves the same, but this gives a clue as to why the new model is broken.

ggerganov commented 1 year ago

I looked into this more deeply, and I actually don't think this is a regression. When generating the source code, one of the first tokens is the type on the line after the opening {. In one case we sample "long" and in the other we sample "double".

Looking at the logits for that token with and without the proposed patch, we see that they are very close to each other:

# without patch
        max logit = 16.960428
   "long"  logit  = 16.960428
   "double" logit = 16.878107
 -> samples "long"

# with patch
        max logit = 16.868914
   "long"  logit  = 16.844076
   "double" logit = 16.868914
 -> samples "double"

The version on master forces the computation of some operations to always involve SIMD (due to having 32 elements), while before #3228 some of these operations were performed without SIMD. This can lead to slight numerical differences and hence influence the results. Keep in mind these are auto-regressive models, so small changes at the start can lead to large changes in the end.

Technically, the "incorrect" answer is still correct: it just performs some extra Newton-Raphson iterations to approximate the inverse square root (see the fast inverse square root article on Wikipedia).

So I think everything works normally here and the difference when using the old model on the CPU is due to rounding differences when SIMD is utilized or not for some operations.

github-actions[bot] commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.