ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Certain 70B Q4_0 quants outputting gibberish (other quant formats unaffected) #3148

Closed TheBloke closed 5 months ago

TheBloke commented 1 year ago

Hi guys

I've just had reports that two specific Q4_0 70B models are outputting gibberish, and I've confirmed the same.

Example file with this issue: https://huggingface.co/TheBloke/Spicyboros-70B-2.2-GGUF/blob/main/spicyboros-70b-2.2.Q4_0.gguf
Second example, made 12 days ago: https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-Creative-GGUF/blob/main/airoboros-l2-70b-2.1-creative.Q4_0.gguf

I've had no reports of problems with other quants. I've tested Q4_K_M and Q5_0 from the same model and commit, and both were fine.

The bad Spicyboros q4_0 was made with commit d54a402

At first I thought it was a recent problem until I realised there was also a file from 12 days ago with the same issue.

But a 70B q4_0 I made three days ago, with commit 21ac3a1, is fine: https://huggingface.co/TheBloke/ORCA_LLaMA_70B_QLoRA-GGUF/blob/main/orca_llama_70b_qlora.Q4_0.gguf

I notice both broken models were made by Jon Durbin - could there be something in the source model causing this? But only for q4_0? That's weird.

Full output when testing the Spicyboros 70B Q4_0 GGUF file (too long to post in one comment!): https://gist.github.com/TheBloke/b7a45d3e5ff1432f90aa221de6a5fb08#file-q4_0-gibberish-log

Trimmed log:

(pytorch2)  ubuntu@a10:/workspace/git/gguf-llama (master ✔) ᐅ ./main -m /workspace/spicyboros-70B-2.2.Q4_0.gguf -c 4096 -p "A chat.\nUSER: Write a story about llamas\nASSISTANT:" -n 128
Log start
main: build = 1215 (89e8959)
main: seed  = 1694547445
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA A10, compute capability 8.6
llama_model_loader: loaded meta data with 20 key-value pairs and 723 tensors from /workspace/spicyboros-70B-2.2.Q4_0.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  8192, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q4_0     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q4_0     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q4_0     [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor    8:           blk.0.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor    9:            blk.0.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor   10:              blk.1.attn_q.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor   11:              blk.1.attn_k.weight q4_0     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor   12:              blk.1.attn_v.weight q4_0     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor   13:         blk.1.attn_output.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor   14:            blk.1.ffn_gate.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.ffn_up.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor   16:            blk.1.ffn_down.weight q4_0     [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor   17:           blk.1.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor   18:            blk.1.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor   19:              blk.2.attn_q.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor   20:              blk.2.attn_k.weight q4_0     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor   21:              blk.2.attn_v.weight q4_0     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor   22:         blk.2.attn_output.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor   23:            blk.2.ffn_gate.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor   24:              blk.2.ffn_up.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor   25:            blk.2.ffn_down.weight q4_0     [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor   26:           blk.2.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor   27:            blk.2.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor   28:              blk.3.attn_q.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor   29:              blk.3.attn_k.weight q4_0     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor   30:              blk.3.attn_v.weight q4_0     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor   31:         blk.3.attn_output.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor   32:            blk.3.ffn_gate.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor   33:              blk.3.ffn_up.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor   34:            blk.3.ffn_down.weight q4_0     [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor   35:           blk.3.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor   36:            blk.3.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor   37:              blk.4.attn_q.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor   38:              blk.4.attn_k.weight q4_0     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor   39:              blk.4.attn_v.weight q4_0     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor   40:         blk.4.attn_output.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor   41:            blk.4.ffn_gate.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor   42:              blk.4.ffn_up.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor   43:            blk.4.ffn_down.weight q4_0     [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor   44:           blk.4.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor   45:            blk.4.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor   46:              blk.5.attn_q.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor   47:              blk.5.attn_k.weight q4_0     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor   48:              blk.5.attn_v.weight q4_0     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor   49:         blk.5.attn_output.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor   50:            blk.5.ffn_gate.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor   51:              blk.5.ffn_up.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor   52:            blk.5.ffn_down.weight q4_0     [ 28672,  8192,     1,     1 ]
... trimmed ...
llama_model_loader: - tensor  716:           blk.79.ffn_gate.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  717:             blk.79.ffn_up.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor  718:           blk.79.ffn_down.weight q4_0     [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor  719:          blk.79.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  720:           blk.79.ffn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  721:               output_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  722:                    output.weight q6_K     [  8192, 32000,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                       llama.rope.freq_base f32
llama_model_loader: - kv  11:                          general.file_type u32
llama_model_loader: - kv  12:                       tokenizer.ggml.model str
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv  19:               general.quantization_version u32
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q4_0:  561 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_print_meta: format         = GGUF V2 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 4096
llm_load_print_meta: n_embd         = 8192
llm_load_print_meta: n_head         = 64
llm_load_print_meta: n_head_kv      = 8
llm_load_print_meta: n_layer        = 80
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 8
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 28672
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 70B
llm_load_print_meta: model ftype    = mostly Q4_0
llm_load_print_meta: model size     = 68.98 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.23 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 37070.97 MB (+ 1280.00 MB per state)
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/83 layers to GPU
llm_load_tensors: VRAM used: 0 MB
....................................................................................................
llama_new_context_with_model: kv self size  = 1280.00 MB
llama_new_context_with_model: compute buffer total size =  561.47 MB
llama_new_context_with_model: VRAM scratch buffer: 560.00 MB

system_info: n_threads = 15 / 30 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = 128, n_keep = 0

 A chat.\nUSER: Write a story about llamas\nASSISTANT:oid◄◄letteakoÝbrieпіbrieroberiaÝiomcych Insertomengenommen prolong Feder Sebbrie◄ fifigliaÝ Matthoauthandro◄loyee◄ obser cabarfgresloyeeigliaMITgenommen◄тистиbrie stat◄
staviq commented 1 year ago

Not sure if it's related, but the \n escapes are not being processed; off the top of my head I think you need -e for that. I checked and it makes no difference.
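
For reference, the same command with escape processing turned on (paths taken from the log above) would be:

./main -m /workspace/spicyboros-70B-2.2.Q4_0.gguf -c 4096 -e -p "A chat.\nUSER: Write a story about llamas\nASSISTANT:" -n 128

With -e the \n sequences become real newlines before tokenization; without it the literal characters go into the prompt. Either way the output here is still gibberish, so that isn't the cause.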

I also noticed token candidates are almost identical for each (next) token:

1st:
[1694552218] top 10 candidates:
[1694552218]  - 23747: '        brie' (0,103)
[1694552218]  - 30322: '         ◄' (0,070)
[1694552218]  - 30777: '          Ý' (0,038)
[1694552218]  - 12964: '       iglia' (0,037)
[1694552218]  -  7776: '         cab' (0,037)
[1694552218]  - 21096: '    genommen' (0,035)
[1694552218]  - 25168: '        teck' (0,032)
[1694552218]  -  7201: '        gres' (0,031)
[1694552218]  -  8749: '        eria' (0,031)
[1694552218]  - 13716: '         rob' (0,031)

...

2nd:
[1694552270] top 10 candidates:
[1694552270]  - 23747: '        brie' (0,104)
[1694552270]  - 30322: '         ◄' (0,071)
[1694552270]  - 30777: '          Ý' (0,038)
[1694552270]  - 12964: '       iglia' (0,038)
[1694552270]  -  7776: '         cab' (0,037)
[1694552270]  - 21096: '    genommen' (0,035)
[1694552270]  - 25168: '        teck' (0,032)
[1694552270]  -  7201: '        gres' (0,032)
[1694552270]  - 13716: '         rob' (0,031)
[1694552270]  -  8749: '        eria' (0,031)
slaren commented 1 year ago

The model is definitely broken; the ppl of the first blocks is [1]221713.5241, [2]187850.4390, [3]177167.9363.
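
For reference, those per-block numbers come from the perplexity example, e.g. (the test file being whatever local wikitext-2 copy you use):

./perplexity -m spicyboros-70b-2.2.Q4_0.gguf -f wiki.test.raw

A healthy 70B Q4_0 reports perplexity in the single digits here, so values in the hundreds of thousands mean the output is essentially noise.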

TheBloke commented 1 year ago

I just made another q4_0 for https://huggingface.co/TheBloke/Spicyboros-70B-2.2-GGUF/blob/main/spicyboros-70b-2.2.Q4_0.gguf, this time with commit 4f7cd6ba9c88d3ca9a207b6e04f8b2b1efd707b8

File came out identical - same sha256sum - and of course therefore the same gibberish output.

Very odd!

I don't know if it's of any help, but here's the full log of making the new q4_0: first making the FP16, then the q4_0. The FP16 we know is fine because all the other quants made from it are fine: https://gist.github.com/TheBloke/6fe3bb4d870e45c97acb71772906caaf#file-quant-spicyboros-q4_0-log

slaren commented 1 year ago

For what it is worth, I looked at the mean, min and max of each tensor and compared them to the Q4_K_S model, and I didn't see anything obviously out of place. The tokenizer also looks fine.

staviq commented 1 year ago

That's the only change to quantize.cpp in the last week (5 days ago): https://github.com/ggerganov/llama.cpp/commit/00d62adb79bf914a95fb9a2e8f42f3029e76d62c#diff-6745585c496560d324d1f0d6d77beebcb6dd9c3354bef41ab262535a87a376a7 (was that else if -> if intended?)

And 2 weeks ago: https://github.com/ggerganov/llama.cpp/commit/5d6f19f16b2173afe2d5c6aee2f5c9fc31038eba#diff-6745585c496560d324d1f0d6d77beebcb6dd9c3354bef41ab262535a87a376a7

Other than that, all changes were cosmetic, all the way back to the GGUF merge.

So whatever got borked, it's in one of those.

@Cebtenzzre That commit was about gcc warning fixes, but that one is a functional change: wasn't that else { if(){} } supposed to be else if () {}? https://github.com/ggerganov/llama.cpp/commit/00d62adb79bf914a95fb9a2e8f42f3029e76d62c#diff-6745585c496560d324d1f0d6d77beebcb6dd9c3354bef41ab262535a87a376a7

cebtenzzre commented 1 year ago

That commit was about gcc warning fixes, but that one is a functional change: wasn't that else { if(){} } supposed to be else if () {}?

No. If the condition is true, the function returns, so the only way to get to that line is if the condition was false - the 'else' is unnecessary.
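
A minimal sketch of the pattern (made-up names, not the actual quantize.cpp code):

static int before(int n) {
    if (n == 0) {
        return 1;           // early return
    } else if (n < 0) {     // this branch is only ever reached when n != 0 ...
        return -1;
    }
    return n;
}

static int after(int n) {
    if (n == 0) {
        return 1;           // early return
    }
    if (n < 0) {            // ... so a plain 'if' behaves identically
        return -1;
    }
    return n;
}

Since the first branch returns, removing the 'else' can't change which code runs.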

staviq commented 1 year ago

I ran an f16 -> Q4_0 quantize of open-llama-3b-v2-f16 on commits from ebcee20 to current, and I got an identical checksum every time, so it seems specific to 70B.

Does anybody have a link to an f16 of any of the mentioned models? I can run a script overnight to find out whether the checksum changes between commits.

cebtenzzre commented 1 year ago

Does anybody have a link to an f16 of any of the mentioned models?

Ones that reproduce the gibberish: https://huggingface.co/jondurbin/spicyboros-70b-2.2 https://huggingface.co/jondurbin/airoboros-l2-70b-2.1-creative

One that was apparently OK on an earlier commit: https://huggingface.co/fangloveskari/ORCA_LLaMA_70B_QLoRA

staviq commented 1 year ago

Yes, I've seen those but aren't they raw f32 ? That's not a problem; it's just that with an f16 I could run wget && script right now, whereas with raw weights I'm gonna have to convert them in the morning and results would probably come tomorrow evening.

Edit: It's not that bad, HF isn't throttling much this time, only 20min download.

cebtenzzre commented 1 year ago

Yes, I've seen those but aren't they raw f32 ?

No, 145GiB 70B should be fp16. I think most HF uploads are. Compare to TheBloke/Llama-2-70B-fp16.

staviq commented 1 year ago

Yes, I've seen those but aren't they raw f32 ?

No, 145GiB 70B should be fp16. I think most HF uploads are. Compare to TheBloke/Llama-2-70B-fp16.

Ok.

Edit: I'll finish tomorrow, it's like 5 in the morning and I can't see what I'm missing here:

root@ch81:/storage/2sata/llama/llama.cpp# python3 ./convert.py --outtype f16 --outfile test.gguf "/storage/2sata/llama/spicyboros-70b-2.2/"
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00001-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00001-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00002-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00003-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00004-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00005-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00006-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00007-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00008-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00009-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00010-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00011-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00012-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00013-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00014-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00015-of-00015.bin
params = Params(n_vocab=32000, n_embd=8192, n_layer=80, n_ctx=4096, n_ff=28672, n_head=64, n_head_kv=8, f_norm_eps=1e-05, f_rope_freq_base=10000.0, f_rope_scale=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('/storage/2sata/llama/spicyboros-70b-2.2'))
Loading vocab file '/storage/2sata/llama/spicyboros-70b-2.2/tokenizer.model', type 'spm'
Traceback (most recent call last):
  File "./convert.py", line 1208, in <module>
    main()
  File "./convert.py", line 1190, in main
    vocab = load_vocab(vocab_dir, args.vocabtype)
  File "./convert.py", line 1101, in load_vocab
    return SentencePieceVocab(path, added_tokens_path if added_tokens_path.exists() else None)
  File "./convert.py", line 376, in __init__
    self.sentencepiece_tokenizer = SentencePieceProcessor(str(fname_tokenizer))
TypeError: __init__() takes 1 positional argument but 2 were given
root@ch81:/storage/2sata/llama/llama.cpp# 

I downloaded spicyboros from HF via git / git lfs; convert.py is at b52b29a.

cebtenzzre commented 1 year ago

Edit: I'll finish tomorrow, it's like 5 in the morning and I can't see what I'm missing here:

Something is wrong with your sentencepiece install. Here's what mine looks like:

$ python3 -m pip show sentencepiece | grep Version
Version: 0.1.99
$ python3 -c 'import sentencepiece; print(sentencepiece.SentencePieceProcessor.__init__)'
<function SentencePieceProcessor.Init at 0x7feeb7464ae0>

python3 -m pip install sentencepiece==0.1.98 should fix it. If not, you may need to python3 -m pip uninstall sentencepiece first.

TheBloke commented 1 year ago

Thanks for looking at this guys.

I tried going back to an earlier commit, August 28th, shortly after GGUFv2 release - commit ebcee207b6058b7f695bb5c203ad87b1066a9790

I made a new FP16 from the convert.py from that commit, and made a new q4_0 of Spicyboros 70B 2.2

And it has exactly the same problem.

So I'm thinking this isn't a new problem caused by a recent commit. It's something broken with GGUF q4_0 only, and only on specific models. Which is very weird.

ggerganov commented 1 year ago

I guess Q4_0 is not good for quantizing this model - the weight distribution in the tensors seems unusual.

Here is what the quant histograms look like for vanilla LLaMA v2 70B:

[ 139/ 723]                 blk.15.attn_q.weight - [ 8192,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   128.00 MB ->    36.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[ 140/ 723]                 blk.15.attn_k.weight - [ 8192,  1024,     1,     1], type =    f16, quantizing to q4_0 .. size =    16.00 MB ->     4.50 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.020 
[ 141/ 723]                 blk.15.attn_v.weight - [ 8192,  1024,     1,     1], type =    f16, quantizing to q4_0 .. size =    16.00 MB ->     4.50 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.076 0.097 0.112 0.119 0.112 0.097 0.076 0.056 0.039 0.025 0.021 
[ 142/ 723]            blk.15.attn_output.weight - [ 8192,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   128.00 MB ->    36.00 MB | hist: 0.037 0.016 0.026 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.026 0.021 
[ 143/ 723]               blk.15.ffn_gate.weight - [ 8192, 28672,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[ 144/ 723]               blk.15.ffn_down.weight - [28672,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021 
[ 145/ 723]                 blk.15.ffn_up.weight - [ 8192, 28672,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[ 146/ 723]              blk.15.attn_norm.weight - [ 8192,     1,     1,     1], type =    f32, size =    0.031 MB
[ 147/ 723]               blk.15.ffn_norm.weight - [ 8192,     1,     1,     1], type =    f32, size =    0.031 MB
[ 148/ 723]                 blk.16.attn_q.weight - [ 8192,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   128.00 MB ->    36.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[ 149/ 723]                 blk.16.attn_k.weight - [ 8192,  1024,     1,     1], type =    f16, quantizing to q4_0 .. size =    16.00 MB ->     4.50 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.020 
[ 150/ 723]                 blk.16.attn_v.weight - [ 8192,  1024,     1,     1], type =    f16, quantizing to q4_0 .. size =    16.00 MB ->     4.50 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.076 0.096 0.112 0.118 0.112 0.096 0.077 0.056 0.039 0.025 0.021 
[ 151/ 723]            blk.16.attn_output.weight - [ 8192,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   128.00 MB ->    36.00 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.026 0.021 
[ 152/ 723]               blk.16.ffn_gate.weight - [ 8192, 28672,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[ 153/ 723]               blk.16.ffn_down.weight - [28672,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021 
[ 154/ 723]                 blk.16.ffn_up.weight - [ 8192, 28672,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[ 155/ 723]              blk.16.attn_norm.weight - [ 8192,     1,     1,     1], type =    f32, size =    0.031 MB
[ 156/ 723]               blk.16.ffn_norm.weight - [ 8192,     1,     1,     1], type =    f32, size =    0.031 MB
[ 157/ 723]                 blk.17.attn_q.weight - [ 8192,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   128.00 MB ->    36.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[ 158/ 723]                 blk.17.attn_k.weight - [ 8192,  1024,     1,     1], type =    f16, quantizing to q4_0 .. size =    16.00 MB ->     4.50 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.112 0.119 0.112 0.097 0.077 0.056 0.038 0.025 0.020 
[ 159/ 723]                 blk.17.attn_v.weight - [ 8192,  1024,     1,     1], type =    f16, quantizing to q4_0 .. size =    16.00 MB ->     4.50 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.077 0.096 0.112 0.118 0.112 0.096 0.077 0.056 0.039 0.025 0.021 
[ 160/ 723]            blk.17.attn_output.weight - [ 8192,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   128.00 MB ->    36.00 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.026 0.021 
[ 161/ 723]               blk.17.ffn_gate.weight - [ 8192, 28672,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[ 162/ 723]               blk.17.ffn_down.weight - [28672,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021 
[ 163/ 723]                 blk.17.ffn_up.weight - [ 8192, 28672,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 

Notice the Gaussian-shaped distribution with bin[0] storing the abs(max) of the blocks.

Here is what the histograms look like with Spicyboros:

[ 137/ 723]                 blk.15.attn_q.weight - [ 8192,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   128.00 MB ->    36.00 MB | hist: 0.044 0.000 0.043 0.025 0.063 0.073 0.081 0.143 0.090 0.143 0.081 0.074 0.068 0.015 0.043 0.016 
[ 138/ 723]                 blk.15.attn_k.weight - [ 8192,  1024,     1,     1], type =    f16, quantizing to q4_0 .. size =    16.00 MB ->     4.50 MB | hist: 0.044 0.000 0.043 0.024 0.063 0.072 0.079 0.145 0.091 0.145 0.079 0.072 0.068 0.015 0.043 0.016 
[ 139/ 723]                 blk.15.attn_v.weight - [ 8192,  1024,     1,     1], type =    f16, quantizing to q4_0 .. size =    16.00 MB ->     4.50 MB | hist: 0.044 0.000 0.044 0.024 0.063 0.071 0.079 0.145 0.090 0.145 0.079 0.072 0.069 0.015 0.044 0.016 
[ 140/ 723]            blk.15.attn_output.weight - [ 8192,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   128.00 MB ->    36.00 MB | hist: 0.044 0.000 0.045 0.025 0.064 0.071 0.078 0.144 0.088 0.144 0.078 0.072 0.070 0.015 0.045 0.016 
[ 141/ 723]               blk.15.ffn_gate.weight - [ 8192, 28672,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.044 0.000 0.044 0.025 0.063 0.074 0.081 0.141 0.088 0.141 0.081 0.075 0.068 0.015 0.044 0.016 
[ 142/ 723]                 blk.15.ffn_up.weight - [ 8192, 28672,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.044 0.000 0.044 0.025 0.064 0.071 0.078 0.144 0.088 0.144 0.078 0.073 0.070 0.015 0.045 0.016 
[ 143/ 723]               blk.15.ffn_down.weight - [28672,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.044 0.000 0.044 0.025 0.065 0.071 0.078 0.145 0.089 0.146 0.077 0.072 0.070 0.015 0.045 0.016 
[ 144/ 723]              blk.15.attn_norm.weight - [ 8192,     1,     1,     1], type =    f32, size =    0.031 MB
[ 145/ 723]               blk.15.ffn_norm.weight - [ 8192,     1,     1,     1], type =    f32, size =    0.031 MB
[ 146/ 723]                 blk.16.attn_q.weight - [ 8192,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   128.00 MB ->    36.00 MB | hist: 0.044 0.000 0.043 0.025 0.062 0.073 0.081 0.142 0.090 0.142 0.081 0.075 0.067 0.015 0.043 0.016 
[ 147/ 723]                 blk.16.attn_k.weight - [ 8192,  1024,     1,     1], type =    f16, quantizing to q4_0 .. size =    16.00 MB ->     4.50 MB | hist: 0.044 0.000 0.043 0.025 0.063 0.072 0.079 0.145 0.091 0.145 0.079 0.073 0.068 0.015 0.043 0.016 
[ 148/ 723]                 blk.16.attn_v.weight - [ 8192,  1024,     1,     1], type =    f16, quantizing to q4_0 .. size =    16.00 MB ->     4.50 MB | hist: 0.044 0.000 0.044 0.024 0.063 0.072 0.080 0.143 0.090 0.144 0.080 0.074 0.069 0.014 0.044 0.016 
[ 149/ 723]            blk.16.attn_output.weight - [ 8192,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   128.00 MB ->    36.00 MB | hist: 0.044 0.000 0.045 0.025 0.064 0.071 0.078 0.143 0.088 0.144 0.078 0.073 0.070 0.015 0.045 0.016 
[ 150/ 723]               blk.16.ffn_gate.weight - [ 8192, 28672,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.044 0.000 0.044 0.025 0.063 0.074 0.081 0.141 0.088 0.141 0.081 0.075 0.068 0.015 0.044 0.016 
[ 151/ 723]                 blk.16.ffn_up.weight - [ 8192, 28672,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.044 0.000 0.044 0.025 0.064 0.072 0.079 0.144 0.088 0.144 0.079 0.073 0.069 0.015 0.045 0.016 
[ 152/ 723]               blk.16.ffn_down.weight - [28672,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.044 0.000 0.044 0.025 0.065 0.071 0.078 0.145 0.089 0.146 0.077 0.072 0.070 0.015 0.045 0.016 
[ 153/ 723]              blk.16.attn_norm.weight - [ 8192,     1,     1,     1], type =    f32, size =    0.031 MB
[ 154/ 723]               blk.16.ffn_norm.weight - [ 8192,     1,     1,     1], type =    f32, size =    0.031 MB
[ 155/ 723]                 blk.17.attn_q.weight - [ 8192,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   128.00 MB ->    36.00 MB | hist: 0.044 0.000 0.043 0.025 0.063 0.073 0.080 0.143 0.089 0.143 0.080 0.074 0.068 0.015 0.043 0.016 
[ 156/ 723]                 blk.17.attn_k.weight - [ 8192,  1024,     1,     1], type =    f16, quantizing to q4_0 .. size =    16.00 MB ->     4.50 MB | hist: 0.044 0.000 0.043 0.024 0.063 0.071 0.079 0.145 0.091 0.145 0.079 0.073 0.069 0.015 0.043 0.016 
[ 157/ 723]                 blk.17.attn_v.weight - [ 8192,  1024,     1,     1], type =    f16, quantizing to q4_0 .. size =    16.00 MB ->     4.50 MB | hist: 0.044 0.000 0.044 0.025 0.064 0.071 0.078 0.145 0.090 0.146 0.078 0.072 0.069 0.015 0.044 0.016 
[ 158/ 723]            blk.17.attn_output.weight - [ 8192,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   128.00 MB ->    36.00 MB | hist: 0.044 0.000 0.045 0.025 0.064 0.071 0.078 0.144 0.088 0.144 0.078 0.073 0.070 0.015 0.045 0.016 
[ 159/ 723]               blk.17.ffn_gate.weight - [ 8192, 28672,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.044 0.000 0.043 0.025 0.062 0.075 0.083 0.139 0.089 0.139 0.083 0.076 0.067 0.016 0.043 0.017 
[ 160/ 723]                 blk.17.ffn_up.weight - [ 8192, 28672,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.044 0.000 0.044 0.025 0.063 0.073 0.080 0.143 0.088 0.143 0.080 0.074 0.069 0.015 0.044 0.016 
[ 161/ 723]               blk.17.ffn_down.weight - [28672,  8192,     1,     1], type =    f16, quantizing to q4_0 .. size =   448.00 MB ->   126.00 MB | hist: 0.044 0.000 0.044 0.025 0.065 0.071 0.078 0.145 0.089 0.146 0.077 0.072 0.070 0.015 0.045 0.016 
[ 162/ 723]              blk.17.attn_norm.weight - [ 8192,     1,     1,     1], type =    f32, size =    0.031 MB

bin[1] is pretty much empty and there are multiple peaks: bin[2], bin[7], bin[9], bin[14].
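
For context, here is a rough sketch of how Q4_0 assigns weights to these 16 bins (simplified from ggml's reference Q4_0 quantizer; fp16 scale storage and nibble packing are omitted):

#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int QK4_0 = 32;   // Q4_0 block size: 32 weights share one scale

// Quantize one block of 32 floats to 4-bit bin indices and update a 16-bin histogram.
void quantize_block_q4_0(const float * x, float & d_out, uint8_t * q, int64_t * hist) {
    float amax = 0.0f, max = 0.0f;
    for (int j = 0; j < QK4_0; ++j) {
        if (std::fabs(x[j]) > amax) { amax = std::fabs(x[j]); max = x[j]; }
    }
    const float d  = max / -8.0f;                  // scale taken from the signed abs-max of the block
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    d_out = d;
    for (int j = 0; j < QK4_0; ++j) {
        // the abs-max element maps to bin 0, values near zero land around bin 8
        const int qi = std::min(15, (int)(x[j] * id + 8.5f));
        q[j] = (uint8_t) qi;
        hist[qi]++;                                // after normalization these counts are the "hist:" columns above
    }
}

With roughly Gaussian weights this produces the bell-shaped histogram above, peaking around bins 7-9 and with bin 0 holding one abs-max per block. If the weights in a block are instead concentrated on a few discrete levels, only some bins get used, which matches the empty bin[1] and multiple peaks seen here.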

It might be useful to plot the weight distribution in some of the tensors to get a better idea of what is going on. Could be somewhat related to #2421

It would also be interesting to understand the specific reason Q4_0 breaks down in this way for this data, but that probably needs #2783 to be implemented first.

TheBloke commented 1 year ago

Ah this is interesting. I recall Jon Durbin telling me that he had implemented a suggestion from Tim Dettmers:

The 70B Jon Durbin models were made with qLoRA. But rather than merging the qLoRA adapter in 16-bit as usual, I believe he first quantised the source weights to 4-bit using BitsAndBytes and then merged the qLoRA in 4-bit, before saving in 16-bit. I then quantised the 16-bit weights as normal.

I believe this is the code Jon used, which is based on Tim's suggestion: https://gist.github.com/ChrisHayduk/1a53463331f52dca205e55982baf9930

In hindsight, that is almost certainly what's different about Jon's recent 70Bs and what's causing the GGUF Q4_0 to break.

@jondurbin could you confirm that I'm remembering correctly that you're following this new Tim Dettmers procedure for your 70B models?

Apparently this method will soon be available in HF PEFT, so the practice is going to become commonplace and this is likely to be an ongoing issue.

I will stop making Q4_0 for Jon Durbin's 70B models for now, and keep an eye out for this happening with models from other creators too.

jondurbin commented 1 year ago

@jondurbin could you confirm that I'm remembering correctly that you're following this new Tim Dettmers procedure for your 70B models?

Indeed, here's the exact script I used: https://github.com/jondurbin/qlora/blob/main/qmerge.py

Specifically:

python qlora/qmerge.py \
  --base llama-2-70b-hf \
  --peft spicyboros-70b-2.2-checkpoints/checkpoint-750/model_adapter \
  --out spicyboros-70b-2.2

I can upload a non-prequantized merge version too, let me know.

jondurbin commented 1 year ago

Can confirm a regular merge with main llama.cpp works fine with q4_0.

staviq commented 1 year ago

python3 -m pip install sentencepiece==0.1.98 should fix it. If not, you may need to python3 -m pip uninstall sentencepiece first.

@Cebtenzzre It did, thank you.

I ran the q4_0 quant on spicyboros and got an identical checksum to TheBloke's. I then went back through commits, and the resulting q4_0 of that model is broken in the same way all the way back to, and including, d0cee0d. Commits further back were segfaulting for me at some unaligned SSSE3 ops, so I couldn't test those. I went backwards through commits, so at one point I was trying to quantize a post-GGUFv2 conversion on a pre-v2 commit; ignore that part :)

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.