Not sure if it's related, but I checked and it makes no difference. \n is not being processed; off the top of my head, I think you need -e for that.
I also noticed that the token candidates are almost identical for each (next) token:
1st:
[1694552218] top 10 candidates:
[1694552218] - 23747: ' brie' (0,103)
[1694552218] - 30322: ' ◄' (0,070)
[1694552218] - 30777: ' Ý' (0,038)
[1694552218] - 12964: ' iglia' (0,037)
[1694552218] - 7776: ' cab' (0,037)
[1694552218] - 21096: ' genommen' (0,035)
[1694552218] - 25168: ' teck' (0,032)
[1694552218] - 7201: ' gres' (0,031)
[1694552218] - 8749: ' eria' (0,031)
[1694552218] - 13716: ' rob' (0,031)
...
2nd:
[1694552270] top 10 candidates:
[1694552270] - 23747: ' brie' (0,104)
[1694552270] - 30322: ' ◄' (0,071)
[1694552270] - 30777: ' Ý' (0,038)
[1694552270] - 12964: ' iglia' (0,038)
[1694552270] - 7776: ' cab' (0,037)
[1694552270] - 21096: ' genommen' (0,035)
[1694552270] - 25168: ' teck' (0,032)
[1694552270] - 7201: ' gres' (0,032)
[1694552270] - 13716: ' rob' (0,031)
[1694552270] - 8749: ' eria' (0,031)
The model is definitely broken - the ppl of the first blocks is [1]221713.5241,[2]187850.4390,[3]177167.9363,...
I just made another for https://huggingface.co/TheBloke/Spicyboros-70B-2.2-GGUF/blob/main/spicyboros-70b-2.2.Q4_0.gguf, this time with commit 4f7cd6ba9c88d3ca9a207b6e04f8b2b1efd707b8
File came out identical - same sha256sum - and of course therefore the same gibberish output.
Very odd!
I don't know if it's of any help, but here's the full log of making the new q4_0 - first making the FP16, then the q4_0. The FP16 we know is fine, because all the other quants are fine: https://gist.github.com/TheBloke/6fe3bb4d870e45c97acb71772906caaf#file-quant-spicyboros-q4_0-log
For what it is worth, I looked at the mean, min and max of each tensor and compared it to the Q4_K_S model and I didn't see anything obviously out of place. The tokenizer also looks fine.
That's the only change in quantize.cpp in the last week (5 days ago): https://github.com/ggerganov/llama.cpp/commit/00d62adb79bf914a95fb9a2e8f42f3029e76d62c#diff-6745585c496560d324d1f0d6d77beebcb6dd9c3354bef41ab262535a87a376a7 (was that else if -> if change intended?)
Other than that, all changes were cosmetic, all the way back to the gguf merge. So whatever got borked, it's in one of those.
@Cebtenzzre That commit was about gcc warning fixes, and yet that is a functional change - wasn't that else { if(){} } supposed to be else if () {}? https://github.com/ggerganov/llama.cpp/commit/00d62adb79bf914a95fb9a2e8f42f3029e76d62c#diff-6745585c496560d324d1f0d6d77beebcb6dd9c3354bef41ab262535a87a376a7
That commit was about gcc warning fixes, and yet that is a functional change - wasn't that else { if(){} } supposed to be else if () {}?
No. If the condition is true, the function returns, so the only way to get to that line is if the condition was false - the 'else' is unnecessary.
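To make the equivalence concrete, here's the pattern in schematic form (Python for brevity; this is not the actual quantize.cpp code):

```python
def with_else(a: bool, b: bool) -> str:
    if a:
        return "first"   # leaves the function immediately
    elif b:              # only reachable when a is False...
        return "second"
    return "neither"

def without_else(a: bool, b: bool) -> str:
    if a:
        return "first"
    if b:                # ...so a plain 'if' runs under exactly the
        return "second"  # same condition: a is False and b is True
    return "neither"

# The two are equivalent for every input:
assert all(with_else(a, b) == without_else(a, b)
           for a in (False, True) for b in (False, True))
```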
I ran quantize f16 -> Q4_0 on open-llama-3b-v2-f16 on commits ebcee20 to current and got an identical checksum every time, so it seems specific to 70B.
Does anybody have a link to an f16 of any of the mentioned models? I can run a script overnight to find out whether the checksum changes between commits.
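A minimal sketch of what I have in mind (the commit list, paths, and filenames here are placeholders; assumes the Makefile build and the usual ./quantize in.gguf out.gguf q4_0 invocation):

```python
import hashlib
import subprocess

COMMITS = ["d54a402", "21ac3a1", "ebcee20"]  # hypothetical commits to test
F16_IN = "/models/test-f16.gguf"             # placeholder paths
Q4_OUT = "/models/test-q4_0.gguf"

def sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for commit in COMMITS:
    subprocess.run(["git", "checkout", commit], check=True)
    subprocess.run(["make", "clean"], check=True)
    subprocess.run(["make", "quantize"], check=True)
    subprocess.run(["./quantize", F16_IN, Q4_OUT, "q4_0"], check=True)
    print(commit, sha256(Q4_OUT))  # a checksum change pinpoints the commit
```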
Does anybody have a link to an f16 of any of the mentioned models?
Ones that reproduce the gibberish: https://huggingface.co/jondurbin/spicyboros-70b-2.2 https://huggingface.co/jondurbin/airoboros-l2-70b-2.1-creative
One that was apparently OK on an earlier commit: https://huggingface.co/fangloveskari/ORCA_LLaMA_70B_QLoRA
Yes, I've seen those, but aren't they raw f32? That's not a problem; it's just that with f16 I could run wget && script right now, whereas with raw I'd have to convert them in the morning and results would probably come tomorrow evening.
Edit: It's not that bad - HF isn't throttling much this time, only a 20 min download.
Yes, I've seen those, but aren't they raw f32?
No, 145GiB 70B should be fp16. I think most HF uploads are. Compare to TheBloke/Llama-2-70B-fp16.
Ok.
Edit: I'll finish tomorrow, it's like 5 in the morning and I can't see what I'm missing here:
root@ch81:/storage/2sata/llama/llama.cpp# python3 ./convert.py --outtype f16 --outfile test.gguf "/storage/2sata/llama/spicyboros-70b-2.2/"
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00001-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00001-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00002-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00003-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00004-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00005-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00006-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00007-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00008-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00009-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00010-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00011-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00012-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00013-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00014-of-00015.bin
Loading model file /storage/2sata/llama/spicyboros-70b-2.2/pytorch_model-00015-of-00015.bin
params = Params(n_vocab=32000, n_embd=8192, n_layer=80, n_ctx=4096, n_ff=28672, n_head=64, n_head_kv=8, f_norm_eps=1e-05, f_rope_freq_base=10000.0, f_rope_scale=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('/storage/2sata/llama/spicyboros-70b-2.2'))
Loading vocab file '/storage/2sata/llama/spicyboros-70b-2.2/tokenizer.model', type 'spm'
Traceback (most recent call last):
  File "./convert.py", line 1208, in <module>
    main()
  File "./convert.py", line 1190, in main
    vocab = load_vocab(vocab_dir, args.vocabtype)
  File "./convert.py", line 1101, in load_vocab
    return SentencePieceVocab(path, added_tokens_path if added_tokens_path.exists() else None)
  File "./convert.py", line 376, in __init__
    self.sentencepiece_tokenizer = SentencePieceProcessor(str(fname_tokenizer))
TypeError: __init__() takes 1 positional argument but 2 were given
root@ch81:/storage/2sata/llama/llama.cpp#
I downloaded spicyboros from HF through git / git lfs; convert.py is at b52b29a.
Edit: I'll finish tomorrow, it's like 5 in the morning and I can't see what I'm missing here:
Something is wrong with your sentencepiece install. Here's what mine looks like:
$ python3 -m pip show sentencepiece | grep Version
Version: 0.1.99
$ python3 -c 'import sentencepiece; print(sentencepiece.SentencePieceProcessor.__init__)'
<function SentencePieceProcessor.Init at 0x7feeb7464ae0>
python3 -m pip install sentencepiece==0.1.98 should fix it. If not, you may need to python3 -m pip uninstall sentencepiece first.
Thanks for looking at this, guys.
I tried going back to an earlier commit, August 28th, shortly after GGUFv2 release - commit ebcee207b6058b7f695bb5c203ad87b1066a9790
I made a new FP16 from the convert.py from that commit, and made a new q4_0 of Spicyboros 70B 2.2
And it has exactly the same problem.
So I'm thinking this isn't a new problem caused by a recent commit. It's something broken with GGUF q4_0 only, on specific models only. Which is very weird.
I guess Q4_0 is not good for quantizing this model - the weight distribution in the tensors seems unusual.
Here is how the quant histograms look for vanilla LLaMA v2 70B:
[ 139/ 723] blk.15.attn_q.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021
[ 140/ 723] blk.15.attn_k.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.020
[ 141/ 723] blk.15.attn_v.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.076 0.097 0.112 0.119 0.112 0.097 0.076 0.056 0.039 0.025 0.021
[ 142/ 723] blk.15.attn_output.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.037 0.016 0.026 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.026 0.021
[ 143/ 723] blk.15.ffn_gate.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 144/ 723] blk.15.ffn_down.weight - [28672, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021
[ 145/ 723] blk.15.ffn_up.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 146/ 723] blk.15.attn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 147/ 723] blk.15.ffn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 148/ 723] blk.16.attn_q.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.039 0.025 0.021
[ 149/ 723] blk.16.attn_k.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.020
[ 150/ 723] blk.16.attn_v.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.076 0.096 0.112 0.118 0.112 0.096 0.077 0.056 0.039 0.025 0.021
[ 151/ 723] blk.16.attn_output.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.026 0.021
[ 152/ 723] blk.16.ffn_gate.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 153/ 723] blk.16.ffn_down.weight - [28672, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021
[ 154/ 723] blk.16.ffn_up.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 155/ 723] blk.16.attn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 156/ 723] blk.16.ffn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 157/ 723] blk.17.attn_q.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021
[ 158/ 723] blk.17.attn_k.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.112 0.119 0.112 0.097 0.077 0.056 0.038 0.025 0.020
[ 159/ 723] blk.17.attn_v.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.077 0.096 0.112 0.118 0.112 0.096 0.077 0.056 0.039 0.025 0.021
[ 160/ 723] blk.17.attn_output.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.026 0.021
[ 161/ 723] blk.17.ffn_gate.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 162/ 723] blk.17.ffn_down.weight - [28672, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021
[ 163/ 723] blk.17.ffn_up.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
Notice the Gaussian-shaped distribution, with bin[0] storing the abs(max) of the blocks.
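For reference, a minimal numpy sketch of the Q4_0 block scheme as I read the reference quantizer (blocks of 32 weights, one scale, 4-bit codes 0..15; an approximation, not a drop-in copy), showing why the abs(max) always lands in bin[0]:

```python
import numpy as np

QK = 32  # Q4_0 block size

def q4_0_codes(block: np.ndarray) -> np.ndarray:
    """Quantize one block of 32 floats to 4-bit codes 0..15."""
    # value with the largest magnitude, sign preserved
    max_val = block[np.argmax(np.abs(block))]
    d = max_val / -8.0                    # per-block scale
    inv_d = 0.0 if d == 0 else 1.0 / d
    # the abs-max weight maps to -8, so its code is -8 + 8.5 -> 0: bin[0];
    # dequantization recovers roughly d * (code - 8)
    return np.minimum(15, (block * inv_d + 8.5).astype(np.int8))

rng = np.random.default_rng(0)
gaussian_block = rng.normal(0.0, 1.0, QK).astype(np.float32)
# A Gaussian block fills the remaining bins in a bell shape around code 8,
# matching the histograms above.
print(np.bincount(q4_0_codes(gaussian_block), minlength=16))
```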
Here is how the histograms look with spicyboros:
[ 137/ 723] blk.15.attn_q.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.044 0.000 0.043 0.025 0.063 0.073 0.081 0.143 0.090 0.143 0.081 0.074 0.068 0.015 0.043 0.016
[ 138/ 723] blk.15.attn_k.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.044 0.000 0.043 0.024 0.063 0.072 0.079 0.145 0.091 0.145 0.079 0.072 0.068 0.015 0.043 0.016
[ 139/ 723] blk.15.attn_v.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.044 0.000 0.044 0.024 0.063 0.071 0.079 0.145 0.090 0.145 0.079 0.072 0.069 0.015 0.044 0.016
[ 140/ 723] blk.15.attn_output.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.044 0.000 0.045 0.025 0.064 0.071 0.078 0.144 0.088 0.144 0.078 0.072 0.070 0.015 0.045 0.016
[ 141/ 723] blk.15.ffn_gate.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.063 0.074 0.081 0.141 0.088 0.141 0.081 0.075 0.068 0.015 0.044 0.016
[ 142/ 723] blk.15.ffn_up.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.064 0.071 0.078 0.144 0.088 0.144 0.078 0.073 0.070 0.015 0.045 0.016
[ 143/ 723] blk.15.ffn_down.weight - [28672, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.065 0.071 0.078 0.145 0.089 0.146 0.077 0.072 0.070 0.015 0.045 0.016
[ 144/ 723] blk.15.attn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 145/ 723] blk.15.ffn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 146/ 723] blk.16.attn_q.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.044 0.000 0.043 0.025 0.062 0.073 0.081 0.142 0.090 0.142 0.081 0.075 0.067 0.015 0.043 0.016
[ 147/ 723] blk.16.attn_k.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.044 0.000 0.043 0.025 0.063 0.072 0.079 0.145 0.091 0.145 0.079 0.073 0.068 0.015 0.043 0.016
[ 148/ 723] blk.16.attn_v.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.044 0.000 0.044 0.024 0.063 0.072 0.080 0.143 0.090 0.144 0.080 0.074 0.069 0.014 0.044 0.016
[ 149/ 723] blk.16.attn_output.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.044 0.000 0.045 0.025 0.064 0.071 0.078 0.143 0.088 0.144 0.078 0.073 0.070 0.015 0.045 0.016
[ 150/ 723] blk.16.ffn_gate.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.063 0.074 0.081 0.141 0.088 0.141 0.081 0.075 0.068 0.015 0.044 0.016
[ 151/ 723] blk.16.ffn_up.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.064 0.072 0.079 0.144 0.088 0.144 0.079 0.073 0.069 0.015 0.045 0.016
[ 152/ 723] blk.16.ffn_down.weight - [28672, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.065 0.071 0.078 0.145 0.089 0.146 0.077 0.072 0.070 0.015 0.045 0.016
[ 153/ 723] blk.16.attn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 154/ 723] blk.16.ffn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 155/ 723] blk.17.attn_q.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.044 0.000 0.043 0.025 0.063 0.073 0.080 0.143 0.089 0.143 0.080 0.074 0.068 0.015 0.043 0.016
[ 156/ 723] blk.17.attn_k.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.044 0.000 0.043 0.024 0.063 0.071 0.079 0.145 0.091 0.145 0.079 0.073 0.069 0.015 0.043 0.016
[ 157/ 723] blk.17.attn_v.weight - [ 8192, 1024, 1, 1], type = f16, quantizing to q4_0 .. size = 16.00 MB -> 4.50 MB | hist: 0.044 0.000 0.044 0.025 0.064 0.071 0.078 0.145 0.090 0.146 0.078 0.072 0.069 0.015 0.044 0.016
[ 158/ 723] blk.17.attn_output.weight - [ 8192, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 128.00 MB -> 36.00 MB | hist: 0.044 0.000 0.045 0.025 0.064 0.071 0.078 0.144 0.088 0.144 0.078 0.073 0.070 0.015 0.045 0.016
[ 159/ 723] blk.17.ffn_gate.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.043 0.025 0.062 0.075 0.083 0.139 0.089 0.139 0.083 0.076 0.067 0.016 0.043 0.017
[ 160/ 723] blk.17.ffn_up.weight - [ 8192, 28672, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.063 0.073 0.080 0.143 0.088 0.143 0.080 0.074 0.069 0.015 0.044 0.016
[ 161/ 723] blk.17.ffn_down.weight - [28672, 8192, 1, 1], type = f16, quantizing to q4_0 .. size = 448.00 MB -> 126.00 MB | hist: 0.044 0.000 0.044 0.025 0.065 0.071 0.078 0.145 0.089 0.146 0.077 0.072 0.070 0.015 0.045 0.016
[ 162/ 723] blk.17.attn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
bin[1] is pretty much empty and there are multiple peaks: bin[2], bin[7], bin[9], bin[14].
It might be useful to plot the weight distribution in some of the tensors to get a better idea of what is going on. Could be somewhat related to #2421
It would also be interesting to understand the specific reason why Q4_0 breaks down in such a way for this data, but we'd probably need to implement #2783.
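In case it helps with the plotting idea, a rough sketch (the shard filename and tensor name are placeholders - which shard actually holds the tensor will vary):

```python
import matplotlib.pyplot as plt
import torch

# one shard of the HF checkpoint (placeholder path)
state = torch.load("pytorch_model-00003-of-00015.bin", map_location="cpu")

name = "model.layers.15.self_attn.q_proj.weight"  # placeholder tensor name
w = state[name].float().flatten().numpy()

plt.hist(w, bins=200, log=True)  # log scale makes the tails visible
plt.title(name)
plt.xlabel("weight value")
plt.ylabel("count (log)")
plt.savefig("weight_hist.png")
```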
Ah this is interesting. I recall Jon Durbin telling me that he had implemented a suggestion from Tim Dettmers:
The 70B Jon Durbin models were made with qLoRA. But rather than merging the qLoRA adapter in 16-bit as usual, I believe he first quantised the source weights to 4-bit using BitsAndBytes and then merged the qLoRA in 4-bit, before saving in 16-bit. I then quantised the 16-bit weights as normal.
I believe this is the code Jon used, which is based on Tim's suggestion: https://gist.github.com/ChrisHayduk/1a53463331f52dca205e55982baf9930
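For anyone skimming, the shape of that procedure in a conceptual numpy sketch (not Jon's or Tim's actual code; bitsandbytes' NF4 is simplified here to a generic uniform 4-bit round-trip):

```python
import numpy as np

def fake_4bit_roundtrip(w: np.ndarray, block: int = 64) -> np.ndarray:
    """Stand-in for bnb's NF4 quantize+dequantize: per-block absmax
    scaling to 16 levels (real NF4 uses non-uniform levels)."""
    flat = w.astype(np.float32).ravel()
    for i in range(0, flat.size, block):
        blk = flat[i:i + block]                        # view, edited in place
        scale = float(np.abs(blk).max()) / 7.0 or 1.0  # avoid div by zero
        blk[:] = np.round(blk / scale).clip(-8, 7) * scale
    return flat.reshape(w.shape)

rng = np.random.default_rng(0)
base = rng.normal(0, 0.02, (256, 256)).astype(np.float32)
A = rng.normal(0, 0.01, (8, 256)).astype(np.float32)  # toy LoRA factors
B = rng.normal(0, 0.01, (256, 8)).astype(np.float32)

q = fake_4bit_roundtrip(base)
print("distinct values in one 64-weight block:", np.unique(q.ravel()[:64]).size)
merged = q + B @ A  # quantize-then-merge: this is what gets saved as fp16
```

The saved fp16 tensor is then essentially a 16-level-per-block lattice plus a small low-rank perturbation, which would plausibly produce the spiky, bin-skipping Q4_0 histograms above.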
In hindsight, that is almost certainly what's different about Jon's recent 70Bs, and what's causing GGUF 70B Q4_0 to break.
@jondurbin could you confirm that I'm remembering correctly that you're following this new Tim Dettmers procedure for your 70B models?
Apparently this method will soon be available in HF PEFT, so this practice is going to become commonplace, meaning this is likely to be an ongoing issue.
I will stop making Q4_0 for 70B Jon Durbin models for now, and keep an eye on this happening for models from other creators too.
@jondurbin could you confirm that I'm remembering correctly that you're following this new Tim Dettmers procedure for your 70B models?
Indeed, here's the exact script I used: https://github.com/jondurbin/qlora/blob/main/qmerge.py
Specifically:
python qlora/qmerge.py \
--base llama-2-70b-hf \
--peft spicyboros-70b-2.2-checkpoints/checkpoint-750/model_adapter \
--out spicyboros-70b-2.2
I can upload a non-prequantized merge version too, let me know.
Can confirm a regular merge with main llama.cpp works fine with q4_0.
python3 -m pip install sentencepiece==0.1.98 should fix it. If not, you may need to python3 -m pip uninstall sentencepiece first.
@Cebtenzzre It did, thank you.
I ran the q4_0 quant on spicyboros and got an identical checksum to TheBloke's.
I then went through the commits, and the resulting q4_0 of that model is broken in the same way all the way back to, and including, d0cee0d. Commits further back were segfaulting for me on some unaligned SSE ops, so I couldn't test those. (I went backwards through the commits, so I was trying to quantize a post-GGUFv2 conversion on a pre-v2 commit - ignore that part :)
This issue was closed because it has been inactive for 14 days since being marked as stale.
Hi guys
I've just had reports that two specific Q4_0 70B models are outputting gibberish, and I've confirmed the same.
Example file with this issue: https://huggingface.co/TheBloke/Spicyboros-70B-2.2-GGUF/blob/main/spicyboros-70b-2.2.Q4_0.gguf Second example, made 12 days ago: https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-Creative-GGUF/blob/main/airoboros-l2-70b-2.1-creative.Q4_0.gguf
I've had no reports of problems with other quants. I've tested Q4_K_M and Q5_0 from the same model and commit, and both were fine.
The bad Spicyboros q4_0 was made with commit d54a402.
At first I thought it was a recent problem until I realised there was also a file from 12 days ago with the same issue.
But a 70B q4_0 I made three days ago, with commit 21ac3a1, is fine: https://huggingface.co/TheBloke/ORCA_LLaMA_70B_QLoRA-GGUF/blob/main/orca_llama_70b_qlora.Q4_0.gguf
I notice both broken models were made by Jon Durbin - could there be something in the source model causing this? But only for q4_0? That's weird.
Full output when testing the Spicyboros 70B Q4_0 gguf file (too long to post in one comment!): https://gist.github.com/TheBloke/b7a45d3e5ff1432f90aa221de6a5fb08#file-q4_0-gibberish-log
Trimmed log: