I can take a stab at it. Been meaning to dive deeper into the GGML format.
Since convert.py only does the GGML conversion and quantize is called explicitly for quantization, in theory only convert.py will need to be modified.
Would an existing model (HF/PyTorch) serve as a good starting point?
Trying to add this to convert.py may be overkill, and a lot harder than it needs to be. Writing a standalone script would probably be a lot easier.
The easiest to understand description of the file format is probably in the training example here: https://github.com/ggerganov/llama.cpp/blob/41c674161fb2459bdf7806d1eebead15bc5d046e/examples/train-text-from-scratch/train-text-from-scratch.cpp#L2609
@ggerganov why not use the safetensors format? It seems way more practical than custom binary ggml formats.
@Mistobaan See this note in the spec of the upcoming gguf file format gguf.md#why-not-other-formats and PR https://github.com/ggerganov/ggml/pull/302
Began a super-WIP (not completely functional) attempt at this here; will update as I go along, piecing together the corresponding variables between the two.
@byte-6174 nice - I took a similar approach. Currently finding myself going deep down the rabbit hole of converting llama2.c tensors, which use calloc-based memory assignment (with no metadata afaik), to ggml_tensor.
Also, I don't quite understand yet why the vocab needs to be saved in the model when it is also available in an external file? In any case, I believe llama2.c is using the exact same format for the vocab.
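For anyone else digging through the same thing, here is a minimal sketch of reading the llama2.c checkpoint header in Python. It assumes the original export format (seven int32 hyperparameters followed by flat float32 weight arrays with no per-tensor metadata); the field names and order here are from memory, so verify them against run.c:

```python
import struct

def read_llama2c_header(path):
    # Assumed layout (verify against llama2.c's run.c): seven int32 hyperparameters,
    # then the float32 weight arrays back-to-back with no per-tensor metadata.
    with open(path, "rb") as f:
        fields = struct.unpack("7i", f.read(7 * 4))
    names = ("dim", "hidden_dim", "n_layers", "n_heads",
             "n_kv_heads", "vocab_size", "seq_len")
    return dict(zip(names, fields))
```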
I’ll end this update with a few words in the voice of Yoda, “Long journey, it is. Learn, I must.”
Re. mapping: can someone with more experience with llama.cpp tensors point to how these RoPE tensors should be mapped? Indicated with ? in the mapping md file above ☝️
Not 100% sure, but I believe these are lookup tables for the RoPE, and are not necessary for llama.cpp.
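(If these are the usual freq_cis cos/sin tables from llama2.c, which is my assumption here, they contain nothing that cannot be recomputed from the hyperparameters. A minimal numpy sketch of what such a table holds:)

```python
import numpy as np

def rope_freq_tables(seq_len, head_dim, theta=10000.0):
    # cos/sin of position * inverse frequency -- a pure function of the
    # hyperparameters, so the table carries no learned information and
    # can always be recomputed on the fly.
    inv_freq = 1.0 / (theta ** (np.arange(0, head_dim, 2) / head_dim))
    angles = np.outer(np.arange(seq_len), inv_freq)
    return np.cos(angles), np.sin(angles)
```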
Right, I looked at the llama2.c code and it's surely for RoPE - good to know it's not needed for llama.cpp. I can remove it. I now want to run this model, any pointers? Digging now..
You can run the most basic inference using: ./main -m converted-model.bin
Got it, it runs pretty well! Perhaps now I can quantize it... The output seems to contain non-English words... humm..
It gives 359 tok/sec vs. llama2.c's ~100 tok/sec (perhaps not a fair comparison though).
main: build = 909 (5a87675)
main: seed = 1690733963
llama.cpp: loading model from abcd1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 288
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 6
llama_model_load_internal: n_head_kv = 6
llama_model_load_internal: n_layer = 6
llama_model_load_internal: n_rot = 48
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 768
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.02 MB
llama_model_load_internal: mem required = 395.13 MB (+ 3.38 MB per state)
llama_new_context_with_model: kv self size = 3.38 MB
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 200, n_keep = 0
One day, Lily met a Shoggoth anyonefather whilem enjo but gl than the acc sn ahead nap very off ru sc things People table fasterer beet wants hiding anywheresklär customer al Contact round weeks sad sick someone somethingore tagBeil as attached where mine recommend ourselvesanalipirst leako each each that reck deal laterchy pair Then dar all blocking Whenrai accomplished yourself backrain this Hamb dr slehetableView short b looked alsoch goal d who down down downnight check out bird calows helper home walliling patch agree je either of whole cl heavily visited visit p up up up up up off headD vCODEBack Withoutotoss timeys pat On new warm compared corner kitcheniously front раз happenionungsplaceĕiveness belong slower run running toten replace a away, MangWe these mention topakesFound herself down if courageished facts rocksepar hear oneessU as b A meanshed turn I Or laugh save without out was destroyunt doA bur byണ University shop' the chance alone alone alone alone someone particular
llama_print_timings: load time = 115.33 ms
llama_print_timings: sample time = 143.12 ms / 200 runs ( 0.72 ms per token, 1397.48 tokens per second)
llama_print_timings: prompt eval time = 10.21 ms / 12 tokens ( 0.85 ms per token, 1175.20 tokens per second)
llama_print_timings: eval time = 409.92 ms / 199 runs ( 2.06 ms per token, 485.46 tokens per second)
llama_print_timings: total time = 581.63 ms
Humm, one other possible reason for the "non-English" words could be that the vocabs are not matching.
Nah, something else is wrong. First try adding -eps 1e-5 to match the RMS norm implementation.
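(For context, a minimal sketch of the RMS norm in question, not llama.cpp's actual kernel, just to show where the epsilon enters:)

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    # The epsilon is the small stabilizer under the square root; llama2.c uses
    # 1e-5 while the default run above used 1e-6, hence the -eps flag.
    return x / np.sqrt(np.mean(x * x) + eps) * weight
```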
The default run above has eps = 1e-6; this 👇 is with 1e-5 as you suggest:
main: build = 909 (5a87675)
main: seed = 1690736050
llama.cpp: loading model from abcd1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 288
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 6
llama_model_load_internal: n_head_kv = 6
llama_model_load_internal: n_layer = 6
llama_model_load_internal: n_rot = 48
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-05
llama_model_load_internal: n_ff = 768
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.02 MB
llama_model_load_internal: mem required = 395.13 MB (+ 3.38 MB per state)
llama_new_context_with_model: kv self size = 3.38 MB
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 200, n_keep = 0
One day, Lily met a Shoggothvenatience quicker d looks on des glassappendoredpping early=( along lap ji our Your back of older shwn which happen out tried valleives world swins stessa chairaringtain takes one vendinganceric butYesmy set choiceveryely Next crack each each that everywhereph front downfжда fought feelions enjoy surezy sho r straightstraßeruityoll fesewer closed two v homes wider restaurant fed goerg free you kA basedShowTherevelopeiestpl too came glolass happ thoughtrefixomyber becomingselves keptored by tw belonged home stad watch hop exchange guys simplest situation Today looked on westWhere the disappawvenite tall sameve 'lowrownyn Tom they visitors card block laughed it continued decision anywhereMe justero fresh squ fastert whoseWhere herself startches overOfass bet'urr meps precedsedoc inspired about describe sharing l saidraiслав between silentselves warm thisEV eight plateing wish knewizeS different pop any anything burn graO response
llama_print_timings: load time = 43.53 ms
llama_print_timings: sample time = 143.49 ms / 200 runs ( 0.72 ms per token, 1393.83 tokens per second)
llama_print_timings: prompt eval time = 5.99 ms / 12 tokens ( 0.50 ms per token, 2003.34 tokens per second)
llama_print_timings: eval time = 566.60 ms / 199 runs ( 2.85 ms per token, 351.22 tokens per second)
llama_print_timings: total time = 734.73 ms
I'm printing the model->norm that is saved in ggml vs. w->rms_final_weight from llama2.c, and they seem to match.
llama2.c first 5 elements of w->rms_final_weight >>
7.676849 7.187980 9.270302 6.815886 7.080070
vs. ggml's model->norm >>
7.676849 7.187980 9.270302 6.815886 7.080070
We haven't run any F32 models with llama.cpp yet, so it is possible that there is a bug, only for the F32 format, that we haven't observed yet. To rule out this possibility, try converting the model to F16 with the following command:
./quantize abcd1.bin abcd1-f16.bin f16
and see if the new abcd1-f16.bin model also outputs nonsense.
Yes, still nonsense. I'm currently investigating how the FF weights w1, w2, w3 are laid out in memory.
In llama2.c we have:
w1 --- layer x hidden_dim x dim
w2 --- layer x dim x hidden_dim
w3 --- layer x hidden_dim x dim
Looking more to see if I'm making a mistake putting them into the tensors in the right order...
Aah! I found a bug. I was not using the right multiplier to reshape the 1D arrays from llama2.c into 2D arrays in ggml!
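For anyone hitting the same thing, the gist of the fix: each llama2.c weight is one flat float32 array covering all layers, so the per-layer slice size has to be the full rows x cols element count. A rough numpy sketch (names are illustrative, not the actual script):

```python
import numpy as np

def split_layers(flat, n_layers, rows, cols):
    # flat holds n_layers * rows * cols floats back-to-back; the "multiplier"
    # that has to be right is rows * cols, the per-layer element count.
    per_layer = rows * cols
    return [flat[l * per_layer:(l + 1) * per_layer].reshape(rows, cols)
            for l in range(n_layers)]

# e.g. w1 for this model: 6 layers of hidden_dim x dim = 768 x 288
# w1_layers = split_layers(w1_flat, n_layers=6, rows=768, cols=288)
```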
The output looks much better now and is comparable to what we get from llama2.c!
(py38) ➜ llama.cpp git:(master) ✗ ./main -m abcd1.bin -p "One day, Lily met a Shoggoth" -n 200
main: build = 909 (5a87675)
main: seed = 1690770877
llama.cpp: loading model from abcd1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 288
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 6
llama_model_load_internal: n_head_kv = 6
llama_model_load_internal: n_layer = 6
llama_model_load_internal: n_rot = 48
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 768
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.02 MB
llama_model_load_internal: mem required = 395.13 MB (+ 3.38 MB per state)
llama_new_context_with_model: kv self size = 3.38 MB
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 200, n_keep = 0
One day, Lily met a Shoggoth. It was big and shiny and very original. Lily asked the Shoggamkeeper what it was.
"It's a special kind of toy," the Shter master said. "But it's mine, not yours."
Lily touched the shirt and said, "Please take care of it?"
The Shapeububber was very happy to have this special toy. He granted Lily some money and told her to do as he asked.
Lily thanked him and took the shirt home with her. She looked at it and saw that it had a big number 1 on it. She was so excited!
"Thank you for saving me," she said to the Shapehanger. "I will take good care of this toy from now on."
The Shapehub smiled as he watched Lily keep her special toy. Once upon a time, there was a little girl named L
llama_print_timings: load time = 64.88 ms
llama_print_timings: sample time = 143.15 ms / 200 runs ( 0.72 ms per token, 1397.16 tokens per second)
llama_print_timings: prompt eval time = 9.08 ms / 12 tokens ( 0.76 ms per token, 1321.59 tokens per second)
llama_print_timings: eval time = 340.74 ms / 199 runs ( 1.71 ms per token, 584.03 tokens per second)
llama_print_timings: total time = 510.47 ms
and here with quantization:
(py38) ➜ llama.cpp git:(master) ✗ ./main -m abcd1-f16.bin -p "One day, Lily met a Shoggoth" -n 500
main: build = 909 (5a87675)
main: seed = 1690770940
llama.cpp: loading model from abcd1-f16.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 288
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 6
llama_model_load_internal: n_head_kv = 6
llama_model_load_internal: n_layer = 6
llama_model_load_internal: n_rot = 48
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 768
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 1 (mostly F16)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.02 MB
llama_model_load_internal: mem required = 348.58 MB (+ 3.38 MB per state)
llama_new_context_with_model: kv self size = 3.38 MB
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 500, n_keep = 0
One day, Lily met a Shoggoth. It was very small and shiny and had many buttons on it. Lily liked the shirt and smiled at the shaving Shog.
"Wow," she said. "That's an unusual shirt. Can I try it?"
The Shapey said, "No, this is my shirt. You can't touch it. It's mine."
Lily was sad and angry. She tried to take the shirt from the Shoggborn. The Shogocin fell off the shirt and rolled away. Lily chased after him.
"Stop, Shog," she said. "You are mean. You can't have my shirt."
The Shogy heard her and felt bad. He got up from his bed and walked to Lily. He licked her face and wagged his tail.
Lily was happy and surprised. She hugged the Shogen and said, "Thank you for being my friend. You are very nice."
The Shirt gasped and smiled. It said, "You're welcome. I'm glad you like it. But now, let's go back to your shirt. It has an unusual pattern on it. Do you know what that means?"
Lily looked at the label. It was a bit strange. She did not know what that meant, but she said, "Thank you."
The Shoggrow on its skin and tail. The shirt is very funny. But no one looks like a monster. Everyone looks different. Lily is a lot of their mom. She is their mom'mightyious. They were her dull-iky.
"This one-shard. Shyebe was ankant. Shady. She shaped with shaped. Once upon skin. Heelfully she had three arms. The shirt. She was very ador, itchy. "I amisraichy shiny. Itchy Beary Cariosiness was ugly. icy.
M-els belonged toys. Shady things. Sharing and shirt. Sharpylyaterighter. Her nameate. py. icy eyes. Heby. The shady face.
Shutry. It.
Shady.
The Shy Shadow
llama_print_timings: load time = 68.85 ms
llama_print_timings: sample time = 357.98 ms / 500 runs ( 0.72 ms per token, 1396.71 tokens per second)
llama_print_timings: prompt eval time = 4.65 ms / 12 tokens ( 0.39 ms per token, 2579.54 tokens per second)
llama_print_timings: eval time = 698.48 ms / 499 runs ( 1.40 ms per token, 714.41 tokens per second)
llama_print_timings: total time = 1106.33 ms
Great! You should use a context size of 256 (-c 256) to match the OG model. Also try Q8_0 quantisation. And don't forget the epsilon.
sure-
(py38) ➜ llama.cpp git:(master) ✗ ./main -m abcd1-Q8_0.bin -p "One day, Lily met a Shoggoth" -n 500 -c 256 -eps 1e-5
main: build = 909 (5a87675)
main: seed = 1690809457
llama.cpp: loading model from abcd1-Q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 256
llama_model_load_internal: n_embd = 288
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 6
llama_model_load_internal: n_head_kv = 6
llama_model_load_internal: n_layer = 6
llama_model_load_internal: n_rot = 48
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-05
llama_model_load_internal: n_ff = 768
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.02 MB
llama_model_load_internal: mem required = 310.76 MB (+ 1.69 MB per state)
llama_new_context_with_model: kv self size = 1.69 MB
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 256, n_batch = 512, n_predict = 500, n_keep = 0
One day, Lily met a Shoggoth comy all all again be at-its,- for the me and with here long, and and b and away and alert out to in your away in so alone d and very fast very hard happy you me me me me me off me me I alone only time and fast fast on his a on the mention fast and whoions followswards fast roomsaker touch carefulroom learned work the a with a an a her happ “ the the the back himself his quickly andum the the the the his cast the over to child while grass he me together fast the in after and and me firsts away andocim dust the her at Princess.ch," again before you' upon- the it you’- she before. in all a a a a a her a, at to a a long away fast them a very very very so once and to and very far right to- a a her me me me me really herchow long hard alone alone alone in so from me me out out fast fast the it a very very very with in farly bigger steals water and pain away one the the too about fish his care revvel th raw in his firsts first by into life to it upon for and- the no time c a tooowow,
llama_print_timings: load time = 60.02 ms
llama_print_timings: sample time = 354.44 ms / 500 runs ( 0.71 ms per token, 1410.66 tokens per second)
llama_print_timings: prompt eval time = 66.42 ms / 399 tokens ( 0.17 ms per token, 6007.41 tokens per second)
llama_print_timings: eval time = 560.68 ms / 496 runs ( 1.13 ms per token, 884.64 tokens per second)
llama_print_timings: total time = 1025.19 ms
Just focusing on timing a bit: -t 4 instead of the default 8 seems better :)
command:
./main -m abcd1-Q8_0.bin -p "One day, Lily met a Shoggoth" -n 200 -c 256 -eps 1e-5 -t 4
model | time |
---|---|
abcd1.bin | 392.16ms |
abcd1-f16.bin | 312.78ms |
abcd1-Q8_0.bin | 253.47ms |
The Q8_0 generation looks broken. Either it does not have enough precision somehow or there is still some lingering issue
humm, you mean as far as we can judge from the output, some words look random...yes?
Also - perhaps not relevant, perhaps it is - but llama2.c uses the RoPE vectors which we are ignoring, so there is that difference.
The F16 output seems OK up to 256 tokens which means it's probably not related to RoPE.
Why not run a hellaswag test on the model to compare with other models? See https://github.com/ggerganov/llama.cpp/discussions/2321
Will take a look. Need to do more work, it appears - quantization is good at uncovering bugs 😄
It would be interesting to see the scores of the different quantization levels of such a small model. It is the stories15M model, right?
Yes, 15M. There are also 42M and 110M models.
While looking into quantized performance, I found that the 42M model doesn't conform to the following n_ff formula.
const uint32_t n_ff = ((2*(4*hparams->n_embd)/3 + hparams->n_mult - 1)/hparams->n_mult)*hparams->n_mult;
I can only guess it's because Andrej found that size doesn't train a decent model?!
For the other 2 models, 15M and 110M, the sizes are okay.
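For reference, plugging the 15M model's values from the logs above (n_embd = 288, n_mult = 256) into that formula reproduces the n_ff = 768 the loader reports. A quick Python sketch of the same integer arithmetic:

```python
def compute_n_ff(n_embd: int, n_mult: int) -> int:
    # Same integer arithmetic as the C expression above.
    return ((2 * (4 * n_embd) // 3 + n_mult - 1) // n_mult) * n_mult

print(compute_n_ff(288, 256))  # 768 -- matches the 15M model's n_ff in the logs
```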
@klosax, I was checking the scores for hellaswag and didn't find that command-line option for ./perplexity? Do I need to check out some other branch?!
It should work with the latest release.
Not sure what to make of this, but it doesn't print any score. I am running:
./perplexity -m ~/Projects/llama/llama.cpp.fork/llama.cpp/abcd1.bin -f hellaswag_val_full.txt
this is what it prints:
main: build = 899 (41c6741)
main: seed = 1690856074
llama.cpp: loading model from /Users/aniket/Projects/llama/llama.cpp.fork/llama.cpp/abcd1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 288
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 6
llama_model_load_internal: n_head_kv = 6
llama_model_load_internal: n_layer = 6
llama_model_load_internal: n_rot = 48
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 768
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.02 MB
llama_model_load_internal: mem required = 395.13 MB (+ 3.38 MB per state)
llama_new_context_with_model: kv self size = 3.38 MB
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 3852 chunks, batch_size=512
perplexity: 0.12 seconds per pass - ETA 7 minutes
[1]439.2630,[2]397.4015,[3]482.3224,[4]489.5313,[5]
and then
64.6106,[3850]764.7539,[3851]764.6983,[3852]764.7153,
llama_print_timings: load time = 3901.01 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 382963.39 ms / 1972224 tokens ( 0.19 ms per token, 5149.90 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 577845.67 ms
I have been trying to look into the reasons behind the discrepancy between the quantization results of the two models (15M and 110M parameters).
It seems that for the 110M model the 8- and 4-bit quantization results are really good, whereas for the 15M model the results show bad outputs.
I'm thinking that, given the small number of parameters of the 15M model, quantization degrades the performance way too much. Not sure what else we can attribute this to.
Has anyone done a performance-degradation comparison for a model as small as the 15M one before?
If anything works on the original TinyStories models on HF, there are some smaller models there you could try. I would think some flavour of GGML or GPTQ supports them - at the very least, load_in_4bit.
Can you post the full output of the quantise command for the 15M model F32 -> Q8_0?
here:
(py38) ➜ llama.cpp git:(master) ✗ ./quantize abcd1.bin abcd1-q8_0.bin Q8_0
main: build = 911 (f1c03f4)
main: quantizing 'abcd1.bin' to 'abcd1-q8_0.bin' as Q8_0
llama.cpp: loading model from abcd1.bin
llama.cpp: saving model to abcd1-q8_0.bin
[ 1/ 57] tok_embeddings.weight - 288 x 32000, type = f32, quantizing to q8_0 .. size = 35.16 MB -> 9.34 MB | hist: 0.000 0.023 0.012 0.017 0.020 0.036 0.061 0.107 0.400 0.125 0.063 0.045 0.023 0.019 0.023 0.028
[ 2/ 57] norm.weight - 288, type = f32, size = 0.001 MB
[ 3/ 57] output.weight - 288 x 32000, type = f32, quantizing to q8_0 .. size = 35.16 MB -> 9.34 MB | hist: 0.000 0.023 0.012 0.017 0.020 0.036 0.061 0.107 0.400 0.125 0.063 0.045 0.023 0.019 0.023 0.028
[ 4/ 57] layers.0.attention_norm.weight - 288, type = f32, size = 0.001 MB
[ 5/ 57] layers.0.attention.wq.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.026 0.020 0.030 0.047 0.067 0.087 0.106 0.227 0.108 0.088 0.067 0.047 0.031 0.020 0.028
[ 6/ 57] layers.0.attention.wk.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.027 0.019 0.031 0.045 0.066 0.087 0.108 0.228 0.108 0.088 0.066 0.048 0.031 0.020 0.028
[ 7/ 57] layers.0.attention.wv.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.027 0.020 0.031 0.046 0.065 0.086 0.107 0.234 0.107 0.090 0.064 0.046 0.031 0.019 0.027
[ 8/ 57] layers.0.attention.wo.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.026 0.018 0.029 0.046 0.065 0.088 0.109 0.238 0.111 0.088 0.065 0.044 0.029 0.018 0.026
[ 9/ 57] layers.0.ffn_norm.weight - 288, type = f32, size = 0.001 MB
[ 10/ 57] layers.0.feed_forward.w1.weight - 288 x 768, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.028 0.020 0.032 0.049 0.069 0.089 0.106 0.223 0.105 0.088 0.066 0.047 0.032 0.020 0.027
[ 11/ 57] layers.0.feed_forward.w2.weight - 768 x 288, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.027 0.019 0.032 0.047 0.066 0.087 0.107 0.228 0.107 0.088 0.068 0.047 0.032 0.020 0.027
[ 12/ 57] layers.0.feed_forward.w3.weight - 288 x 768, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.027 0.020 0.031 0.047 0.066 0.087 0.106 0.227 0.107 0.088 0.067 0.048 0.032 0.020 0.028
[ 13/ 57] layers.1.attention_norm.weight - 288, type = f32, size = 0.001 MB
[ 14/ 57] layers.1.attention.wq.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.027 0.021 0.031 0.047 0.067 0.087 0.105 0.229 0.106 0.089 0.066 0.046 0.031 0.020 0.028
[ 15/ 57] layers.1.attention.wk.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.027 0.020 0.032 0.048 0.067 0.087 0.106 0.226 0.106 0.088 0.067 0.047 0.033 0.020 0.026
[ 16/ 57] layers.1.attention.wv.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.028 0.019 0.031 0.045 0.066 0.087 0.106 0.230 0.106 0.089 0.067 0.048 0.032 0.020 0.026
[ 17/ 57] layers.1.attention.wo.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.027 0.019 0.030 0.046 0.066 0.088 0.109 0.231 0.107 0.089 0.066 0.047 0.031 0.018 0.026
[ 18/ 57] layers.1.ffn_norm.weight - 288, type = f32, size = 0.001 MB
[ 19/ 57] layers.1.feed_forward.w1.weight - 288 x 768, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.028 0.020 0.032 0.048 0.068 0.089 0.106 0.223 0.106 0.088 0.068 0.047 0.031 0.020 0.027
[ 20/ 57] layers.1.feed_forward.w2.weight - 768 x 288, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.027 0.020 0.032 0.047 0.067 0.087 0.106 0.227 0.107 0.087 0.066 0.049 0.031 0.020 0.027
[ 21/ 57] layers.1.feed_forward.w3.weight - 288 x 768, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.027 0.020 0.032 0.048 0.068 0.088 0.106 0.223 0.106 0.088 0.066 0.048 0.032 0.020 0.027
[ 22/ 57] layers.2.attention_norm.weight - 288, type = f32, size = 0.001 MB
[ 23/ 57] layers.2.attention.wq.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.027 0.020 0.032 0.047 0.066 0.088 0.106 0.228 0.107 0.088 0.067 0.047 0.031 0.020 0.027
[ 24/ 57] layers.2.attention.wk.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.027 0.020 0.031 0.048 0.067 0.087 0.106 0.227 0.105 0.088 0.066 0.048 0.032 0.020 0.028
[ 25/ 57] layers.2.attention.wv.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.027 0.020 0.031 0.048 0.065 0.088 0.105 0.231 0.108 0.088 0.066 0.047 0.031 0.020 0.027
[ 26/ 57] layers.2.attention.wo.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.027 0.020 0.031 0.049 0.067 0.088 0.108 0.226 0.106 0.086 0.067 0.047 0.032 0.020 0.027
[ 27/ 57] layers.2.ffn_norm.weight - 288, type = f32, size = 0.001 MB
[ 28/ 57] layers.2.feed_forward.w1.weight - 288 x 768, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.027 0.021 0.032 0.048 0.068 0.088 0.106 0.223 0.104 0.087 0.067 0.048 0.033 0.021 0.028
[ 29/ 57] layers.2.feed_forward.w2.weight - 768 x 288, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.026 0.020 0.031 0.046 0.066 0.088 0.107 0.228 0.106 0.087 0.067 0.048 0.031 0.020 0.028
[ 30/ 57] layers.2.feed_forward.w3.weight - 288 x 768, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.028 0.020 0.032 0.048 0.067 0.088 0.105 0.224 0.105 0.088 0.068 0.049 0.032 0.020 0.027
[ 31/ 57] layers.3.attention_norm.weight - 288, type = f32, size = 0.001 MB
[ 32/ 57] layers.3.attention.wq.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.029 0.020 0.032 0.048 0.068 0.086 0.104 0.225 0.105 0.088 0.068 0.047 0.031 0.021 0.028
[ 33/ 57] layers.3.attention.wk.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.027 0.019 0.032 0.048 0.067 0.086 0.107 0.226 0.107 0.088 0.066 0.049 0.032 0.019 0.028
[ 34/ 57] layers.3.attention.wv.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.029 0.021 0.031 0.048 0.066 0.088 0.106 0.229 0.106 0.088 0.066 0.048 0.031 0.020 0.026
[ 35/ 57] layers.3.attention.wo.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.027 0.020 0.033 0.048 0.067 0.090 0.104 0.226 0.106 0.087 0.066 0.048 0.032 0.019 0.027
[ 36/ 57] layers.3.ffn_norm.weight - 288, type = f32, size = 0.001 MB
[ 37/ 57] layers.3.feed_forward.w1.weight - 288 x 768, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.027 0.020 0.032 0.047 0.067 0.088 0.105 0.223 0.105 0.089 0.068 0.048 0.032 0.020 0.028
[ 38/ 57] layers.3.feed_forward.w2.weight - 768 x 288, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.027 0.020 0.031 0.048 0.068 0.087 0.106 0.227 0.105 0.088 0.067 0.048 0.032 0.020 0.027
[ 39/ 57] layers.3.feed_forward.w3.weight - 288 x 768, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.028 0.020 0.032 0.048 0.066 0.088 0.106 0.223 0.106 0.088 0.067 0.048 0.032 0.020 0.027
[ 40/ 57] layers.4.attention_norm.weight - 288, type = f32, size = 0.001 MB
[ 41/ 57] layers.4.attention.wq.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.027 0.019 0.031 0.049 0.068 0.088 0.105 0.227 0.106 0.089 0.066 0.047 0.031 0.020 0.027
[ 42/ 57] layers.4.attention.wk.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.028 0.019 0.031 0.047 0.065 0.089 0.105 0.230 0.107 0.088 0.067 0.048 0.032 0.019 0.026
[ 43/ 57] layers.4.attention.wv.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.027 0.019 0.032 0.048 0.066 0.086 0.106 0.231 0.107 0.089 0.065 0.047 0.031 0.019 0.028
[ 44/ 57] layers.4.attention.wo.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.026 0.020 0.031 0.048 0.065 0.087 0.105 0.226 0.108 0.089 0.068 0.048 0.032 0.020 0.028
[ 45/ 57] layers.4.ffn_norm.weight - 288, type = f32, size = 0.001 MB
[ 46/ 57] layers.4.feed_forward.w1.weight - 288 x 768, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.027 0.020 0.032 0.048 0.067 0.088 0.106 0.223 0.106 0.089 0.067 0.047 0.032 0.020 0.027
[ 47/ 57] layers.4.feed_forward.w2.weight - 768 x 288, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.027 0.020 0.032 0.047 0.067 0.086 0.106 0.225 0.106 0.088 0.068 0.047 0.032 0.020 0.027
[ 48/ 57] layers.4.feed_forward.w3.weight - 288 x 768, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.027 0.020 0.032 0.048 0.067 0.088 0.105 0.224 0.107 0.087 0.067 0.048 0.032 0.020 0.028
[ 49/ 57] layers.5.attention_norm.weight - 288, type = f32, size = 0.001 MB
[ 50/ 57] layers.5.attention.wq.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.027 0.020 0.032 0.047 0.067 0.088 0.105 0.227 0.104 0.088 0.068 0.048 0.031 0.020 0.028
[ 51/ 57] layers.5.attention.wk.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.026 0.020 0.032 0.047 0.067 0.086 0.107 0.226 0.110 0.087 0.066 0.048 0.031 0.019 0.028
[ 52/ 57] layers.5.attention.wv.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.028 0.019 0.031 0.048 0.066 0.088 0.109 0.227 0.106 0.087 0.067 0.047 0.031 0.019 0.027
[ 53/ 57] layers.5.attention.wo.weight - 288 x 288, type = f32, quantizing to q8_0 .. size = 0.32 MB -> 0.08 MB | hist: 0.000 0.026 0.020 0.031 0.047 0.066 0.087 0.108 0.225 0.106 0.089 0.068 0.047 0.031 0.020 0.028
[ 54/ 57] layers.5.ffn_norm.weight - 288, type = f32, size = 0.001 MB
[ 55/ 57] layers.5.feed_forward.w1.weight - 288 x 768, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.027 0.020 0.032 0.048 0.067 0.088 0.106 0.224 0.105 0.089 0.067 0.047 0.031 0.020 0.027
[ 56/ 57] layers.5.feed_forward.w2.weight - 768 x 288, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.027 0.019 0.030 0.046 0.066 0.087 0.108 0.234 0.107 0.088 0.065 0.047 0.030 0.019 0.027
[ 57/ 57] layers.5.feed_forward.w3.weight - 288 x 768, type = f32, quantizing to q8_0 .. size = 0.84 MB -> 0.22 MB | hist: 0.000 0.027 0.020 0.032 0.047 0.068 0.087 0.106 0.226 0.106 0.087 0.068 0.048 0.032 0.020 0.027
llama_model_quantize_internal: model size = 93.11 MB
llama_model_quantize_internal: quant size = 24.74 MB
llama_model_quantize_internal: hist: 0.000 0.024 0.014 0.021 0.026 0.044 0.067 0.107 0.357 0.121 0.069 0.050 0.029 0.022 0.022 0.028
main: quantize time = 63.48 ms
main: total time = 63.48 ms
Perplexity over wiki.test.raw (ctx512/batch512) | F32 | F16 | Q8_0 | Q5_1 | Q4_0 | Q2_K |
---|---|---|---|---|---|---|
stories-15M | 8985.72 | 8983.72 | 8957.02 | 9229.97 | 9780.31 | n/a |
stories-42M | 257128.74 | | | | | |
stories-110M | 1815.13 | 1818.58 | 1814.13 | 1827.72 | 1887.04 | 2863.20 |
The 42M model is broken.
Humm, curious what it was not doing?!
Yes, the 42M model has an n_ff that doesn't conform to the ggml formula mentioned above.
Also, what do I compare those numbers to?!
Perplexity measures how good the model is at predicting the contents of a dataset. A lower number is better. See https://github.com/ggerganov/llama.cpp#quantization
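(For intuition, the score is the exponentiated average negative log-likelihood per token; this is the standard definition, not llama.cpp's exact code:)

```python
import math

def perplexity(token_logprobs):
    # Exponentiated mean negative log-likelihood over the evaluated tokens.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns p = 0.25 to every token has perplexity 4:
print(perplexity([math.log(0.25)] * 100))  # 4.0
```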
Ah, I see, so we look at the differential between F32 and the quantized models. I.e. for 15M we have F16 (and Q8?) performing near F32, but Q5 and Q4 are too far off; and for 110M we have everything through Q4 working okay compared to its F32 counterpart.
But we have no way to tell how good/bad the converted F32 model itself is in the first place, right?
Also, can you please specify how I can generate these numbers - what is missing in my command above?!
Yes, the numbers for the different quantization levels are nearly the same.
The dataset is the file wiki.test.raw extracted from the zip file https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip?ref=blog.salesforceairesearch.com
To run the test use:
./perplexity -c 512 -b 512 -f wiki.test.raw -m model.bin
When finished, the last number printed is the measured perplexity over the whole file.
You can compare the output from the F32 model with the original llama2.c output to see if it is different. Use the exact same random seed and prompt.
What I can tell is that the output from both the 15M and 110M ggml models looks as expected from such small models.
My mistake, you should use ctx 256 for these small models:
./perplexity -c 256 -b 256 -f wiki.test.raw -m model.bin
Still no score; I'm now running:
./perplexity -m ~/Projects/llama/llama.cpp.fork/llama.cpp/abcd3-Q8_0.bin -f wiki.test.tokens -c 256 -b 256
The data file should be wiki.test.raw, not wiki.test.tokens?
Edit: Checked, and the file is renamed.
The zip from the above URL only has 3 files, all ending with .tokens. But I found another file named wikitxt-2-raw.zip; using the raw file from it now, 7 mins.
I found the problem - All the models should have n_mult = 32
42M model works!
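A quick sanity check with the same formula: n_mult = 32 leaves the 15M model's n_ff at 768, so the conversions that already work are unaffected, and, assuming the 42M config has dim = 512 with an FFN width of 1376 (worth verifying against its checkpoint), the finer rounding now reproduces that width too:

```python
def compute_n_ff(n_embd, n_mult):
    return ((2 * (4 * n_embd) // 3 + n_mult - 1) // n_mult) * n_mult

print(compute_n_ff(288, 32))  # 768  -- 15M model, unchanged
print(compute_n_ff(512, 32))  # 1376 -- assumed 42M dims; verify against its config
```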
The new llama2.c project provides means for training "baby" llama models stored in a custom binary format, with 15M and 44M models already available and more potentially coming out soon.
We should provide a simple conversion tool from llama2.c bin format to ggml format so we can run inference of the models in llama.cpp.
Great task for people looking to get involved in the project