ggerganov / llama.cpp

LLM inference in C/C++
MIT License

llama : add support for llama2.c models #2379

Closed ggerganov closed 1 year ago

ggerganov commented 1 year ago

The new llama2.c project provides a means for training "baby" llama models stored in a custom binary format, with 15M and 44M models already available and more potentially coming out soon.

We should provide a simple conversion tool from the llama2.c bin format to the ggml format so we can run inference on these models in llama.cpp.

Great task for people looking to get involved in the project
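
For anyone picking this up: the llama2.c checkpoint is just a small header followed by the raw weights, so a converter can start by reading that header. Below is a minimal sketch, assuming the Config struct layout used by llama2.c's run.c at the time (seven int32 fields, then the weights as contiguous float32 arrays); verify the field order against the source.

```c
// Minimal sketch: read the llama2.c checkpoint header.
// The Config layout is assumed from llama2.c's run.c and should be verified there.
#include <stdint.h>
#include <stdio.h>

typedef struct {
    int32_t dim;        // transformer dimension (n_embd)
    int32_t hidden_dim; // feed-forward hidden dimension (n_ff)
    int32_t n_layers;
    int32_t n_heads;
    int32_t n_kv_heads;
    int32_t vocab_size;
    int32_t seq_len;    // max context length
} Config;

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.bin\n", argv[0]); return 1; }

    FILE * f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    Config cfg;
    if (fread(&cfg, sizeof(cfg), 1, f) != 1) { fclose(f); return 1; }

    printf("dim=%d hidden_dim=%d n_layers=%d n_heads=%d n_kv_heads=%d vocab=%d seq_len=%d\n",
           cfg.dim, cfg.hidden_dim, cfg.n_layers, cfg.n_heads, cfg.n_kv_heads,
           cfg.vocab_size, cfg.seq_len);

    // The weights follow as flat float32 arrays; the converter's job is to
    // re-label them as ggml tensors with the right names and shapes.
    fclose(f);
    return 0;
}
```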

jagtesh commented 1 year ago

I can take a stab at it. Been meaning to dive deeper into the GGML format.

Since convert.py only does GGML conversion and quantize is called explicitly for quantization, in theory only convert.py will need to be modified.

Would an existing model (HF/PyTorch) serve as a good starting point?

slaren commented 1 year ago

Trying to add this to convert.py may be overkill, and a lot harder than it needs to be. Writing a standalone script would probably be a lot easier.

The easiest to understand description of the file format is probably in the training example here: https://github.com/ggerganov/llama.cpp/blob/41c674161fb2459bdf7806d1eebead15bc5d046e/examples/train-text-from-scratch/train-text-from-scratch.cpp#L2609

Mistobaan commented 1 year ago

@ggerganov Why not use the safetensors format? It seems way more practical than custom binary ggml formats.

klosax commented 1 year ago

@Mistobaan See this note in the spec of the upcoming gguf file format, gguf.md#why-not-other-formats, and PR https://github.com/ggerganov/ggml/pull/302

byte-6174 commented 1 year ago

Began a super-WIP (not completely functional) attempt at this here. Will update as I go along, piecing together the corresponding variables between the two.

jagtesh commented 1 year ago

@byte-6174 nice - I took a similar approach. I'm currently deep in the rabbit hole of converting llama2.c tensors, which use calloc-based memory allocation (with no metadata, afaik), to ggml_tensor.

Also, I don't quite understand yet why the vocab needs to be saved in the model when it is also available in an external file. In any case, I believe llama2.c is using the exact same format for the vocab.

I’ll end this update with a few words in the voice of Yoda, “Long journey, it is. Learn, I must.”
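
For what it's worth, the calloc'd buffers don't really need metadata on the llama2.c side: every shape is implied by the Config header, so each flat float array can just be copied into a ggml tensor of the matching shape. A minimal sketch follows, assuming a ggml_context has already been created with enough memory, and remembering that ggml's ne[0] is the contiguous, innermost dimension; the helper name is made up for illustration.

```c
// Sketch: copy one of llama2.c's flat, row-major [rows][cols] float arrays
// into a ggml F32 tensor. ne[0] = cols is the contiguous dimension.
#include <string.h>

#include "ggml.h"

static struct ggml_tensor * copy_matrix(struct ggml_context * ctx,
                                        const float * src,
                                        int cols, int rows,
                                        const char * name) {
    struct ggml_tensor * t = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, cols, rows);
    memcpy(t->data, src, ggml_nbytes(t));
    ggml_set_name(t, name);
    return t;
}
```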

byte-6174 commented 1 year ago

Here. Also, see the mapping that was required to figure out how to match the variables 🙂

This reads the llama2.c model file and saves all weights in a ggml-compatible tensor format. Let me know how it works...

byte-6174 commented 1 year ago

Re: the mapping, can someone with more experience with llama.cpp tensors point out how these RoPE tensors should be mapped? They are indicated with ? in the mapping md file above ☝️

slaren commented 1 year ago

Not 100% sure, but I believe these are lookup tables for RoPE and are not necessary for llama.cpp.
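
For context: llama2.c stores precomputed freq_cis_real / freq_cis_imag tables, while llama.cpp applies the rotation at graph-build time via ggml_rope, so those tables never need to be exported. A rough sketch of what the tables encode (the standard RoPE rotation, not llama.cpp's actual kernel):

```c
// Rough sketch of the rotation that llama2.c's freq_cis tables precompute,
// done here on the fly. `x` is one attention head of size head_dim and
// `pos` is the token position.
#include <math.h>

static void rope_apply(float * x, int head_dim, int pos) {
    for (int i = 0; i < head_dim; i += 2) {
        const float theta = powf(10000.0f, -(float)i / (float)head_dim);
        const float c = cosf(pos * theta);
        const float s = sinf(pos * theta);
        const float x0 = x[i];
        const float x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}
```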

byte-6174 commented 1 year ago

Right, I looked at the llama2.c code and it's definitely for RoPE; good to know it's not needed for llama.cpp, so I can remove it. I now want to run this model. Any pointers? Digging now...

ggerganov commented 1 year ago

You can run the most basic inference using: ./main -m converted-model.bin

byte-6174 commented 1 year ago

Got it, it runs pretty well! Perhaps now I can quantize it... The output seems to contain non-English words... hmm.

It gives 359 tok/sec vs. llama2.c's ~100 tok/sec (perhaps not a fair comparison though).


main: build = 909 (5a87675)
main: seed  = 1690733963
llama.cpp: loading model from abcd1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  395.13 MB (+    3.38 MB per state)
llama_new_context_with_model: kv self size  =    3.38 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 200, n_keep = 0

 One day, Lily met a Shoggoth anyonefather whilem enjo but gl than the acc sn ahead nap very off ru sc things People table fasterer beet wants hiding anywheresklär customer al Contact round weeks sad sick someone somethingore tagBeil as attached where mine recommend ourselvesanalipirst leako each each that reck deal laterchy pair Then dar all blocking Whenrai accomplished yourself backrain this Hamb dr slehetableView short b looked alsoch goal d who down down downnight check out bird calows helper home walliling patch agree je either of whole cl heavily visited visit p up up up up up off headD vCODEBack Withoutotoss timeys pat On new warm compared corner kitcheniously front раз happenionungsplaceĕiveness belong slower run running toten replace a away, MangWe these mention topakesFound herself down if courageished facts rocksepar hear oneessU as b A meanshed turn I Or laugh save without out was destroyunt doA bur byണ University shop' the chance alone alone alone alone someone particular
llama_print_timings:        load time =   115.33 ms
llama_print_timings:      sample time =   143.12 ms /   200 runs   (    0.72 ms per token,  1397.48 tokens per second)
llama_print_timings: prompt eval time =    10.21 ms /    12 tokens (    0.85 ms per token,  1175.20 tokens per second)
llama_print_timings:        eval time =   409.92 ms /   199 runs   (    2.06 ms per token,   485.46 tokens per second)
llama_print_timings:       total time =   581.63 ms
byte-6174 commented 1 year ago

Hmm, one other explanation for the "non-English" words could be that the vocabs don't match.

ggerganov commented 1 year ago

Nah, something else is wrong. First, try adding -eps 1e-5 to match the RMS norm implementation.
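
The epsilon matters because it sits inside the normalization denominator: llama2.c hard-codes 1e-5 while llama.cpp defaults to 1e-6 here. Roughly, RMS norm looks like this (a sketch of the standard formula, not the exact ggml kernel):

```c
// Sketch of RMS norm showing where eps enters the denominator.
#include <math.h>

static void rmsnorm(float * out, const float * x, const float * weight,
                    int n, float eps) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) {
        ss += x[i] * x[i];
    }
    const float scale = 1.0f / sqrtf(ss / n + eps);
    for (int i = 0; i < n; i++) {
        out[i] = weight[i] * (x[i] * scale);
    }
}
```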

byte-6174 commented 1 year ago

The default run above has eps = 1e-6; this 👇 is with 1e-5 as you suggest:


main: build = 909 (5a87675)
main: seed  = 1690736050
llama.cpp: loading model from abcd1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-05
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  395.13 MB (+    3.38 MB per state)
llama_new_context_with_model: kv self size  =    3.38 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 200, n_keep = 0

 One day, Lily met a Shoggothvenatience quicker d looks on des glassappendoredpping early=( along lap ji our Your back of older shwn which happen out tried valleives world swins stessa chairaringtain takes one vendinganceric butYesmy set choiceveryely Next crack each each that everywhereph front downfжда fought feelions enjoy surezy sho r straightstraßeruityoll fesewer closed two v homes wider restaurant fed goerg free you kA basedShowTherevelopeiestpl too came glolass happ thoughtrefixomyber becomingselves keptored by tw belonged home stad watch hop exchange guys simplest situation Today looked on westWhere the disappawvenite tall sameve 'lowrownyn Tom they visitors card block laughed it continued decision anywhereMe justero fresh squ fastert whoseWhere herself startches overOfass bet'urr meps precedsedoc inspired about describe sharing l saidraiслав between silentselves warm thisEV eight plateing wish knewizeS different pop any anything burn graO response
llama_print_timings:        load time =    43.53 ms
llama_print_timings:      sample time =   143.49 ms /   200 runs   (    0.72 ms per token,  1393.83 tokens per second)
llama_print_timings: prompt eval time =     5.99 ms /    12 tokens (    0.50 ms per token,  2003.34 tokens per second)
llama_print_timings:        eval time =   566.60 ms /   199 runs   (    2.85 ms per token,   351.22 tokens per second)
llama_print_timings:       total time =   734.73 ms
byte-6174 commented 1 year ago

I'm printing the model->norm that is saved in ggml vs. w->rms_final_weight from llama2.c, and they seem to match.

llama2.c first 5 elements of w->rms_final_weight >> 7.676849 7.187980 9.270302 6.815886 7.080070

vs. ggml's model-> norm >> 7.676849 7.187980 9.270302 6.815886 7.080070

ggerganov commented 1 year ago

We haven't run any F32 models with llama.cpp yet, so it is possible that there is a bug specific to the F32 format that we haven't observed. To rule out this possibility, try converting the model to F16 with the following command:

./quantize abcd1.bin abcd1-f16.bin f16

And see if the new abcd1-f16.bin model also outputs nonsense.

byte-6174 commented 1 year ago

Yes, still nonsense. I'm currently investigating how the FF weights w1, w2, w3 are laid out in memory. In llama2.c we have: w1: layer x hidden_dim x dim, w2: layer x dim x hidden_dim, w3: layer x hidden_dim x dim.

Looking more to see if I'm making a mistake in putting them into the tensors in the right order...
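
For reference, the layout question mostly comes down to per-layer offsets into those flat arrays plus getting ggml's dimension order right: ggml's ne[0] is the contiguous dimension, so a llama2.c matrix stored as [hidden_dim][dim] with dim contiguous becomes a tensor with ne0 = dim (288 for this model) and ne1 = hidden_dim (768 for this model). A sketch for w1, building on the copy helper above and using a made-up helper name, not the converter's actual code:

```c
// Sketch: slice layer `l`'s w1 out of llama2.c's flat [layer][hidden_dim][dim]
// array and copy it into a ggml tensor of shape ne0 = dim, ne1 = hidden_dim.
#include <string.h>

#include "ggml.h"

static struct ggml_tensor * layer_w1(struct ggml_context * ctx,
                                     const float * w1_all,
                                     int l, int dim, int hidden_dim) {
    const float * src = w1_all + (size_t) l * hidden_dim * dim;
    struct ggml_tensor * t = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, dim, hidden_dim);
    memcpy(t->data, src, ggml_nbytes(t));
    return t;
}
```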

byte-6174 commented 1 year ago

Aah! I found a bug: I was not using the right multiplier when reshaping the 1D arrays from llama2.c into 2D arrays in ggml. The output looks much better now and is comparable to what we get from llama2.c!

(py38) ➜  llama.cpp git:(master) ✗ ./main -m abcd1.bin -p "One day, Lily met a Shoggoth" -n 200
main: build = 909 (5a87675)
main: seed  = 1690770877
llama.cpp: loading model from abcd1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  395.13 MB (+    3.38 MB per state)
llama_new_context_with_model: kv self size  =    3.38 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 200, n_keep = 0

 One day, Lily met a Shoggoth. It was big and shiny and very original. Lily asked the Shoggamkeeper what it was.
"It's a special kind of toy," the Shter master said. "But it's mine, not yours."
Lily touched the shirt and said, "Please take care of it?"
The Shapeububber was very happy to have this special toy. He granted Lily some money and told her to do as he asked.
Lily thanked him and took the shirt home with her. She looked at it and saw that it had a big number 1 on it. She was so excited!
"Thank you for saving me," she said to the Shapehanger. "I will take good care of this toy from now on."
The Shapehub smiled as he watched Lily keep her special toy. Once upon a time, there was a little girl named L
llama_print_timings:        load time =    64.88 ms
llama_print_timings:      sample time =   143.15 ms /   200 runs   (    0.72 ms per token,  1397.16 tokens per second)
llama_print_timings: prompt eval time =     9.08 ms /    12 tokens (    0.76 ms per token,  1321.59 tokens per second)
llama_print_timings:        eval time =   340.74 ms /   199 runs   (    1.71 ms per token,   584.03 tokens per second)
llama_print_timings:       total time =   510.47 ms
byte-6174 commented 1 year ago

And here with the F16-quantized model:

(py38) ➜  llama.cpp git:(master) ✗ ./main -m abcd1-f16.bin -p "One day, Lily met a Shoggoth" -n 500
main: build = 909 (5a87675)
main: seed  = 1690770940
llama.cpp: loading model from abcd1-f16.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  348.58 MB (+    3.38 MB per state)
llama_new_context_with_model: kv self size  =    3.38 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 500, n_keep = 0

 One day, Lily met a Shoggoth. It was very small and shiny and had many buttons on it. Lily liked the shirt and smiled at the shaving Shog.
"Wow," she said. "That's an unusual shirt. Can I try it?"
The Shapey said, "No, this is my shirt. You can't touch it. It's mine."
Lily was sad and angry. She tried to take the shirt from the Shoggborn. The Shogocin fell off the shirt and rolled away. Lily chased after him.
"Stop, Shog," she said. "You are mean. You can't have my shirt."
The Shogy heard her and felt bad. He got up from his bed and walked to Lily. He licked her face and wagged his tail.
Lily was happy and surprised. She hugged the Shogen and said, "Thank you for being my friend. You are very nice."
The Shirt gasped and smiled. It said, "You're welcome. I'm glad you like it. But now, let's go back to your shirt. It has an unusual pattern on it. Do you know what that means?"
Lily looked at the label. It was a bit strange. She did not know what that meant, but she said, "Thank you."
The Shoggrow on its skin and tail. The shirt is very funny. But no one looks like a monster. Everyone looks different. Lily is a lot of their mom. She is their mom'mightyious. They were her dull-iky.
"This one-shard. Shyebe was ankant. Shady. She shaped with shaped. Once upon skin. Heelfully she had three arms. The shirt. She was very ador, itchy. "I amisraichy shiny. Itchy Beary Cariosiness was ugly. icy.
M-els belonged toys. Shady things. Sharing and shirt. Sharpylyaterighter. Her nameate. py. icy eyes. Heby. The shady face.
Shutry. It.
Shady.
The Shy Shadow
llama_print_timings:        load time =    68.85 ms
llama_print_timings:      sample time =   357.98 ms /   500 runs   (    0.72 ms per token,  1396.71 tokens per second)
llama_print_timings: prompt eval time =     4.65 ms /    12 tokens (    0.39 ms per token,  2579.54 tokens per second)
llama_print_timings:        eval time =   698.48 ms /   499 runs   (    1.40 ms per token,   714.41 tokens per second)
llama_print_timings:       total time =  1106.33 ms
ggerganov commented 1 year ago

Great! You should use a context size of 256 (-c 256) to match the OG model. You can also try Q8_0 quantisation. And don't forget the epsilon.

byte-6174 commented 1 year ago

sure-

(py38) ➜  llama.cpp git:(master) ✗ ./main -m abcd1-Q8_0.bin -p "One day, Lily met a Shoggoth" -n 500 -c 256 -eps 1e-5
main: build = 909 (5a87675)
main: seed  = 1690809457
llama.cpp: loading model from abcd1-Q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 256
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-05
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  310.76 MB (+    1.69 MB per state)
llama_new_context_with_model: kv self size  =    1.69 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 256, n_batch = 512, n_predict = 500, n_keep = 0

 One day, Lily met a Shoggoth comy all all again be at-its,- for the me and with here long, and and b and away and alert out to in your away in so alone d and very fast very hard happy you me me me me me off me me I alone only time and fast fast on his a on the mention fast and whoions followswards fast roomsaker touch carefulroom learned work the a with a an a her happ “ the the the back himself his quickly andum the the the the his cast the over to child while grass he me together fast the in after and and me firsts away andocim dust the her at Princess.ch," again before you' upon- the it you’- she before. in all a a a a a her a, at to a a long away fast them a very very very so once and to and very far right to- a a her me me me me really herchow long hard alone alone alone in so from me me out out fast fast the it a  very very very with in farly bigger steals water and pain away one the the too about fish his care revvel th raw in his firsts first by into life to it upon for and- the no time c a tooowow,
llama_print_timings:        load time =    60.02 ms
llama_print_timings:      sample time =   354.44 ms /   500 runs   (    0.71 ms per token,  1410.66 tokens per second)
llama_print_timings: prompt eval time =    66.42 ms /   399 tokens (    0.17 ms per token,  6007.41 tokens per second)
llama_print_timings:        eval time =   560.68 ms /   496 runs   (    1.13 ms per token,   884.64 tokens per second)
llama_print_timings:       total time =  1025.19 ms
byte-6174 commented 1 year ago

Just focusing on timing a bit: -t 4 instead of the default 8 seems better :) Command: ./main -m abcd1-Q8_0.bin -p "One day, Lily met a Shoggoth" -n 200 -c 256 -eps 1e-5 -t 4

model            time
abcd1.bin        392.16 ms
abcd1-f16.bin    312.78 ms
abcd1-Q8_0.bin   253.47 ms
ggerganov commented 1 year ago

The Q8_0 generation looks broken. Either it does not have enough precision somehow, or there is still some lingering issue.

byte-6174 commented 1 year ago

Hmm, you mean that, as far as we can judge from the output, some words look random... yes?

byte-6174 commented 1 year ago

Also (perhaps not relevant, perhaps it is), llama2.c uses the precomputed RoPE vectors, which we are ignoring, so there is that difference.

ggerganov commented 1 year ago

The F16 output seems OK up to 256 tokens, which means it's probably not related to RoPE.

klosax commented 1 year ago

Why not run a hellaswag test on the model to compare with other models? See https://github.com/ggerganov/llama.cpp/discussions/2321

byte-6174 commented 1 year ago

Will take a look. It appears more work is needed, as quantization is good at uncovering bugs 😄

klosax commented 1 year ago

It would be interesting to see the scores at different quantization levels for such a small model. It is the stories15M model, right?

byte-6174 commented 1 year ago

Yes, 15M. There are also 42M and 110M models.

While looking into quantized performance, I found that the 42M model doesn't conform to the following n_ff formula: const uint32_t n_ff = ((2*(4*hparams->n_embd)/3 + hparams->n_mult - 1)/hparams->n_mult)*hparams->n_mult;

I can only guess it's because Andrej found that size doesn't train a decent model?!

For the other two models, 15M and 110M, the sizes are okay.
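
As a sanity check, plugging the 15M model's numbers from the logs above (n_embd = 288, n_mult = 256) into that formula does reproduce the n_ff = 768 reported by the loader; a quick standalone check:

```c
// Quick check of the n_ff formula against the 15M hparams from the logs:
// n_embd = 288, n_mult = 256 should give n_ff = 768.
#include <stdint.h>
#include <stdio.h>

static uint32_t calc_n_ff(uint32_t n_embd, uint32_t n_mult) {
    return ((2*(4*n_embd)/3 + n_mult - 1)/n_mult)*n_mult;
}

int main(void) {
    printf("15M: n_ff = %u\n", calc_n_ff(288, 256)); // prints 768, matches the logs
    return 0;
}
```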

@klosax, I was going to check the HellaSwag scores but didn't find that command-line option for ./perplexity. Do I need to check out some other branch?!

klosax commented 1 year ago

It should work with the latest release.

byte-6174 commented 1 year ago

Not sure what to make of this, but it doesn't print any score. I am running: ./perplexity -m ~/Projects/llama/llama.cpp.fork/llama.cpp/abcd1.bin -f hellaswag_val_full.txt

this is what it prints:

main: build = 899 (41c6741)
main: seed  = 1690856074
llama.cpp: loading model from /Users/aniket/Projects/llama/llama.cpp.fork/llama.cpp/abcd1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 288
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 6
llama_model_load_internal: n_head_kv  = 6
llama_model_load_internal: n_layer    = 6
llama_model_load_internal: n_rot      = 48
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 768
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 0 (all F32)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.02 MB
llama_model_load_internal: mem required  =  395.13 MB (+    3.38 MB per state)
llama_new_context_with_model: kv self size  =    3.38 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 3852 chunks, batch_size=512
perplexity: 0.12 seconds per pass - ETA 7 minutes
[1]439.2630,[2]397.4015,[3]482.3224,[4]489.5313,[5]

and then

64.6106,[3850]764.7539,[3851]764.6983,[3852]764.7153,

llama_print_timings:        load time =  3901.01 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 382963.39 ms / 1972224 tokens (    0.19 ms per token,  5149.90 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 577845.67 ms
byte-6174 commented 1 year ago

I have been trying to look into the reasons behind the discrepancy between the quantization results for the two models, 15M and 110M parameters.

It seems that for the 110M model the 8-bit and 4-bit quantization results are really good, whereas for the 15M model the results show bad outputs.

I'm thinking that, given the small number of parameters in the 15M model, quantization degrades the performance way too much. Not sure what else we can attribute this to.

Has anyone done a performance-degradation comparison for a model as small as the 15M one before?

TechnotechGit commented 1 year ago

If anything works with the original TinyStories models on HF, there are some smaller models there you could try. I would think some flavour of GGML or GPTQ supports them, or at the very least load_in_4bit.

ggerganov commented 1 year ago

Can you post the full output of the quantise command for the 15M model F32 -> Q8_0?

byte-6174 commented 1 year ago

here:

(py38) ➜  llama.cpp git:(master) ✗ ./quantize abcd1.bin  abcd1-q8_0.bin Q8_0
main: build = 911 (f1c03f4)
main: quantizing 'abcd1.bin' to 'abcd1-q8_0.bin' as Q8_0
llama.cpp: loading model from abcd1.bin
llama.cpp: saving model to abcd1-q8_0.bin
[   1/  57]                tok_embeddings.weight -      288 x 32000, type =    f32, quantizing to q8_0 .. size =    35.16 MB ->     9.34 MB | hist: 0.000 0.023 0.012 0.017 0.020 0.036 0.061 0.107 0.400 0.125 0.063 0.045 0.023 0.019 0.023 0.028
[   2/  57]                          norm.weight -              288, type =    f32, size =    0.001 MB
[   3/  57]                        output.weight -      288 x 32000, type =    f32, quantizing to q8_0 .. size =    35.16 MB ->     9.34 MB | hist: 0.000 0.023 0.012 0.017 0.020 0.036 0.061 0.107 0.400 0.125 0.063 0.045 0.023 0.019 0.023 0.028
[   4/  57]       layers.0.attention_norm.weight -              288, type =    f32, size =    0.001 MB
[   5/  57]         layers.0.attention.wq.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.026 0.020 0.030 0.047 0.067 0.087 0.106 0.227 0.108 0.088 0.067 0.047 0.031 0.020 0.028
[   6/  57]         layers.0.attention.wk.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.027 0.019 0.031 0.045 0.066 0.087 0.108 0.228 0.108 0.088 0.066 0.048 0.031 0.020 0.028
[   7/  57]         layers.0.attention.wv.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.027 0.020 0.031 0.046 0.065 0.086 0.107 0.234 0.107 0.090 0.064 0.046 0.031 0.019 0.027
[   8/  57]         layers.0.attention.wo.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.026 0.018 0.029 0.046 0.065 0.088 0.109 0.238 0.111 0.088 0.065 0.044 0.029 0.018 0.026
[   9/  57]             layers.0.ffn_norm.weight -              288, type =    f32, size =    0.001 MB
[  10/  57]      layers.0.feed_forward.w1.weight -      288 x   768, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.028 0.020 0.032 0.049 0.069 0.089 0.106 0.223 0.105 0.088 0.066 0.047 0.032 0.020 0.027
[  11/  57]      layers.0.feed_forward.w2.weight -      768 x   288, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.027 0.019 0.032 0.047 0.066 0.087 0.107 0.228 0.107 0.088 0.068 0.047 0.032 0.020 0.027
[  12/  57]      layers.0.feed_forward.w3.weight -      288 x   768, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.027 0.020 0.031 0.047 0.066 0.087 0.106 0.227 0.107 0.088 0.067 0.048 0.032 0.020 0.028
[  13/  57]       layers.1.attention_norm.weight -              288, type =    f32, size =    0.001 MB
[  14/  57]         layers.1.attention.wq.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.027 0.021 0.031 0.047 0.067 0.087 0.105 0.229 0.106 0.089 0.066 0.046 0.031 0.020 0.028
[  15/  57]         layers.1.attention.wk.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.027 0.020 0.032 0.048 0.067 0.087 0.106 0.226 0.106 0.088 0.067 0.047 0.033 0.020 0.026
[  16/  57]         layers.1.attention.wv.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.028 0.019 0.031 0.045 0.066 0.087 0.106 0.230 0.106 0.089 0.067 0.048 0.032 0.020 0.026
[  17/  57]         layers.1.attention.wo.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.027 0.019 0.030 0.046 0.066 0.088 0.109 0.231 0.107 0.089 0.066 0.047 0.031 0.018 0.026
[  18/  57]             layers.1.ffn_norm.weight -              288, type =    f32, size =    0.001 MB
[  19/  57]      layers.1.feed_forward.w1.weight -      288 x   768, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.028 0.020 0.032 0.048 0.068 0.089 0.106 0.223 0.106 0.088 0.068 0.047 0.031 0.020 0.027
[  20/  57]      layers.1.feed_forward.w2.weight -      768 x   288, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.027 0.020 0.032 0.047 0.067 0.087 0.106 0.227 0.107 0.087 0.066 0.049 0.031 0.020 0.027
[  21/  57]      layers.1.feed_forward.w3.weight -      288 x   768, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.027 0.020 0.032 0.048 0.068 0.088 0.106 0.223 0.106 0.088 0.066 0.048 0.032 0.020 0.027
[  22/  57]       layers.2.attention_norm.weight -              288, type =    f32, size =    0.001 MB
[  23/  57]         layers.2.attention.wq.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.027 0.020 0.032 0.047 0.066 0.088 0.106 0.228 0.107 0.088 0.067 0.047 0.031 0.020 0.027
[  24/  57]         layers.2.attention.wk.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.027 0.020 0.031 0.048 0.067 0.087 0.106 0.227 0.105 0.088 0.066 0.048 0.032 0.020 0.028
[  25/  57]         layers.2.attention.wv.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.027 0.020 0.031 0.048 0.065 0.088 0.105 0.231 0.108 0.088 0.066 0.047 0.031 0.020 0.027
[  26/  57]         layers.2.attention.wo.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.027 0.020 0.031 0.049 0.067 0.088 0.108 0.226 0.106 0.086 0.067 0.047 0.032 0.020 0.027
[  27/  57]             layers.2.ffn_norm.weight -              288, type =    f32, size =    0.001 MB
[  28/  57]      layers.2.feed_forward.w1.weight -      288 x   768, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.027 0.021 0.032 0.048 0.068 0.088 0.106 0.223 0.104 0.087 0.067 0.048 0.033 0.021 0.028
[  29/  57]      layers.2.feed_forward.w2.weight -      768 x   288, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.026 0.020 0.031 0.046 0.066 0.088 0.107 0.228 0.106 0.087 0.067 0.048 0.031 0.020 0.028
[  30/  57]      layers.2.feed_forward.w3.weight -      288 x   768, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.028 0.020 0.032 0.048 0.067 0.088 0.105 0.224 0.105 0.088 0.068 0.049 0.032 0.020 0.027
[  31/  57]       layers.3.attention_norm.weight -              288, type =    f32, size =    0.001 MB
[  32/  57]         layers.3.attention.wq.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.029 0.020 0.032 0.048 0.068 0.086 0.104 0.225 0.105 0.088 0.068 0.047 0.031 0.021 0.028
[  33/  57]         layers.3.attention.wk.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.027 0.019 0.032 0.048 0.067 0.086 0.107 0.226 0.107 0.088 0.066 0.049 0.032 0.019 0.028
[  34/  57]         layers.3.attention.wv.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.029 0.021 0.031 0.048 0.066 0.088 0.106 0.229 0.106 0.088 0.066 0.048 0.031 0.020 0.026
[  35/  57]         layers.3.attention.wo.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.027 0.020 0.033 0.048 0.067 0.090 0.104 0.226 0.106 0.087 0.066 0.048 0.032 0.019 0.027
[  36/  57]             layers.3.ffn_norm.weight -              288, type =    f32, size =    0.001 MB
[  37/  57]      layers.3.feed_forward.w1.weight -      288 x   768, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.027 0.020 0.032 0.047 0.067 0.088 0.105 0.223 0.105 0.089 0.068 0.048 0.032 0.020 0.028
[  38/  57]      layers.3.feed_forward.w2.weight -      768 x   288, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.027 0.020 0.031 0.048 0.068 0.087 0.106 0.227 0.105 0.088 0.067 0.048 0.032 0.020 0.027
[  39/  57]      layers.3.feed_forward.w3.weight -      288 x   768, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.028 0.020 0.032 0.048 0.066 0.088 0.106 0.223 0.106 0.088 0.067 0.048 0.032 0.020 0.027
[  40/  57]       layers.4.attention_norm.weight -              288, type =    f32, size =    0.001 MB
[  41/  57]         layers.4.attention.wq.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.027 0.019 0.031 0.049 0.068 0.088 0.105 0.227 0.106 0.089 0.066 0.047 0.031 0.020 0.027
[  42/  57]         layers.4.attention.wk.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.028 0.019 0.031 0.047 0.065 0.089 0.105 0.230 0.107 0.088 0.067 0.048 0.032 0.019 0.026
[  43/  57]         layers.4.attention.wv.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.027 0.019 0.032 0.048 0.066 0.086 0.106 0.231 0.107 0.089 0.065 0.047 0.031 0.019 0.028
[  44/  57]         layers.4.attention.wo.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.026 0.020 0.031 0.048 0.065 0.087 0.105 0.226 0.108 0.089 0.068 0.048 0.032 0.020 0.028
[  45/  57]             layers.4.ffn_norm.weight -              288, type =    f32, size =    0.001 MB
[  46/  57]      layers.4.feed_forward.w1.weight -      288 x   768, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.027 0.020 0.032 0.048 0.067 0.088 0.106 0.223 0.106 0.089 0.067 0.047 0.032 0.020 0.027
[  47/  57]      layers.4.feed_forward.w2.weight -      768 x   288, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.027 0.020 0.032 0.047 0.067 0.086 0.106 0.225 0.106 0.088 0.068 0.047 0.032 0.020 0.027
[  48/  57]      layers.4.feed_forward.w3.weight -      288 x   768, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.027 0.020 0.032 0.048 0.067 0.088 0.105 0.224 0.107 0.087 0.067 0.048 0.032 0.020 0.028
[  49/  57]       layers.5.attention_norm.weight -              288, type =    f32, size =    0.001 MB
[  50/  57]         layers.5.attention.wq.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.027 0.020 0.032 0.047 0.067 0.088 0.105 0.227 0.104 0.088 0.068 0.048 0.031 0.020 0.028
[  51/  57]         layers.5.attention.wk.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.026 0.020 0.032 0.047 0.067 0.086 0.107 0.226 0.110 0.087 0.066 0.048 0.031 0.019 0.028
[  52/  57]         layers.5.attention.wv.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.028 0.019 0.031 0.048 0.066 0.088 0.109 0.227 0.106 0.087 0.067 0.047 0.031 0.019 0.027
[  53/  57]         layers.5.attention.wo.weight -      288 x   288, type =    f32, quantizing to q8_0 .. size =     0.32 MB ->     0.08 MB | hist: 0.000 0.026 0.020 0.031 0.047 0.066 0.087 0.108 0.225 0.106 0.089 0.068 0.047 0.031 0.020 0.028
[  54/  57]             layers.5.ffn_norm.weight -              288, type =    f32, size =    0.001 MB
[  55/  57]      layers.5.feed_forward.w1.weight -      288 x   768, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.027 0.020 0.032 0.048 0.067 0.088 0.106 0.224 0.105 0.089 0.067 0.047 0.031 0.020 0.027
[  56/  57]      layers.5.feed_forward.w2.weight -      768 x   288, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.027 0.019 0.030 0.046 0.066 0.087 0.108 0.234 0.107 0.088 0.065 0.047 0.030 0.019 0.027
[  57/  57]      layers.5.feed_forward.w3.weight -      288 x   768, type =    f32, quantizing to q8_0 .. size =     0.84 MB ->     0.22 MB | hist: 0.000 0.027 0.020 0.032 0.047 0.068 0.087 0.106 0.226 0.106 0.087 0.068 0.048 0.032 0.020 0.027
llama_model_quantize_internal: model size  =    93.11 MB
llama_model_quantize_internal: quant size  =    24.74 MB
llama_model_quantize_internal: hist: 0.000 0.024 0.014 0.021 0.026 0.044 0.067 0.107 0.357 0.121 0.069 0.050 0.029 0.022 0.022 0.028

main: quantize time =    63.48 ms
main:    total time =    63.48 ms
klosax commented 1 year ago

Perplexity over wiki.test.raw (ctx512/batch512):

model           F32         F16        Q8_0       Q5_1       Q4_0       Q2_K
stories-15M     8985.72     8983.72    8957.02    9229.97    9780.31    n/a
stories-42M     257128.74
stories-110M    1815.13     1818.58    1814.13    1827.72    1887.04    2863.20

The 42M model is broken.

byte-6174 commented 1 year ago

Hmm, curious what it was doing wrong?! Yes, the 42M model has an n_ff that doesn't conform to the ggml formula mentioned above.

byte-6174 commented 1 year ago

Also, what do I compare those numbers to?!

klosax commented 1 year ago

Perplexity measures how good the model is at predicting the contents of a dataset. A lower number is better. See https://github.com/ggerganov/llama.cpp#quantization
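
Concretely, perplexity is the exponential of the average negative log-likelihood the model assigns to each token of the dataset; a small sketch of the definition (not llama.cpp's actual implementation):

```c
// Sketch: perplexity = exp(average negative log-likelihood per token).
// logprobs[i] is the model's log-probability of the i-th actual token.
#include <math.h>

static double perplexity(const double * logprobs, int n_tokens) {
    double nll = 0.0;
    for (int i = 0; i < n_tokens; i++) {
        nll -= logprobs[i];
    }
    return exp(nll / n_tokens);
}
```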

byte-6174 commented 1 year ago

Ah, I see, so we look at the difference between F32 and the quantized models. I.e., for 15M we have F16 (and Q8?) performing near F32, but Q5 and Q4 are too far off; and for 110M we have everything through Q4 working okay compared to its F32 counterpart.

But we have no way to tell how good or bad the converted F32 model itself is in the first place, right?

Also, can you please specify how I can generate these numbers? What is missing from my command above?!

klosax commented 1 year ago

Yes, the numbers for the different quantization levels are nearly the same.

The dataset is the file wiki.test.raw, extracted from the zip file https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip?ref=blog.salesforceairesearch.com

To run the test use: ./perplexity -c 512 -b 512 -f wiki.test.raw -m model.bin. When finished, the last number printed is the measured perplexity over the whole file.

klosax commented 1 year ago

You can compare the output from the converted F32 model with the original llama2.c output to see if it is different. Use the exact same random seed and prompt.

klosax commented 1 year ago

What I can tell is that the output from both the 15M and 110M ggml models looks as expected for such small models.

klosax commented 1 year ago

My mistake, you should use ctx 256 for these small models: ./perplexity -c 256 -b 256 -f wiki.test.raw -m model.bin

byte-6174 commented 1 year ago

Still no score. I'm now running: ./perplexity -m ~/Projects/llama/llama.cpp.fork/llama.cpp/abcd3-Q8_0.bin -f wiki.test.tokens -c 256 -b 256

klosax commented 1 year ago

The data file should be wiki.test.raw, not wiki.test.tokens?

Edit: Checked, and the file has been renamed.

byte-6174 commented 1 year ago

The zip from the above URL has only 3 files, all ending with .tokens. But I found another file named wikitext-2-raw.zip; using the raw file from it now, 7 mins.

klosax commented 1 year ago

I found the problem: all the models should have n_mult = 32. The 42M model works now!
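
That is consistent with the formula quoted earlier. Assuming the stories42M checkpoint reports dim = 512 and hidden_dim = 1376 (worth double-checking against its header), n_mult = 32 reproduces the hidden dimension exactly while n_mult = 256 does not, and the 15M size still comes out the same; reusing the small check from before:

```c
// Hedged check: with the assumed stories42M dims (dim = 512, hidden_dim = 1376),
// the n_ff formula only matches for n_mult = 32.
#include <stdint.h>
#include <stdio.h>

static uint32_t calc_n_ff(uint32_t n_embd, uint32_t n_mult) {
    return ((2*(4*n_embd)/3 + n_mult - 1)/n_mult)*n_mult;
}

int main(void) {
    printf("42M, n_mult=256: %u\n", calc_n_ff(512, 256)); // 1536, does not match 1376
    printf("42M, n_mult=32 : %u\n", calc_n_ff(512, 32));  // 1376, matches
    printf("15M, n_mult=32 : %u\n", calc_n_ff(288, 32));  // still 768
    return 0;
}
```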