ggerganov / llama.cpp

LLM inference in C/C++
MIT License

llama.cpp output vs huggingface output #3030

Closed rjtshrm closed 6 months ago

rjtshrm commented 1 year ago

I fine-tuned a Llama 2 model using PEFT LoRA, then merged the model and saved it to disk. I added a special token <|end|> and trained on it. If I run inference through the Hugging Face model API, it gives me good results.

However, since llama.cpp does not support special tokens yet, I changed eos_token_id in the config.json file to the id of <|end|>. The output now stops after the answer, but it contains weird black dots and sometimes special characters, which is not the case with Hugging Face. You can see the screenshots below.

What could be the reason for this? Do I have to play with the parameters, or does converting the weights from HF to GGUF format affect llama.cpp's output quality? I am using the quantized model.

[Screenshots: three images of the llama.cpp output]

Mihaiii commented 1 year ago

Out of curiosity, what do you mean by "I added a special token <|end|> and trained on it"?

Did you expand the vocab, or did you use <|end|> in all your training inputs as a "trigger" for future sampling (or for some other reason)?

staviq commented 1 year ago

If you run this on a recent build through ./main, it should generate a debug log that includes the raw tokens. It would be helpful if you could upload the log here. You can also look at the debug log yourself; it will tell you exactly which tokens those black dots map to.
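
As a quick complementary check (not a substitute for the debug log), the pasted output text can be inspected for characters outside printable ASCII. This is only a sketch: `describe_odd_chars` is a hypothetical helper, and it assumes the black dots are replacement or control characters in the decoded text, which the thread does not confirm.

```python
import unicodedata


def describe_odd_chars(text):
    """Return (index, code point, Unicode name) for characters outside printable ASCII."""
    report = []
    for i, ch in enumerate(text):
        # Skip newlines, tabs, and the printable ASCII range (space through tilde).
        if ch in "\n\t" or " " <= ch <= "~":
            continue
        # Control characters have no Unicode name, so fall back to a placeholder.
        report.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return report
```

Calling it on the copied model output lists the position and code point of each suspicious character, which can then be compared against the raw tokens in the debug log.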

rjtshrm commented 1 year ago

@Mihaiii Since the original Llama is not trained on an EOS token, I added <|end|> at the end of each prompt response in my fine-tuning data and also added it to the vocab, as you can see in the code snippet below:

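from transformers import AddedToken
# Register <|end|> as an additional special token on the existing tokenizer.
tokenizer.add_special_tokens({
    "additional_special_tokens": [AddedToken("<|end|>")]
})
# Grow the embedding matrix so the new token id has its own row.
model.resize_token_embeddings(len(tokenizer))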

During inference, I added a stopping_criteria based on this special token so that it doesn't generate an endless sequence.
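
For readers unfamiliar with the idea, a stopping rule like this amounts to ending generation as soon as the id of <|end|> is sampled. The following is a minimal sketch of that logic in plain Python, not the Hugging Face StoppingCriteria API itself; `generate_until_stop` and `next_token` are hypothetical names, and any concrete stop id passed in is an assumption.

```python
from typing import Callable, List


def generate_until_stop(next_token: Callable[[List[int]], int],
                        prompt_ids: List[int],
                        stop_id: int,
                        max_new_tokens: int = 256) -> List[int]:
    """Collect tokens from next_token until stop_id appears or the budget runs out."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token(ids)  # stand-in for one sampling step of the model
        if token == stop_id:
            break  # end generation; the stop token itself is not kept
        ids.append(token)
    # Return only the newly generated tokens, without the prompt.
    return ids[len(prompt_ids):]
```

Without such a rule (or a matching EOS id), the only thing that ends generation is the token budget, which is why the stop id has to agree between training and inference.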

Since llama.cpp does not use the special tokens config or any stopping criteria, I set eos_token_id in config.json to the id of <|end|> manually. I can see in the logs that it prints <|end|> as the EOS token.
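
The manual edit described above could also be scripted. This is a sketch only: `set_eos_token_id` is a hypothetical helper, and it assumes config.json is an ordinary JSON file with a top-level eos_token_id key.

```python
import json
from pathlib import Path


def set_eos_token_id(config_path, eos_token_id):
    """Overwrite eos_token_id in a config.json file and return the previous value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("eos_token_id")  # None if the key was absent
    config["eos_token_id"] = eos_token_id
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous
```

On the Hugging Face side, the id to pass in can be looked up with tokenizer.convert_tokens_to_ids("<|end|>") rather than typed by hand, which avoids pointing EOS at the wrong token.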

rjtshrm commented 1 year ago

@staviq Here is the log. I don't see the token mappings printed in it. I also wonder why it sometimes adds unrelated text like "nobody.com" in this example, or out-of-context text if I rerun it multiple times, which I don't get when I run inference with the Hugging Face model.

> Log start
> main: build = 1178 (2ba85c8)
> main: seed  = 1693945519
> llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from models/lora-7b/ggml-model-q4_0.gguf (version GGUF V2 (latest))
> llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  4096, 32001,     1,     1 ]
> llama_model_loader: - tensor    1:              blk.0.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor    2:              blk.0.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor    3:              blk.0.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor    4:         blk.0.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor    8:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor    9:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   10:              blk.1.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   11:              blk.1.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   12:              blk.1.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   13:         blk.1.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   14:            blk.1.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   15:              blk.1.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   16:            blk.1.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor   17:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   18:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   19:              blk.2.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   20:              blk.2.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   21:              blk.2.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   22:         blk.2.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   23:            blk.2.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   24:              blk.2.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   25:            blk.2.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor   26:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   27:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   28:              blk.3.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   29:              blk.3.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   30:              blk.3.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   31:         blk.3.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   32:            blk.3.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   33:              blk.3.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   34:            blk.3.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor   35:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   36:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   37:              blk.4.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   38:              blk.4.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   39:              blk.4.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   40:         blk.4.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   41:            blk.4.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   42:              blk.4.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   43:            blk.4.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor   44:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   45:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   46:              blk.5.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   47:              blk.5.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   48:              blk.5.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   49:         blk.5.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   50:            blk.5.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   51:              blk.5.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   52:            blk.5.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor   53:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   54:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   55:              blk.6.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   56:              blk.6.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   57:              blk.6.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   58:         blk.6.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   59:            blk.6.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   60:              blk.6.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   61:            blk.6.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor   62:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   63:            blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   64:              blk.7.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   65:              blk.7.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   66:              blk.7.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   67:         blk.7.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   68:            blk.7.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   69:              blk.7.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   70:            blk.7.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor   71:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   72:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   73:              blk.8.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   74:              blk.8.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   75:              blk.8.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   76:         blk.8.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   77:            blk.8.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   78:              blk.8.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   79:            blk.8.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor   80:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   81:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   82:              blk.9.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   83:              blk.9.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   84:              blk.9.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   85:         blk.9.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   86:            blk.9.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   87:              blk.9.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   88:            blk.9.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor   89:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   90:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   91:             blk.10.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   92:             blk.10.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   93:             blk.10.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   94:        blk.10.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor   95:           blk.10.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   96:             blk.10.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor   97:           blk.10.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor   98:          blk.10.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor   99:           blk.10.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  100:             blk.11.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  101:             blk.11.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  102:             blk.11.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  103:        blk.11.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  104:           blk.11.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  105:             blk.11.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  106:           blk.11.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  107:          blk.11.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  108:           blk.11.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  109:             blk.12.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  110:             blk.12.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  111:             blk.12.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  112:        blk.12.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  113:           blk.12.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  114:             blk.12.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  115:           blk.12.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  116:          blk.12.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  117:           blk.12.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  118:             blk.13.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  119:             blk.13.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  120:             blk.13.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  121:        blk.13.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  122:           blk.13.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  123:             blk.13.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  124:           blk.13.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  125:          blk.13.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  126:           blk.13.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  127:             blk.14.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  128:             blk.14.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  129:             blk.14.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  130:        blk.14.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  131:           blk.14.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  132:             blk.14.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  133:           blk.14.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  134:          blk.14.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  135:           blk.14.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  136:             blk.15.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  137:             blk.15.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  138:             blk.15.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  139:        blk.15.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  140:           blk.15.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  141:             blk.15.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  142:           blk.15.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  143:          blk.15.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  144:           blk.15.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  145:             blk.16.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  146:             blk.16.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  147:             blk.16.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  148:        blk.16.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  149:           blk.16.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  150:             blk.16.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  151:           blk.16.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  152:          blk.16.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  153:           blk.16.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  154:             blk.17.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  155:             blk.17.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  156:             blk.17.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  157:        blk.17.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  158:           blk.17.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  159:             blk.17.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  160:           blk.17.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  161:          blk.17.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  162:           blk.17.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  163:             blk.18.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  164:             blk.18.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  165:             blk.18.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  166:        blk.18.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  167:           blk.18.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  168:             blk.18.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  169:           blk.18.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  170:          blk.18.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  171:           blk.18.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  172:             blk.19.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  173:             blk.19.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  174:             blk.19.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  175:        blk.19.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  176:           blk.19.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  177:             blk.19.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  178:           blk.19.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  179:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  180:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  181:             blk.20.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  182:             blk.20.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  183:             blk.20.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  184:        blk.20.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  185:           blk.20.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  186:             blk.20.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  187:           blk.20.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  188:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  189:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  190:             blk.21.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  191:             blk.21.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  192:             blk.21.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  193:        blk.21.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  194:           blk.21.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  195:             blk.21.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  196:           blk.21.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  197:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  198:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  199:             blk.22.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  200:             blk.22.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  201:             blk.22.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  202:        blk.22.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  203:           blk.22.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  204:             blk.22.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  205:           blk.22.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  206:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  207:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  208:             blk.23.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  209:             blk.23.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  210:             blk.23.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  211:        blk.23.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  212:           blk.23.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  213:             blk.23.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  214:           blk.23.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  215:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  216:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  217:             blk.24.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  218:             blk.24.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  219:             blk.24.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  220:        blk.24.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  221:           blk.24.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  222:             blk.24.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  223:           blk.24.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  224:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  225:           blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  226:             blk.25.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  227:             blk.25.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  228:             blk.25.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  229:        blk.25.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  230:           blk.25.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  231:             blk.25.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  232:           blk.25.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  233:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  234:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  235:             blk.26.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  236:             blk.26.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  237:             blk.26.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  238:        blk.26.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  239:           blk.26.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  240:             blk.26.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  241:           blk.26.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  242:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  243:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  244:             blk.27.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  245:             blk.27.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  246:             blk.27.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  247:        blk.27.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  248:           blk.27.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  249:             blk.27.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  250:           blk.27.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  251:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  252:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  253:             blk.28.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  254:             blk.28.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  255:             blk.28.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  256:        blk.28.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  257:           blk.28.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  258:             blk.28.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  259:           blk.28.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  260:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  261:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  262:             blk.29.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  263:             blk.29.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  264:             blk.29.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  265:        blk.29.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  266:           blk.29.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  267:             blk.29.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  268:           blk.29.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  269:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  270:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  271:             blk.30.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  272:             blk.30.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  273:             blk.30.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  274:        blk.30.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  275:           blk.30.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  276:             blk.30.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  277:           blk.30.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  278:          blk.30.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  279:           blk.30.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  280:             blk.31.attn_q.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  281:             blk.31.attn_k.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  282:             blk.31.attn_v.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  283:        blk.31.attn_output.weight q4_0     [  4096,  4096,     1,     1 ]
> llama_model_loader: - tensor  284:           blk.31.ffn_gate.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  285:             blk.31.ffn_up.weight q4_0     [  4096, 11008,     1,     1 ]
> llama_model_loader: - tensor  286:           blk.31.ffn_down.weight q4_0     [ 11008,  4096,     1,     1 ]
> llama_model_loader: - tensor  287:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  288:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  289:               output_norm.weight f32      [  4096,     1,     1,     1 ]
> llama_model_loader: - tensor  290:                    output.weight q6_K     [  4096, 32001,     1,     1 ]
> llama_model_loader: - kv   0:                       general.architecture str     
> llama_model_loader: - kv   1:                               general.name str     
> llama_model_loader: - kv   2:                       llama.context_length u32     
> llama_model_loader: - kv   3:                     llama.embedding_length u32     
> llama_model_loader: - kv   4:                          llama.block_count u32     
> llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
> llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
> llama_model_loader: - kv   7:                 llama.attention.head_count u32     
> llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
> llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
> llama_model_loader: - kv  10:                       llama.rope.freq_base f32     
> llama_model_loader: - kv  11:                          general.file_type u32     
> llama_model_loader: - kv  12:                       tokenizer.ggml.model str     
> llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr     
> llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr     
> llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr     
> llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32     
> llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32     
> llama_model_loader: - kv  18:               general.quantization_version u32     
> llama_model_loader: - type  f32:   65 tensors
> llama_model_loader: - type q4_0:  225 tensors
> llama_model_loader: - type q6_K:    1 tensors
> llm_load_print_meta: format         = GGUF V2 (latest)
> llm_load_print_meta: arch           = llama
> llm_load_print_meta: vocab type     = SPM
> llm_load_print_meta: n_vocab        = 32001
> llm_load_print_meta: n_merges       = 0
> llm_load_print_meta: n_ctx_train    = 4096
> llm_load_print_meta: n_ctx          = 512
> llm_load_print_meta: n_embd         = 4096
> llm_load_print_meta: n_head         = 32
> llm_load_print_meta: n_head_kv      = 32
> llm_load_print_meta: n_layer        = 32
> llm_load_print_meta: n_rot          = 128
> llm_load_print_meta: n_gqa          = 1
> llm_load_print_meta: f_norm_eps     = 1.0e-05
> llm_load_print_meta: f_norm_rms_eps = 1.0e-05
> llm_load_print_meta: n_ff           = 11008
> llm_load_print_meta: freq_base      = 10000.0
> llm_load_print_meta: freq_scale     = 1
> llm_load_print_meta: model type     = 7B
> llm_load_print_meta: model ftype    = mostly Q4_0
> llm_load_print_meta: model size     = 6.74 B
> llm_load_print_meta: general.name   = LLaMA v2
> llm_load_print_meta: BOS token = 1 '<s>'
> llm_load_print_meta: EOS token = 32000 '<|end|>'
> llm_load_print_meta: UNK token = 0 '<unk>'
> llm_load_print_meta: LF token  = 13 '<0x0A>'
> llm_load_tensors: ggml ctx size =    0.09 MB
> llm_load_tensors: mem required  = 3647.97 MB (+  256.00 MB per state)
> ..................................................................................................
> llama_new_context_with_model: kv self size  =  256.00 MB
> ggml_metal_init: allocating
> ggml_metal_init: found device: Apple M1 Pro
> ggml_metal_init: picking default device: Apple M1 Pro
> ggml_metal_init: loading '/Users/rsharma/Desktop/llama.cpp/ggml-metal.metal'
> ggml_metal_init: loaded kernel_add                            0x131f07840 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_add_row                        0x131f07f80 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_mul                            0x131f084c0 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_mul_row                        0x131f08b10 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_scale                          0x131f09050 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_silu                           0x131f09590 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_relu                           0x131f09ad0 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_gelu                           0x131f0a010 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_soft_max                       0x131f0a6e0 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_diag_mask_inf                  0x131f0ad60 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_get_rows_f16                   0x131f0b430 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_get_rows_q4_0                  0x131f0bc70 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_get_rows_q4_1                  0x131f0c340 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_get_rows_q8_0                  0x131f0ca10 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_get_rows_q2_K                  0x131f0d0e0 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_get_rows_q3_K                  0x131f0d7b0 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_get_rows_q4_K                  0x131f0de80 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_get_rows_q5_K                  0x131f0e550 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_get_rows_q6_K                  0x131f0ec20 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_rms_norm                       0x131f0f470 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_norm                           0x131f0fb40 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x131f103c0 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mat_f16_f32_1row           0x131f10c40 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x131f11540 | th_max =  896 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x131f11cc0 | th_max =  896 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mat_q8_0_f32               0x131f12440 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x131f12bc0 | th_max =  640 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x131f13540 | th_max =  704 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x131f13cc0 | th_max =  576 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x131f146a0 | th_max =  576 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x131f14e20 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mm_f16_f32                 0x131f155e0 | th_max =  768 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mm_q4_0_f32                0x131f15b20 | th_max =  768 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mm_q8_0_f32                0x131f162e0 | th_max =  768 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mm_q4_1_f32                0x131f16aa0 | th_max =  768 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mm_q2_K_f32                0x131f17260 | th_max =  768 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mm_q3_K_f32                0x131f17a20 | th_max =  768 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mm_q4_K_f32                0x131f181e0 | th_max =  768 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mm_q5_K_f32                0x131f189a0 | th_max =  704 | th_width =   32
> ggml_metal_init: loaded kernel_mul_mm_q6_K_f32                0x131f19160 | th_max =  704 | th_width =   32
> ggml_metal_init: loaded kernel_rope                           0x131f196a0 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_alibi_f32                      0x131f19f80 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_cpy_f32_f16                    0x131f1a830 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_cpy_f32_f32                    0x131f1b0e0 | th_max = 1024 | th_width =   32
> ggml_metal_init: loaded kernel_cpy_f16_f16                    0x131f1b990 | th_max = 1024 | th_width =   32
> ggml_metal_init: recommendedMaxWorkingSetSize  = 10922.67 MB
> ggml_metal_init: hasUnifiedMemory              = true
> ggml_metal_init: maxTransferRate               = built-in GPU
> llama_new_context_with_model: compute buffer total size =   73.47 MB
> llama_new_context_with_model: max tensor size =   102.54 MB
> ggml_metal_add_buffer: allocated 'data            ' buffer, size =  3648.59 MB, ( 3649.03 / 10922.67)
> ggml_metal_add_buffer: allocated 'eval            ' buffer, size =     1.48 MB, ( 3650.52 / 10922.67)
> ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   258.00 MB, ( 3908.52 / 10922.67)
> ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =    72.02 MB, ( 3980.53 / 10922.67)
> 
> system_info: n_threads = 6 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
> 
> main: prompt: '### Instruction: sprichst du English? ### Response: '
> main: number of tokens in prompt = 15
>      1 -> ''
>    835 -> ' ###'
>   2799 -> ' Inst'
>   4080 -> 'ruction'
>  29901 -> ':'
>   7689 -> ' spr'
>    436 -> 'ich'
>    303 -> 'st'
>    868 -> ' du'
>   4223 -> ' English'
>  29973 -> '?'
>    835 -> ' ###'
>  13291 -> ' Response'
>  29901 -> ':'
>  29871 -> ' '
> 
> sampling: repeat_last_n = 64, repeat_penalty = 1.150000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 10, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
> generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
> 
> 
>  ### Instruction: sprichst du English? ### Response: ▅ Ja, ich spreche Englisch. nobody.com [end of text]
> 
> llama_print_timings:        load time =   283.89 ms
> llama_print_timings:      sample time =    10.21 ms /    14 runs   (    0.73 ms per token,  1370.67 tokens per second)
> llama_print_timings: prompt eval time =   154.00 ms /    15 tokens (   10.27 ms per token,    97.40 tokens per second)
> llama_print_timings:        eval time =   367.37 ms /    13 runs   (   28.26 ms per token,    35.39 tokens per second)
> llama_print_timings:       total time =   535.09 ms
> ggml_metal_free: deallocating
> Log end
staviq commented 1 year ago

I'm pretty sure that's the output, not the log file. There should be a file named main.somenumber.log in the directory you ran ./main from.

rjtshrm commented 1 year ago

My mistake, here is the log file

main.0x1e5382080.log

staviq commented 1 year ago

I can't see any weird characters in that particular log you provided, but I noticed you included a space at the end of the prompt.

Long story short: try without the space at the end of the prompt.

Long story long: this is not a 100% rule, but it applies to many models. Take this example: "This is a sentence." It would be tokenized more or less as 'This', ' is', ' a', ' sentence', '.'. Notice how spaces are not separate tokens but get glued to the word on their right to form one token. Many tokens have a space on the left. This is loosely related to how an LLM can differentiate between the start of a sentence, the middle of a sentence, and words that don't have their own token and are represented by a couple of tokens. In that case, the first "subtoken" (that's not really a thing, I'm using it for the purpose of the explanation) includes a space, and the following "subtokens" don't.

For example, the word Instruction: from your prompt gets tokenized as ' Inst', 'ruction', ':' (the first token has a space, the following ones don't).

Similarly, sprichst turns into ' spr', 'ich', 'st'.

So when you end the prompt with a space, that space effectively makes it harder for the LLM to find a word, because it can't really put two spaces together, and instead it tries to match weird "subtokens".

Since an LLM works token by token, and on each iteration the newly generated token gets appended to the prompt and sent back for another iteration, the LLM is oblivious to the fact that a weird token was its own doing and not yours. It gets heavily influenced by the presence of the weird token and starts to use weird tokens more than it should.

I did some testing, and I can basically recreate this problem with any llama2-based model. Including a space at the end of the prompt will almost always make the first generated token not be a word.
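
To make the trailing-space effect concrete, here is a minimal sketch. It assumes a tiny made-up vocabulary (`TOY_VOCAB`) and a simple greedy longest-match rule (`toy_tokenize`); both are hypothetical and are not how llama.cpp or SentencePiece actually pick tokens. The only point is that a space normally travels with the word to its right, so a trailing space has nothing to attach to and ends up as a token of its own, which matches the final `29871 -> ' '` entry in the log above.

```python
# Toy illustration only: TOY_VOCAB is a hypothetical vocabulary, not the real
# LLaMA one. "▁" marks a leading space, as SentencePiece-style tokenizers do.
TOY_VOCAB = {
    "▁###", "▁Inst", "ruction", ":", "▁Response",
    "▁spr", "ich", "st", "▁du", "▁English", "?", "▁",
}


def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenization over TOY_VOCAB (a simplification)."""
    # Prepend a space marker and turn every space into "▁".
    s = ("▁" + text).replace(" ", "▁")
    tokens = []
    i = 0
    while i < len(s):
        # Try the longest candidate substring first.
        for j in range(len(s), i, -1):
            if s[i:j] in TOY_VOCAB:
                tokens.append(s[i:j])
                i = j
                break
        else:
            # Character not covered by the toy vocabulary: emit it as-is.
            tokens.append(s[i])
            i += 1
    return tokens


without_space = toy_tokenize("### Response:")
with_space = toy_tokenize("### Response: ")
print(without_space)
print(with_space)
```

In the second call the last token is a bare "▁": the model is then asked to continue after a space that is not attached to any word, which is the situation described above.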

staviq commented 1 year ago

@Cebtenzzre I just noticed that including a space at the end of the prompt makes the output always the same, regardless of the seed.

Edit: with different models too (different output for different models, but always the same on each run, independent of the seed).

cebtenzzre commented 1 year ago

Sorry, I only have limited understanding of the tokenizer. @ggerganov will have to take a look. Trailing spaces are probably supposed to be stripped at some level, as I believe they normally belong to the following word.

staviq commented 1 year ago

> Sorry, I only have limited understanding of the tokenizer. @ggerganov will have to take a look. Trailing spaces are probably supposed to be stripped at some level, as I believe they normally belong to the following word.

Scratch that, I just realized the RNG seed is only used in sampling, and I had temp at 0 for testing. For some reason I thought generation was stochastic, like Markov chains.

So the only problem seems to be allowing a trailing space, which causes a high probability of weird tokens.
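
The seed observation can be illustrated with a small sketch. It assumes that a temperature of 0 is treated as greedy (argmax) selection; `sample_token` is a hypothetical helper written for this illustration, not llama.cpp's sampler. With greedy selection the random number generator is never consulted, so every seed gives the same token.

```python
import math
import random


def sample_token(logits: list[float], temperature: float, rng: random.Random) -> int:
    """Pick a token index from logits; temperature <= 0 is treated as greedy."""
    if temperature <= 0.0:
        # Greedy decoding: the RNG is never used, so the seed cannot matter.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax weights, shifted by the max for stability.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.0, 1.5]  # made-up scores for three candidate tokens

# Temperature 0: identical result for every seed.
greedy = {sample_token(logits, 0.0, random.Random(seed)) for seed in range(10)}
print(greedy)

# Temperature 0.8: the result depends on the RNG state, so it may vary by seed.
sampled = [sample_token(logits, 0.8, random.Random(seed)) for seed in range(10)]
print(sampled)
```

Under this assumption, identical output across seeds at temp 0 is expected behaviour and not a symptom of the trailing-space problem.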

ggerganov commented 1 year ago

Not sure but this could be related to #2421

Try to use Q4_K, Q4_1 or Q5_K quantization if possible and see if the issue persists. Also, there is no need to add the trailing space in your prompt. Let me know the results of this experiment + new log file(s).

github-actions[bot] commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.