Closed (rjtshrm closed this 6 months ago)
Out of curiosity, what do you mean by "I added a special token <|end|> and trained on it."?
Did you expand the vocab, or did you use <|end|> in all your training inputs as a "trigger" for future sampling (or for some other reason)?
If you run this on a recent build through ./main, it should generate a debug log that includes the raw tokens.
It would be helpful if you could upload the log here. You can also look at the debug log yourself: it will tell you exactly which tokens those black dots map to.
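For reference, the token dump in such a log uses lines of the form `835 -> ' ###'`. A minimal Python sketch for pulling the id-to-text mapping out of a saved log might look like the following. The function name `parse_token_lines` and the sample text are illustrative and not part of llama.cpp, and the regex assumes the `id -> 'piece'` layout with an optional leading `> ` quote marker.

```python
import re

# Matches lines such as "> 835 -> ' ###'" (the leading "> " is optional).
TOKEN_LINE = re.compile(r"^(?:>\s*)?(\d+)\s*->\s*'(.*)'\s*$")

def parse_token_lines(log_text):
    """Return a list of (token_id, piece) pairs found in the log text."""
    pairs = []
    for line in log_text.splitlines():
        match = TOKEN_LINE.match(line.strip())
        if match:
            pairs.append((int(match.group(1)), match.group(2)))
    return pairs

sample = """> main: number of tokens in prompt = 15
> 1 -> ''
> 835 -> ' ###'
> 29901 -> ':'
"""
for token_id, piece in parse_token_lines(sample):
    print(token_id, repr(piece))
```

Printing each piece with `repr` makes whitespace-only or non-printable pieces visible, which is the point when trying to work out what an odd glyph in the output corresponds to.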
@Mihaiii Since the original llama is not trained on an EOS token, I added <|end|> at the end of each prompt response in my fine-tuning data and also added it to the vocab, as you can see in the code snippet below:
tokenizer.add_special_tokens({
    "additional_special_tokens": [AddedToken("<|end|>")]
})
model.resize_token_embeddings(len(tokenizer))
During inference I added a stopping_criteria based on this special token so that it doesn't generate an endless sequence.
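The stopping rule described here can be sketched without any model at all. The following is a minimal, library-free illustration and not the Hugging Face `StoppingCriteria` API: `generate_until_stop` and the canned token stream are hypothetical, and the stop id 32000 is taken from the `EOS token = 32000 '<|end|>'` line in the log.

```python
def generate_until_stop(next_token, stop_id, max_new_tokens=64):
    """Collect tokens from next_token() until stop_id appears or the budget runs out."""
    output = []
    for _ in range(max_new_tokens):
        token = next_token(output)
        if token == stop_id:
            break  # the stop token itself is not kept in the output
        output.append(token)
    return output

# Stand-in for a model: replays a fixed sequence of token ids (hypothetical data).
canned = iter([4223, 29973, 32000, 835, 13291])
result = generate_until_stop(lambda history: next(canned), stop_id=32000)
print(result)  # [4223, 29973]
```

The `max_new_tokens` budget is the fallback: if the stop token is never produced, generation still ends.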
Since llama.cpp does not use special_token.config or any stopping criteria, I set the id of <|end|> in the config.json manually. I can see in the logs that it prints the EOS token as <|end|>.
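A minimal sketch of that manual edit, using only the Python standard library. The helper name `set_eos_token_id`, the temporary directory, and the three-key config are illustrative. `eos_token_id` is the key used in Hugging Face `config.json` files, and 32000 is the id reported for `<|end|>` in the log. Whether a given conversion script reads this key is an assumption based on the description above.

```python
import json
import os
import tempfile

def set_eos_token_id(config_path, eos_token_id):
    """Rewrite config.json so that eos_token_id points at the new token."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    config["eos_token_id"] = eos_token_id
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return config

# Demonstration on a throwaway file; a real config.json has many more keys.
workdir = tempfile.mkdtemp()
config_path = os.path.join(workdir, "config.json")
with open(config_path, "w", encoding="utf-8") as f:
    json.dump({"bos_token_id": 1, "eos_token_id": 2, "vocab_size": 32001}, f)

updated = set_eos_token_id(config_path, 32000)
print(updated["eos_token_id"])
```

Editing the file through `json` rather than by hand avoids leaving it syntactically invalid, and the other keys are written back unchanged.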
@staviq here is the log. I don't see the token mapping printed in it. I also wonder why it sometimes adds unrelated text, like "nobody.com" in this example, or out-of-context text when I rerun it multiple times. I don't get this when I run inference with the Hugging Face model.
> Log start
> main: build = 1178 (2ba85c8)
> main: seed = 1693945519
> llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from models/lora-7b/ggml-model-q4_0.gguf (version GGUF V2 (latest))
> llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 4096, 32001, 1, 1 ]
> llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 2: blk.0.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 3: blk.0.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 4: blk.0.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 5: blk.0.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 6: blk.0.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 7: blk.0.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 10: blk.1.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 11: blk.1.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 12: blk.1.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 13: blk.1.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 14: blk.1.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 15: blk.1.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 16: blk.1.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 17: blk.1.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 18: blk.1.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 19: blk.2.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 20: blk.2.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 21: blk.2.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 22: blk.2.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 23: blk.2.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 24: blk.2.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 25: blk.2.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 26: blk.2.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 27: blk.2.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 28: blk.3.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 29: blk.3.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 30: blk.3.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 31: blk.3.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 32: blk.3.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 33: blk.3.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 34: blk.3.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 35: blk.3.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 36: blk.3.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 37: blk.4.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 38: blk.4.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 39: blk.4.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 40: blk.4.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 41: blk.4.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 42: blk.4.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 43: blk.4.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 44: blk.4.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 45: blk.4.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 46: blk.5.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 47: blk.5.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 48: blk.5.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 49: blk.5.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 50: blk.5.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 51: blk.5.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 52: blk.5.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 53: blk.5.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 54: blk.5.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 55: blk.6.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 56: blk.6.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 57: blk.6.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 58: blk.6.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 59: blk.6.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 60: blk.6.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 61: blk.6.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 62: blk.6.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 63: blk.6.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 64: blk.7.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 65: blk.7.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 66: blk.7.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 67: blk.7.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 68: blk.7.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 69: blk.7.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 70: blk.7.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 71: blk.7.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 72: blk.7.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 73: blk.8.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 74: blk.8.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 75: blk.8.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 76: blk.8.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 77: blk.8.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 78: blk.8.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 79: blk.8.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 80: blk.8.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 81: blk.8.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 82: blk.9.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 83: blk.9.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 84: blk.9.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 85: blk.9.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 86: blk.9.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 87: blk.9.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 88: blk.9.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 89: blk.9.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 90: blk.9.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 91: blk.10.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 92: blk.10.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 93: blk.10.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 94: blk.10.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 95: blk.10.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 96: blk.10.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 97: blk.10.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 98: blk.10.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 99: blk.10.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 100: blk.11.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 101: blk.11.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 102: blk.11.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 103: blk.11.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 104: blk.11.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 105: blk.11.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 106: blk.11.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 107: blk.11.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 108: blk.11.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 109: blk.12.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 110: blk.12.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 111: blk.12.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 112: blk.12.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 113: blk.12.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 114: blk.12.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 115: blk.12.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 116: blk.12.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 117: blk.12.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 118: blk.13.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 119: blk.13.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 120: blk.13.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 121: blk.13.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 122: blk.13.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 123: blk.13.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 124: blk.13.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 125: blk.13.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 126: blk.13.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 127: blk.14.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 128: blk.14.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 129: blk.14.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 130: blk.14.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 131: blk.14.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 132: blk.14.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 133: blk.14.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 134: blk.14.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 135: blk.14.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 136: blk.15.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 137: blk.15.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 138: blk.15.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 139: blk.15.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 140: blk.15.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 141: blk.15.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 142: blk.15.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 143: blk.15.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 144: blk.15.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 145: blk.16.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 146: blk.16.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 147: blk.16.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 148: blk.16.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 149: blk.16.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 150: blk.16.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 151: blk.16.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 152: blk.16.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 153: blk.16.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 154: blk.17.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 155: blk.17.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 156: blk.17.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 157: blk.17.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 158: blk.17.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 159: blk.17.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 160: blk.17.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 161: blk.17.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 162: blk.17.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 163: blk.18.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 164: blk.18.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 165: blk.18.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 166: blk.18.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 167: blk.18.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 168: blk.18.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 169: blk.18.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 170: blk.18.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 171: blk.18.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 172: blk.19.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 173: blk.19.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 174: blk.19.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 175: blk.19.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 176: blk.19.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 177: blk.19.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 178: blk.19.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 179: blk.19.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 180: blk.19.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 181: blk.20.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 182: blk.20.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 183: blk.20.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 184: blk.20.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 185: blk.20.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 186: blk.20.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 187: blk.20.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 188: blk.20.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 189: blk.20.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 190: blk.21.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 191: blk.21.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 192: blk.21.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 193: blk.21.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 194: blk.21.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 195: blk.21.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 196: blk.21.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 197: blk.21.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 198: blk.21.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 199: blk.22.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 200: blk.22.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 201: blk.22.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 202: blk.22.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 203: blk.22.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 204: blk.22.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 205: blk.22.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 206: blk.22.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 207: blk.22.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 208: blk.23.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 209: blk.23.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 210: blk.23.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 211: blk.23.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 212: blk.23.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 213: blk.23.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 214: blk.23.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 215: blk.23.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 216: blk.23.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 217: blk.24.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 218: blk.24.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 219: blk.24.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 220: blk.24.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 221: blk.24.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 222: blk.24.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 223: blk.24.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 224: blk.24.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 225: blk.24.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 226: blk.25.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 227: blk.25.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 228: blk.25.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 229: blk.25.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 230: blk.25.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 231: blk.25.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 232: blk.25.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 233: blk.25.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 234: blk.25.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 235: blk.26.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 236: blk.26.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 237: blk.26.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 238: blk.26.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 239: blk.26.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 240: blk.26.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 241: blk.26.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 242: blk.26.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 243: blk.26.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 244: blk.27.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 245: blk.27.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 246: blk.27.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 247: blk.27.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 248: blk.27.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 249: blk.27.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 250: blk.27.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 251: blk.27.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 252: blk.27.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 253: blk.28.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 254: blk.28.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 255: blk.28.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 256: blk.28.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 257: blk.28.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 258: blk.28.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 259: blk.28.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 260: blk.28.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 261: blk.28.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 262: blk.29.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 263: blk.29.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 264: blk.29.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 265: blk.29.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 266: blk.29.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 267: blk.29.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 268: blk.29.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 269: blk.29.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 270: blk.29.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 271: blk.30.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 272: blk.30.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 273: blk.30.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 274: blk.30.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 275: blk.30.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 276: blk.30.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 277: blk.30.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 278: blk.30.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 279: blk.30.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 280: blk.31.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 281: blk.31.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 282: blk.31.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 283: blk.31.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
> llama_model_loader: - tensor 284: blk.31.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 285: blk.31.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
> llama_model_loader: - tensor 286: blk.31.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
> llama_model_loader: - tensor 287: blk.31.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 288: blk.31.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 289: output_norm.weight f32 [ 4096, 1, 1, 1 ]
> llama_model_loader: - tensor 290: output.weight q6_K [ 4096, 32001, 1, 1 ]
> llama_model_loader: - kv 0: general.architecture str
> llama_model_loader: - kv 1: general.name str
> llama_model_loader: - kv 2: llama.context_length u32
> llama_model_loader: - kv 3: llama.embedding_length u32
> llama_model_loader: - kv 4: llama.block_count u32
> llama_model_loader: - kv 5: llama.feed_forward_length u32
> llama_model_loader: - kv 6: llama.rope.dimension_count u32
> llama_model_loader: - kv 7: llama.attention.head_count u32
> llama_model_loader: - kv 8: llama.attention.head_count_kv u32
> llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
> llama_model_loader: - kv 10: llama.rope.freq_base f32
> llama_model_loader: - kv 11: general.file_type u32
> llama_model_loader: - kv 12: tokenizer.ggml.model str
> llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
> llama_model_loader: - kv 14: tokenizer.ggml.scores arr
> llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
> llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32
> llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32
> llama_model_loader: - kv 18: general.quantization_version u32
> llama_model_loader: - type f32: 65 tensors
> llama_model_loader: - type q4_0: 225 tensors
> llama_model_loader: - type q6_K: 1 tensors
> llm_load_print_meta: format = GGUF V2 (latest)
> llm_load_print_meta: arch = llama
> llm_load_print_meta: vocab type = SPM
> llm_load_print_meta: n_vocab = 32001
> llm_load_print_meta: n_merges = 0
> llm_load_print_meta: n_ctx_train = 4096
> llm_load_print_meta: n_ctx = 512
> llm_load_print_meta: n_embd = 4096
> llm_load_print_meta: n_head = 32
> llm_load_print_meta: n_head_kv = 32
> llm_load_print_meta: n_layer = 32
> llm_load_print_meta: n_rot = 128
> llm_load_print_meta: n_gqa = 1
> llm_load_print_meta: f_norm_eps = 1.0e-05
> llm_load_print_meta: f_norm_rms_eps = 1.0e-05
> llm_load_print_meta: n_ff = 11008
> llm_load_print_meta: freq_base = 10000.0
> llm_load_print_meta: freq_scale = 1
> llm_load_print_meta: model type = 7B
> llm_load_print_meta: model ftype = mostly Q4_0
> llm_load_print_meta: model size = 6.74 B
> llm_load_print_meta: general.name = LLaMA v2
> llm_load_print_meta: BOS token = 1 '<s>'
> llm_load_print_meta: EOS token = 32000 '<|end|>'
> llm_load_print_meta: UNK token = 0 '<unk>'
> llm_load_print_meta: LF token = 13 '<0x0A>'
> llm_load_tensors: ggml ctx size = 0.09 MB
> llm_load_tensors: mem required = 3647.97 MB (+ 256.00 MB per state)
> ..................................................................................................
> llama_new_context_with_model: kv self size = 256.00 MB
> ggml_metal_init: allocating
> ggml_metal_init: found device: Apple M1 Pro
> ggml_metal_init: picking default device: Apple M1 Pro
> ggml_metal_init: loading '/Users/rsharma/Desktop/llama.cpp/ggml-metal.metal'
> ggml_metal_init: loaded kernel_add 0x131f07840 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_add_row 0x131f07f80 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_mul 0x131f084c0 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_mul_row 0x131f08b10 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_scale 0x131f09050 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_silu 0x131f09590 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_relu 0x131f09ad0 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_gelu 0x131f0a010 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_soft_max 0x131f0a6e0 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_diag_mask_inf 0x131f0ad60 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_get_rows_f16 0x131f0b430 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_get_rows_q4_0 0x131f0bc70 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_get_rows_q4_1 0x131f0c340 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_get_rows_q8_0 0x131f0ca10 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_get_rows_q2_K 0x131f0d0e0 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_get_rows_q3_K 0x131f0d7b0 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_get_rows_q4_K 0x131f0de80 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_get_rows_q5_K 0x131f0e550 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_get_rows_q6_K 0x131f0ec20 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_rms_norm 0x131f0f470 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_norm 0x131f0fb40 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x131f103c0 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mat_f16_f32_1row 0x131f10c40 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x131f11540 | th_max = 896 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x131f11cc0 | th_max = 896 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mat_q8_0_f32 0x131f12440 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x131f12bc0 | th_max = 640 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x131f13540 | th_max = 704 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x131f13cc0 | th_max = 576 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x131f146a0 | th_max = 576 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x131f14e20 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x131f155e0 | th_max = 768 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x131f15b20 | th_max = 768 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mm_q8_0_f32 0x131f162e0 | th_max = 768 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x131f16aa0 | th_max = 768 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x131f17260 | th_max = 768 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x131f17a20 | th_max = 768 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x131f181e0 | th_max = 768 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x131f189a0 | th_max = 704 | th_width = 32
> ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x131f19160 | th_max = 704 | th_width = 32
> ggml_metal_init: loaded kernel_rope 0x131f196a0 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_alibi_f32 0x131f19f80 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_cpy_f32_f16 0x131f1a830 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_cpy_f32_f32 0x131f1b0e0 | th_max = 1024 | th_width = 32
> ggml_metal_init: loaded kernel_cpy_f16_f16 0x131f1b990 | th_max = 1024 | th_width = 32
> ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
> ggml_metal_init: hasUnifiedMemory = true
> ggml_metal_init: maxTransferRate = built-in GPU
> llama_new_context_with_model: compute buffer total size = 73.47 MB
> llama_new_context_with_model: max tensor size = 102.54 MB
> ggml_metal_add_buffer: allocated 'data ' buffer, size = 3648.59 MB, ( 3649.03 / 10922.67)
> ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1.48 MB, ( 3650.52 / 10922.67)
> ggml_metal_add_buffer: allocated 'kv ' buffer, size = 258.00 MB, ( 3908.52 / 10922.67)
> ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 72.02 MB, ( 3980.53 / 10922.67)
>
> system_info: n_threads = 6 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
>
> main: prompt: '### Instruction: sprichst du English? ### Response: '
> main: number of tokens in prompt = 15
> 1 -> ''
> 835 -> ' ###'
> 2799 -> ' Inst'
> 4080 -> 'ruction'
> 29901 -> ':'
> 7689 -> ' spr'
> 436 -> 'ich'
> 303 -> 'st'
> 868 -> ' du'
> 4223 -> ' English'
> 29973 -> '?'
> 835 -> ' ###'
> 13291 -> ' Response'
> 29901 -> ':'
> 29871 -> ' '
>
> sampling: repeat_last_n = 64, repeat_penalty = 1.150000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 10, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
> generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
>
>
> ### Instruction: sprichst du English? ### Response: ▅ Ja, ich spreche Englisch. nobody.com [end of text]
>
> llama_print_timings: load time = 283.89 ms
> llama_print_timings: sample time = 10.21 ms / 14 runs ( 0.73 ms per token, 1370.67 tokens per second)
> llama_print_timings: prompt eval time = 154.00 ms / 15 tokens ( 10.27 ms per token, 97.40 tokens per second)
> llama_print_timings: eval time = 367.37 ms / 13 runs ( 28.26 ms per token, 35.39 tokens per second)
> llama_print_timings: total time = 535.09 ms
> ggml_metal_free: deallocating
> Log end
I'm pretty sure that's the output, not the log file. There should be a file named main.somenumber.log in the directory you ran ./main from.
My mistake, here is the log file
I can't see any weird characters in that particular log you provided, but I noticed you included a space at the end of the prompt.
Long story short, try without the space at the end of the prompt.
Long story long: this is not 100% a rule, but it definitely applies to many models. Take this example: "This is a sentence." It would more or less be tokenized as 'This', ' is', ' a', ' sentence', '.'
Notice how spaces are not separate tokens but rather get glued to the word on their right to make one token. Many tokens have a space on the left. This is vaguely related to how an LLM can differentiate between the start of a sentence, the middle of a sentence, and words that don't have their own tokens and are represented by a couple of tokens. In that last case, the first "subtoken" (that's not really a thing, I'm using it for the purpose of the explanation) includes a space, and the following "subtokens" don't.
For example, the word Instruction: from your prompt gets tokenized as ' Inst', 'ruction', ':' (the first token has a space, the next ones don't).
Similarly, sprichst turns into ' spr', 'ich', 'st'.
So when you end the prompt with a space, that space effectively makes it harder for the LLM to find a word, because it can't really put two spaces together, and instead it tries to match weird "subtokens".
Since an LLM works token by token, and on each iteration the newly found token gets added back to the prompt and sent through another iteration, the LLM is oblivious to the fact that a weird token was its own doing and not yours. It gets heavily inspired by the presence of the weird token and starts to use weird tokens more than it should.
I did some testing, and I can basically recreate this problem with any llama2-based model. Including a space at the end of the prompt will almost always make the first generated token something that is not a word.
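To make the trailing-space effect concrete, here is a minimal sketch. It assumes a hand-picked toy vocabulary (TOY_VOCAB, made up from the pieces printed in the log above) and a greedy longest-match split (toy_tokenize, a hypothetical helper). It is not the real llama tokenizer, which uses a learned vocabulary and merge rules, but it shows why a prompt ending in a space leaves a lone ' ' token at the end.

```python
# Toy vocabulary: a hand-picked subset modelled on the pieces shown in the log.
# This is an illustration only; real tokenizers learn their vocabulary.
TOY_VOCAB = [" ###", " Inst", "ruction", ":", " spr", "ich", "st",
             " du", " English", "?", " Response", " "]


def toy_tokenize(text):
    """Greedy longest-match split over TOY_VOCAB (a simplification)."""
    text = " " + text  # mimic the leading space seen on the first piece in the log
    pieces = []
    pos = 0
    while pos < len(text):
        # Pick the longest toy piece that matches at the current position.
        match = max((p for p in TOY_VOCAB if text.startswith(p, pos)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"no toy piece covers {text[pos:]!r}")
        pieces.append(match)
        pos += len(match)
    return pieces


prompt = "### Instruction: sprichst du English? ### Response: "
print(toy_tokenize(prompt)[-2:])           # last piece is a lone space
print(toy_tokenize(prompt.rstrip())[-2:])  # last piece is ':'
```

Stripping the prompt with rstrip() before passing it on removes the lone space token, so the model is free to pick a normal space-prefixed word as its first token.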
@Cebtenzzre
I just noticed, including a space at the end of the prompt makes the output always the same, regardless of the seed
Edit: with different models too ( different output for different models, but always the same each run, independently from the seed )
Sorry, I only have limited understanding of the tokenizer. @ggerganov will have to take a look. Trailing spaces are probably supposed to be stripped at some level, as I believe they normally belong to the following word.
Scratch that, I just realized rng seed is only used in sampling, and I had temp at 0 for testing. For some reason I thought generation is stochastic like Markov chains.
So the only problem seems to be that a trailing space is allowed, which causes a high probability of weird tokens.
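The seed observation can be sketched as follows. This is a simplified illustration, not llama.cpp's sampler: sample_token is a hypothetical helper, the logits are made up, and it assumes that a temperature of 0 or below means greedy (argmax) selection, in which case the random number generator is never consulted.

```python
import math
import random


def sample_token(logits, temp, seed):
    """Pick an index from logits; temp <= 0 is treated as greedy argmax."""
    if temp <= 0:
        # Greedy path: no randomness involved, so the seed is irrelevant.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = random.Random(seed)
    scaled = [value / temp for value in logits]
    top = max(scaled)
    # Softmax weights, shifted by the max for numerical stability.
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 3.5, 0.2, 3.4]  # made-up values for illustration
greedy = {sample_token(logits, 0.0, seed) for seed in range(100)}
sampled = {sample_token(logits, 0.8, seed) for seed in range(100)}
print(greedy)   # a single index, whatever the seed
print(sampled)  # the indices drawn at temp 0.8 across the same seeds
```

With temp at 0 every seed yields the same token, which is why the output looked identical across runs.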
Not sure but this could be related to #2421
Try to use Q4_K, Q4_1 or Q5_K quantization if possible and see if the issue persists.
Also, no need to add the trailing space in your prompt. Let me know the results of this experiment + new log file(s)
This issue was closed because it has been inactive for 14 days since being marked as stale.
I fine-tuned a llama2 model using PEFT LoRA, then merged the model and saved it to disk. I added a special token <|end|> and trained on it. If I do inference using the Hugging Face model API, it gives me good results.
However, since llama.cpp does not support special tokens yet, I changed the eos_token_id in the config.json file to that of <|end|>. It stopped the output after the answer, but weird black dots and sometimes special characters appear, which is not the case with Hugging Face. You can see the screenshot below.
What could be the reason for this? Do I have to play with the parameters, or is llama.cpp's performance affected when the weights are converted from HF to GGUF format? I am using the quantized model.