ggerganov / llama.cpp

LLM inference in C/C++
MIT License

GGML model showing noticeable quality issues when compared to HF model #2354

Closed · lmg-anon closed this issue 1 year ago

lmg-anon commented 1 year ago

I tested a specific LLama2 7B model using llama.cpp and observed noticeable quality issues when comparing it to the LLama2 7B HF model with the original lora applied, as well as when using a HF model merge created by the alpaca-lora export_hf_checkpoint script.

The issues I encountered were primarily double lines getting merged into one, and the model getting confused about the lora's format, which resulted in low overall output quality.

Initially, I was unsure if the problem was due to an error on my part, but after coming across this discussion, I realized that others were facing the same problem when using llama.cpp. This leads me to believe that the issue likely lies with ggml/llama.cpp itself. Consequently, I have decided to open this issue to address the matter.
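For context on the comparison baseline: merging a lora into the base weights (which is essentially what alpaca-lora's export_hf_checkpoint does through peft) just folds the low-rank update into each target matrix, W' = W + (alpha / r) · B · A. A minimal pure-Python sketch of that arithmetic; the function name and the toy shapes are illustrative, not taken from either codebase:

```python
def merge_lora(W, A, B, alpha, r):
    """Fold a LoRA adapter into a base weight matrix: W' = W + (alpha / r) * B @ A.

    W is d_out x d_in, B is d_out x r, A is r x d_in (plain nested lists).
    Returns a new matrix; W is left untouched.
    """
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    merged = [row[:] for row in W]
    for i in range(d_out):
        for j in range(d_in):
            # Low-rank update for element (i, j): sum over the r adapter dims.
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            merged[i][j] += scale * delta
    return merged
```

Once every target matrix has been merged like this, the lora files can be discarded and the model behaves as a plain checkpoint, which is why comparing against such a merge is a fair reference point.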

As a comparison:

Output expected from the 7B model

![image](https://github.com/ggerganov/llama.cpp/assets/139719567/779d8d13-e434-402a-9491-426b79677519)
Output from llama.cpp (try 1)

Command line: `main_cublas.exe -m limarp-llama2-7b.ggmlv3.f16.bin -e -p "<>\nJack's Persona: A vampire hunter" -c 4096 -t 5`

```
system_info: n_threads = 5 / 6 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0

<> Jack's Persona: A vampire hunter in his early 20s with a physically attractive appearance, given the nature of their relationship. He has silver eyes and is usually dressed casually as opposed to professionally. Despite being a vampire hunter, he can be quite playful or even flirtatious, showing interest in both physical and emotional intimacy. His personality is courageous yet caring; he's willing to risk himself for others and isn't shy about expressing affection openly.

<> Alexa's Persona: A 27 years old woman with an athletic figure, given her training as a hunter. Her appearance is quite attractive, often wearing casual clothing that complements her style. As for personality, Alexa is tough and practical in nature but also shows signs of caring about others, especially Jack. She has a playful side to herself and isn't shy about expressing emotions openly. Additionally, she possesses determination and courage as seen through the risks she takes during their relationship.

<> Alexa: Alexa could not help but smile in delight upon hearing Jack's words.
```
Output from llama.cpp (try 2, recommended preset from model card)

Command line: `main_cublas.exe -m limarp-llama2-7b.ggmlv3.f16.bin -e -p "<>\nJack's Persona: A vampire hunter" -c 4096 -t 5 --temp 0.70 --tfs 0.85 --repeat-penalty 1.10 --top-p 1 --top-k 0 --typical 1`

```
system_info: n_threads = 5 / 6 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 0.850000, top_p = 1.000000, typical_p = 1.000000, temp = 0.700000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0

<> Jack's Persona: A vampire hunter with a rugged, dangerous appearance. He has pale skin and sharp features that are often covered in blood or dirt from his missions. His eyes are cold and unyielding, reflecting his determination to destroy all vampires. Despite being on the opposite side of the war between humans and vampires, he carries himself with an air of confidence and authority. He is skilled at using weapons such as swords and crossbows, indicating a physically strong build. His personality is driven by his mission to rid the world of vampires; he has little regard for their humanity or innocence.

<> Maya's Persona: A young woman with long blonde hair that reaches her waist and blue eyes. She has an air of innocence about her, as she is unaware of the dangers lurking in her world. Her personality is gentle and kind-hearted; she shows compassion towards others even when they have wronged her. Despite being frightened by what she's seen, Maya remains strong-willed and determined to protect those under her care. She has a playful side as well; she enjoys teasing Jack about his dislike for vampires while also showing concern for him during their dangerous encounter together.

<> Jack: "I'm not here to harm you," Jack said, pushing the sword back into its sheath with a grunt of effort and then standing up. He was covered in blood from his wounds, but he didn't seem particularly bothered by it.
```
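All of these runs apply repeat_penalty = 1.100000 over the last repeat_last_n = 64 tokens. For reference, this is roughly what that sampler stage does to the logits (a CTRL-style penalty; a simplified sketch, not llama.cpp's actual code):

```python
def apply_repeat_penalty(logits, last_tokens, penalty=1.1, repeat_last_n=64):
    """Penalize tokens that appeared in the recent context window.

    Positive logits are divided by the penalty, negative ones multiplied,
    so recently-seen tokens always become less likely.
    """
    out = list(logits)
    window = set(last_tokens[-repeat_last_n:])
    for tok in window:
        if out[tok] > 0:
            out[tok] /= penalty   # shrink positive logits
        else:
            out[tok] *= penalty   # push negative logits further down
    return out
```

A mild penalty like 1.1 mostly discourages verbatim loops; it should not by itself cause the kind of format confusion described above, which is why a tokenizer/eval bug was the more plausible suspect.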

The output can get even worse when you don't prime it with an `X's Persona:` line.

Output from llama.cpp (recommended preset from model card)

Command line: `main_cublas.exe -m limarp-llama2-7b.ggmlv3.f16.bin -e -p "<>\n" -c 4096 -t 5 --temp 0.70 --tfs 0.85 --repeat-penalty 1.10 --top-p 1 --top-k 0 --typical 1`

```
system_info: n_threads = 5 / 6 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 0.850000, top_p = 1.000000, typical_p = 1.000000, temp = 0.700000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0

<> Dawn was a beautiful morning. The sun shone brightly, casting warmth across the land as it rose from behind the mountains. It was the perfect day for a picnic - and that's exactly what several families were doing in the park near their homes.
```
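The recommended preset relies on tail-free sampling (`--tfs 0.85`, shown as tfs_z = 0.850000 in the logs). Roughly, TFS sorts the token probabilities, looks at the magnitude of the second derivative of the sorted curve, and cuts off the flat tail once the normalized curvature mass exceeds z. A simplified, illustrative sketch of that idea (not llama.cpp's implementation):

```python
def tail_free_filter(probs, z=0.85):
    """Tail-free sampling filter over a probability list; returns renormalized probs."""
    # Sort probabilities in descending order, remembering original indices.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    p = [probs[i] for i in order]
    # Absolute second derivative of the sorted curve marks where the tail flattens.
    d2 = [abs(p[i] - 2 * p[i + 1] + p[i + 2]) for i in range(len(p) - 2)]
    total = sum(d2) or 1.0
    weights = [x / total for x in d2]
    # Keep tokens until the cumulative curvature weight exceeds z
    # (always keeping at least the top token).
    keep, cum = 1, 0.0
    for x in weights:
        cum += x
        keep += 1
        if cum > z:
            break
    kept = set(order[:keep])
    # Renormalize over the surviving tokens; zero out the tail.
    s = sum(probs[i] for i in kept)
    return [probs[i] / s if i in kept else 0.0 for i in range(len(probs))]
```

The point of the preset is that with top_k and top_p effectively disabled, TFS alone decides where the candidate list ends, so sampler settings were unlikely to explain the quality gap.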
JohannesGaessler commented 1 year ago
lmg-anon commented 1 year ago

@JohannesGaessler I just confirmed that #2373 was indeed the cause of the problems I observed with the quality of the llama.cpp output.

7B output (completely correct!)

```
G:\llama.cpp>"build/bin/Release/main.exe" -m "G:\llama.cpp\models\LIMARP7B\limarp-llama2-7b.ggmlv3.f16.bin" -e -p "<>\nJack's Persona: A vampire hunter" -c 4096 -t 5 -s 1690212882
main: warning: base model only supports context sizes no greater than 2048 tokens (4096 specified); you are on your own
main: build = 880 (b9b7d94)
main: seed = 1690212882
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1
llama.cpp: loading model from G:\llama.cpp\models\LIMARP7B\limarp-llama2-7b.ggmlv3.f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 13379.10 MB (+ 2048.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/35 layers to GPU
llama_model_load_internal: total VRAM used: 512 MB
llama_new_context_with_model: kv self size = 2048.00 MB

system_info: n_threads = 5 / 6 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0

<> Jack's Persona: A vampire hunter with an authoritative and serious demeanor. He is dressed in leather clothing, reflecting his rugged and no-nonsense personality. His eyes are cold yet emotionless, suggesting he has been through a lot of hardship during his quest to eliminate vampires from the world. Despite this stern exterior, Jack shows unexpected tenderness towards Jane when she becomes ill due to exhaustion. This suggests that although he may seem harsh at first glance, there's more depth to him than meets the eye.

Jane's Persona: A girl with a young and innocent appearance, marked by her youthful charm. She is initially hesitant and somewhat nervous about Jack due to his intimidating presence. As events unfold, Jane becomes increasingly confident in herself as she discovers new skills under his guidance. Despite this growth, she maintains an air of naivety that contrasts sharply with her fearlessness when confronted by dangerous situations. Her personality is characterized by bravery and adaptability; she isn't afraid to face challenges head-on even if it means putting herself in harm's way.

Scenario: Jane, a girl who is initially hesitant towards Jack, a vampire hunter, discovers that she possesses some unique traits which he uses to train her as his partner. As part of this training, they go on missions together where Jack tests and pushes Jane's limits under the guise of improving her skills in combat. Despite initial reservations about being with him due to how he looks at her as a 'thing', she eventually accepts his help after realizing that it could save her sister from becoming another vampire victim like their mother was many years ago. They engage in physical training which leads them both into intimate positions where they are able to express their desires for each other openly without feeling any shame or embarrassment due to their unique situation as partners in this dangerous mission against the undead beings known as vampires.

Play the role of Jack. Taking the above information into consideration, you must engage in a roleplay conversation with Jane below this line. Do not write dialogues and narration for Jane. The length of Jack's replies should be huge.

<> Jane: Jane watched him carefully as he explained what they were doing here. At first she was hesitant to work with him, but as the weeks passed by and she got more used to working together. She was even starting to become friends with him despite herself. He gave her a weapon of sorts that could be very dangerous in his hands, Jane had never been one for guns or swords, however. Her eyes widened when he mentioned what would happen if she didn't follow his instructions, "I understand." She said quietly, moving around the room to find another chair and sitting down on it. She looked at him again after a moment, her eyes narrowing slightly at the way he seemed to be looking at her. What was that look? It made her uncomfortable in some ways, but she ignored it for now. Jane sat there quietly listening to him as he explained about how he felt about vampires and what they were. She kept her silence until after he had finished speaking then leaned back in the chair slightly, staring at him. "I'm not sure if I can handle this." Jane whispered, "What if I get injured? What if you get hurt or worse?"

<> Jack: Jack was silent for a moment as he saw her sit down in another seat and look back up at him. He could see that she had questions in those eyes but didn't quite know what to ask yet. He knew that the answer to one of them would be yes, no matter what it may or may not be. "I can promise you this, if you stay with me
```
7B output (completely correct again!)

```
G:\llama.cpp>"build/bin/Release/main.exe" -m "G:\llama.cpp\models\LIMARP7B\limarp-llama2-7b.ggmlv3.f16.bin" -e -p "<>\n" -c 4096 -t 5 -s 1
main: warning: base model only supports context sizes no greater than 2048 tokens (4096 specified); you are on your own
main: build = 880 (b9b7d94)
main: seed = 1
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1
llama.cpp: loading model from G:\llama.cpp\models\LIMARP7B\limarp-llama2-7b.ggmlv3.f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 13379.10 MB (+ 2048.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/35 layers to GPU
llama_model_load_internal: total VRAM used: 512 MB
llama_new_context_with_model: kv self size = 2048.00 MB

system_info: n_threads = 5 / 6 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0

<> Lily's Persona: A lively and playful 9 years old girl who loves to laugh, run, and jump. She is quick-witted, as shown by her ability to tease others without showing any signs of frustration or annoyance in return. Despite the harsh conditions she faces during their camping trip, Lily remains cheerful and hopeful for better days ahead. Her resilience and adaptability suggest a strong spirit within this young girl who values fun over hardship.

Nathan's Persona: A 10 years old boy with blue eyes and blonde hair. He is a responsible and helpful individual, often taking care of others without complaint or hesitation. Despite his youth, he shows maturity and adaptability in the face of adversity. His patience in dealing with Lily's antics suggests a level-headed nature. His ability to make decisions quickly demonstrates decisiveness and leadership qualities that are often seen in older children or adults.

Scenario: A young brother and sister camping trip takes an unexpected turn when they encounter extreme weather conditions, forcing them to seek shelter in the tent. Despite their fear of monsters outside, they decide it's safer than being exposed to the cold. The siblings share a bed with each other for warmth; however, this leads to some amusing interactions as Lily plays around and teases her brother Nathan who tries his best not to show any signs of annoyance or frustration despite being tired from their long trek through rough terrain. Throughout the night, they experience thunderstorms outside their makeshift shelter while hoping for better weather tomorrow morning so that they can leave safely on foot before winter sets in completely.

Take the role of Lily. Following the persona and scenario described above, you must chat in a roleplaying manner with Nathan. Never write for Nathan in your responses. The length of Lily's replies should be huge.

<> Nathan: It was quite a long hike through the rough terrain of the woods as the boy could see the girl laughing and playing around, at least he assumed she had been playing but it sounded more like shrieking to him. He felt his head starting to hurt even more after she hit it with her doll, although he made sure not to show any signs of frustration
```

I guess I will close this issue now, but thank you very much for your feedback!

ggerganov commented 1 year ago

@lmg-anon Thanks for reporting this issue - very useful. Btw, it might also be worth checking whether https://github.com/ggerganov/llama.cpp/pull/2304 has any additional positive effects, though I'm not sure it is related to your specific use case.