Closed: Ross-Fan closed this issue 4 months ago.
Looks like you ran out of context.
Any more tips? We append the previous conversation turns and roles to the messages JSON array in the HTTP body and pass them to the server.
Beyond increasing the server's context size (-c ...) or using smaller models, not really. Phi-3 models are memory hungry since they don't use GQA; see https://old.reddit.com/r/LocalLLaMA/comments/1cdhe7o/gemma117b_is_memory_hungry_and_so_is_phi3mini/
Medium is probably going to be worse than mini when it comes to ctx size.
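(For reference, a minimal sketch of the request pattern described above, assuming the llama.cpp server runs locally on port 8080 and exposes its OpenAI-compatible endpoint; the port and message contents are illustrative. Every turn appended to the messages array is re-sent and counts against the context set with -c.)

```bash
# Hypothetical request: all previous turns appended to "messages"
# consume part of the context window configured with -c.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system",    "content": "You are a helpful assistant."},
          {"role": "user",      "content": "First question ..."},
          {"role": "assistant", "content": "First answer ..."},
          {"role": "user",      "content": "Second question ..."}
        ]
      }'
```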
I encountered a similar issue: I only used about 2048 tokens with Phi-3 and it still happens.
Thanks for your answer. In my experiment the model is "phi-3-medium-128k" and -c is 4096, but I still have the problem: the output becomes repetitious at the second question. The version of llama.cpp is commit 549279d8049d78620a2b081e26edb654f83c3bbd:
commit 549279d8049d78620a2b081e26edb654f83c3bbd (HEAD -> master, origin/master, origin/HEAD)
Author: Georgi Gerganov ggerganov@gmail.com
Date: Mon Jun 3 08:34:43 2024 +0300
llama : avoid double token-to-piece cache (#7654)
ggml-ci
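(A minimal launch sketch for the setup described above, assuming a local fp16 GGUF of the medium-128k model; the binary name, path, layer count, and port are assumptions and may differ in your build. The server only allocates the context given by -c, regardless of the 128k the model supports, so a multi-turn chat has to fit inside that value.)

```bash
# Illustrative only: -c bounds the usable context even for the 128k model,
# so all appended conversation turns must fit within it.
./server -m ../models/phi-3-medium-128k-instruct.fp16.gguf \
         -c 8192 --n-gpu-layers 40 --host 0.0.0.0 --port 8080
```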
It seems something goes wrong when handling a longer context? I'm not sure. Do you have the same issue when working with llama-2?
As far as I have tested, this 2048 issue seems to happen only on the Phi-3 family (including mini 4K and mini 128K).
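(If it helps to reproduce, a rough sketch of pushing a single run past 2048 tokens with the CLI; the model path, prompt file, and token counts are assumptions.)

```bash
# Assumed paths and values: with -c 4096 and -n 3000, generation should
# cross the 2048-token mark where the garbled output is reported.
./main -m ../models/Phi-3-mini-4k-instruct-q4.gguf \
       -c 4096 -n 3000 -f long-prompt.txt
```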
For me the problems are happening on Mixtral and Zephyr, at least with the latest releases - the output is nonsense.
I am currently using a commit from 3 months ago and that version is fine.
I tried a llama-2 model with a recent version of llama.cpp ("commit":"549279d8"), and I found its output becomes weird even with llama-2. I'm not sure if there is any hiccup in my environment, but when I switched to vLLM, everything is OK.
I started the server like this:
./server.llama -m ../models/llama-2-7b-chat.Q8_0.gguf --host 0.0.0.0 --port 80 --parallel 8 --metrics --n-gpu-layers 32 -c 2048 > server.log 2>&1 &
The GGUF file is from https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main
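(As a cross-check, a sketch of running the same GGUF directly through the CLI, bypassing the HTTP server; the prompt is illustrative. If this output is also garbled, the GGUF or its conversion is the likely culprit rather than the server path.)

```bash
# Same model, no server: helps separate a chat-template/server issue
# from a broken or outdated GGUF.
./main -m ../models/llama-2-7b-chat.Q8_0.gguf -c 2048 --n-gpu-layers 32 \
       -p "[INST] What is the capital of France? [/INST]" -n 64
```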
Maybe you can check whether an older version has this issue?
@Ross-Fan That llama-2 GGUF is definitely outdated, which explains the performance degradation for that model. Please try converting it yourself with the recent script. @Amadeus-AI Same question to you: when were the models created? If they are older than about 2 months, they most likely need to be remade.
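(For reference, a rough sketch of the suggested re-conversion, assuming the original Hugging Face weights are available locally; paths and binary names are illustrative and may differ between builds.)

```bash
# Re-convert from the original HF checkpoint with the current script,
# then quantize; older third-party GGUFs can miss newer metadata fixes.
python convert-hf-to-gguf.py ../models/Llama-2-7b-chat-hf \
       --outfile llama-2-7b-chat.fp16.gguf --outtype f16
./quantize llama-2-7b-chat.fp16.gguf llama-2-7b-chat.Q8_0.gguf Q8_0
```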
@Galunid https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf is quite new. I also tried a GGUF created 14 hours ago, https://huggingface.co/kaushiksiva07/Phi-3-mini-4k-instruct-Q4_K_M-GGUF, and I also tried converting it myself with the latest script. It also failed after 2048 tokens.
Older versions don't have this issue; the commit I referenced from 3 months ago is fine. It seems like a definite performance degradation.
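(Since a known-good commit exists, one way to pin down where the behaviour changed is a bisect between it and master; the hash below is a placeholder.)

```bash
# <known-good-commit> stands in for the 3-month-old commit;
# rebuild and re-test at each step git proposes.
git bisect start
git bisect bad HEAD
git bisect good <known-good-commit>
# after each build/test run: git bisect good   (or: git bisect bad)
```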
@Amadeus-AI I'm sorry, my bad, I meant to reply to @eamonnmag's comment, not yours
Sorry, my application for llama-2-7b access was rejected by Meta... 😭 So how about we re-focus on the Phi-3 issue?
I used the current version ("build":3080, "commit":"3b38d486") and converted microsoft/Phi-3-medium-128k-instruct to a GGUF with fp16; it still outputs nonsense words.
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
We used convert-hf-to-gguf.py to convert the original microsoft/Phi-3-medium-128k-instruct model to an fp16 GGUF (the fp16 GGUF is about 27GB). We then loaded the GGUF and chatted with it, but the output contains nonsense words/code, as shown in the attached picture.
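(A sketch of the conversion step described above, assuming the original Hugging Face checkpoint is in ../models; the paths are illustrative.)

```bash
# fp16 conversion of the original checkpoint; the resulting GGUF is
# roughly 27GB, matching the size mentioned above.
python convert-hf-to-gguf.py ../models/Phi-3-medium-128k-instruct \
       --outfile phi-3-medium-128k-instruct.fp16.gguf --outtype f16
```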
Name and Version
./main --version
version: 3053 (0541f062)
built with cc (Debian 10.2.1-6) 10.2.1 20210110 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output