Closed: Ross-Fan closed this issue 4 months ago.
Looks like you ran out of context.
Any more tips? We append the previous conversation turns and roles to the messages JSON array in the HTTP body and pass them to the server.
Beyond increasing the server's context size (-c ...) or using smaller models, not really. Phi-3 models are memory hungry since they don't use GQA; see https://old.reddit.com/r/LocalLLaMA/comments/1cdhe7o/gemma117b_is_memory_hungry_and_so_is_phi3mini/
Medium is probably going to be worse than mini when it comes to ctx size.
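(For reference, a minimal sketch of the request pattern described above, assuming the llama.cpp server runs locally on port 8080 and exposes its OpenAI-compatible endpoint; the port and message contents are illustrative. Every turn appended to the messages array is re-sent and counts against the context set with -c.)

```bash
# Hypothetical request: all previous turns appended to "messages"
# consume part of the context window configured with -c.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system",    "content": "You are a helpful assistant."},
          {"role": "user",      "content": "First question ..."},
          {"role": "assistant", "content": "First answer ..."},
          {"role": "user",      "content": "Second question ..."}
        ]
      }'
```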
I encountered a similar issue: I only used about 2048 tokens with Phi-3 and it still happens.
Thanks for your answer. In my experiment the model is "phi-3-medium-128k" and -c is 4096, but I still have the problem: the output becomes repetitious at the second question. The version of llama.cpp is commit 549279d8049d78620a2b081e26edb654f83c3bbd:
commit 549279d8049d78620a2b081e26edb654f83c3bbd (HEAD -> master, origin/master, origin/HEAD)
Author: Georgi Gerganov ggerganov@gmail.com
Date: Mon Jun 3 08:34:43 2024 +0300
llama : avoid double token-to-piece cache (#7654)
ggml-ci
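(A minimal launch sketch for the setup described above, assuming a local fp16 GGUF of the medium-128k model; the binary name, path, layer count, and port are assumptions and may differ in your build. The server only allocates the context given by -c, regardless of the 128k the model supports, so a multi-turn chat has to fit inside that value.)

```bash
# Illustrative only: -c bounds the usable context even for the 128k model,
# so all appended conversation turns must fit within it.
./server -m ../models/phi-3-medium-128k-instruct.fp16.gguf \
         -c 8192 --n-gpu-layers 40 --host 0.0.0.0 --port 8080
```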
It seems something goes wrong when handling a longer context? I'm not sure. Do you have the same issue when working with llama-2?
As far as I have tested, this 2048 issue seems to happen only on the Phi-3 family (including mini 4K and mini 128K).
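(If it helps to reproduce, a rough sketch of pushing a single run past 2048 tokens with the CLI; the model path, prompt file, and token counts are assumptions.)

```bash
# Assumed paths and values: with -c 4096 and -n 3000, generation should
# cross the 2048-token mark where the garbled output is reported.
./main -m ../models/Phi-3-mini-4k-instruct-q4.gguf \
       -c 4096 -n 3000 -f long-prompt.txt
```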
For me the problems are happening on Mixtral and Zephyr, at least with the latest releases - the output is nonsense.
I am currently using a commit from 3 months ago and that version is fine.
I tried a llama-2 model with a recent version of llama.cpp ("commit":"549279d8"), and I found its output becomes weird even with llama-2. I'm not sure if there is any hiccup in my environment, but when I switched to vLLM, everything is OK.
I started the server like this:
./server.llama -m ../models/llama-2-7b-chat.Q8_0.gguf --host 0.0.0.0 --port 80 --parallel 8 --metrics --n-gpu-layers 32 -c 2048 > server.log 2>&1 &
The GGUF file is from https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main
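(As a cross-check, a sketch of running the same GGUF directly through the CLI, bypassing the HTTP server; the prompt is illustrative. If this output is also garbled, the GGUF or its conversion is the likely culprit rather than the server path.)

```bash
# Same model, no server: helps separate a chat-template/server issue
# from a broken or outdated GGUF.
./main -m ../models/llama-2-7b-chat.Q8_0.gguf -c 2048 --n-gpu-layers 32 \
       -p "[INST] What is the capital of France? [/INST]" -n 64
```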
Maybe you can check whether an older version has this issue?
@Ross-Fan That llama-2 GGUF is definitely outdated, which explains the performance degradation for that model. Please try converting it yourself with the recent script. @Amadeus-AI Same question to you: when were the models created? If they are older than about 2 months, they most likely need to be remade.
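(For reference, a rough sketch of the suggested re-conversion, assuming the original Hugging Face weights are available locally; paths and binary names are illustrative and may differ between builds.)

```bash
# Re-convert from the original HF checkpoint with the current script,
# then quantize; older third-party GGUFs can miss newer metadata fixes.
python convert-hf-to-gguf.py ../models/Llama-2-7b-chat-hf \
       --outfile llama-2-7b-chat.fp16.gguf --outtype f16
./quantize llama-2-7b-chat.fp16.gguf llama-2-7b-chat.Q8_0.gguf Q8_0
```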
@Galunid https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf is quite new. I also tried a GGUF created 14 hours ago, https://huggingface.co/kaushiksiva07/Phi-3-mini-4k-instruct-Q4_K_M-GGUF, and I also tried converting it myself with the latest script. It also failed after 2048 tokens.
Older versions don't have this issue; the commit I referenced from 3 months ago is fine. It seems like a definite performance degradation.
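(Since a known-good commit exists, one way to pin down where the behaviour changed is a bisect between it and master; the hash below is a placeholder.)

```bash
# <known-good-commit> stands in for the 3-month-old commit;
# rebuild and re-test at each step git proposes.
git bisect start
git bisect bad HEAD
git bisect good <known-good-commit>
# after each build/test run: git bisect good   (or: git bisect bad)
```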
@Amadeus-AI I'm sorry, my bad, I meant to reply to @eamonnmag's comment, not yours
Sorry, my application for llama-2-7b access was rejected by Meta... 😭 So how about we re-focus on the Phi-3 issue?
I used the current version ("build":3080, "commit":"3b38d486") and converted microsoft/Phi-3-medium-128k-instruct to a GGUF with fp16; it still outputs nonsense words.
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
We used convert-hf-to-gguf.py to convert the original microsoft/Phi-3-medium-128k-instruct model to an fp16 GGUF (the fp16 GGUF is about 27GB). We then loaded the GGUF and chatted with it, but the output contains nonsense words/code, as shown in the attached picture.
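(A sketch of the conversion step described above, assuming the original Hugging Face checkpoint is in ../models; the paths are illustrative.)

```bash
# fp16 conversion of the original checkpoint; the resulting GGUF is
# roughly 27GB, matching the size mentioned above.
python convert-hf-to-gguf.py ../models/Phi-3-medium-128k-instruct \
       --outfile phi-3-medium-128k-instruct.fp16.gguf --outtype f16
```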
Name and Version
./main --version
version: 3053 (0541f062)
built with cc (Debian 10.2.1-6) 10.2.1 20210110 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output