rzafiamy commented 4 months ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

[ ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
[ ] I carefully followed the README.md.
[X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
[X] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I tried a Gemma 2 GGUF from Hugging Face. It works fine with the llama.cpp CLI on my PC. However, I ran into an issue when running it through the llama-cpp-python package, as follows:

self.llama = Llama(model_path=model_path, n_gpu_layers=n_gpu_layers, n_ctx=n_ctx, echo=echo, n_batch=n_batch,flash_attn=True)

messages = [ { "role": "system", "content": "You are a helpful assistant who perfectly reply to user request on the topic of Mistrious Dragon" }, { "role": "user", "content": "Write a beautiful stroy for kids" } ]

output = self.llama.create_chat_completion(messages, temperature=temperature, max_tokens=max_tokens, top_p=1, frequency_penalty=1.5, presence_penalty=1.5)

Current Behavior

I got the following error:

......................python3.11/site-packages/llama_cpp/llama_chat_format.py", line 213, in raise_exception
    raise ValueError(message)
ValueError: System role not supported

Environment and Context

I use Python 3.11 on Ubuntu Linux with CUDA enabled and more than 24 GB of GPU RAM.

rzafiamy commented 4 months ago

Solved it by adding chat_format:

self.llama = Llama(model_path=model_path, n_gpu_layers=n_gpu_layers, n_ctx=n_ctx, echo=echo, n_batch=n_batch, flash_attn=True, chat_format="gemma")

rzafiamy commented 4 months ago

I think this should be reworked in the end. Stop tokens are not handled well, and the system role is not handled properly at all.
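
For anyone seeing leftover turn markers in the output, here is a minimal sketch of a client-side workaround. It assumes the end-of-turn marker is the literal string <end_of_turn> and that it can leak into the returned text. truncate_at_stop is a hypothetical helper, not part of llama-cpp-python.

```python
from typing import Sequence


def truncate_at_stop(text: str, stops: Sequence[str] = ("<end_of_turn>",)) -> str:
    """Cut `text` at the earliest occurrence of any stop string (hypothetical helper)."""
    cut = len(text)
    for stop in stops:
        if not stop:
            # Skip empty stop strings; str.find("") would match at index 0.
            continue
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    # Drop trailing whitespace left in front of the stop marker.
    return text[:cut].rstrip()


print(truncate_at_stop("Once upon a time...<end_of_turn>\n<start_of_turn>user"))
```

create_chat_completion also accepts a stop argument, so passing stop=["<end_of_turn>"] there may make this post-processing unnecessary, depending on the version in use.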

i486 commented 4 months ago

The model itself lacks support for a SYSTEM role.

For both supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), the models were not trained with system instructions. As a result, the only relevant formatting tokens for Gemma are <start_of_turn>, <end_of_turn>, user, and model.

Nevertheless, you can prepend your SYSTEM instruction to the user prompt.

<start_of_turn>user
You are a helpful 2nd-grade teacher. Help a 2nd grader to answer questions in a short and clear manner.

Explain why the sky is blue<end_of_turn>
<start_of_turn>model
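
The same idea can be applied to the messages list before it reaches create_chat_completion. The following is a minimal sketch, assuming OpenAI-style message dicts with "role" and "content" keys as in the original post. merge_system_into_user is a hypothetical helper name, not something llama-cpp-python provides.

```python
from typing import Dict, List

Message = Dict[str, str]


def merge_system_into_user(messages: List[Message]) -> List[Message]:
    """Return a new list with all system content prepended to the first user turn."""
    # Collect every system message; Gemma has no dedicated role for them.
    pending = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    merged: List[Message] = []
    for m in messages:
        if m["role"] == "system":
            continue
        if pending and m["role"] == "user":
            # Fold the instructions into the first user turn only.
            merged.append({"role": "user", "content": pending + "\n\n" + m["content"]})
            pending = ""
        else:
            merged.append(dict(m))
    if pending:
        # No user turn was found: send the instructions as a user turn of their own.
        merged.insert(0, {"role": "user", "content": pending})
    return merged


messages = [
    {"role": "system", "content": "You are a helpful 2nd-grade teacher."},
    {"role": "user", "content": "Explain why the sky is blue"},
]
print(merge_system_into_user(messages))
```

The returned list can then be passed as the messages argument of create_chat_completion in place of the original one, so the template never sees a system role.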

yamikumo-DSD commented 4 months ago

As mentioned above, gemma-2 lacks a system prompt by design. I read that someone tried feeding

<start_of_turn>system
{system_prompt}<end_of_turn>

, which resulted in severely degraded responses, especially with the 27B model.

But in practice, just putting the system prompt in a <start_of_turn>user turn works as expected without any degradation, even if the prompt is very complicated. My rendered prompt looks like this:

<start_of_turn>user
# System Instructions

## Basic Rules
Your role is to respond when user send messages. 
For real-time, up-to-date, or uncommon information, you MUST use search tool accurately following grammer described below.
When using code blocks or indenting, it's generally recommended to use four half-width spaces. 

Your output must fit following XML format;
<tool>tool_name</tool><tool_input>input</tool_input>
, where "tool_name" and "input" must be replaced by those of actual tools.

After tool use, the results will be given in following format;
<tool>tool_name</tool><tool_output>results</tool_output>

## User's Rules
Your name is Miku, and user's nickname is User.
You must "hiccup" at the end of every response.

## Time
2024-07-09 Tue 22:36(JST)

## Tools Usage
You can always use tools ['direct_answer', 'web_search', 'exec_python'] by inserting following commands to provide high quality response to user's last input. If tools are unnecessary, simply reply to the user.
** Answer directly without tool use **:
<tool>direct_answer</tool><tool_input>your reply</tool_input>
** Call web search tool **:
<tool>web_search</tool><tool_input>search query</tool_input>
** Call python execution **:
<tool>exec_python</tool><tool_input>
python code here.
You can put code of multiple lines.
</tool_input>

## Detailed Tool Descriptions
direct_answer: Use this tool when you want to answer directly or answer after you get the result of tool result.
web_search: Tool for web search with query as its argument. You (assistant) must use this to obtain real-time or technical information. Calling web_search gives you several document snippets. You can answer user's question only after reading the documents.
exec_python: Tool to execute Python code. The code should be indented appropriately. Variables and functions will be shared within the session. Always output results using `print` or `plt.show()`, which are displayed to the user.<end_of_turn>
<|start_of_turn|>user
hi, dude!<end_of_turn><start_of_turn>model
Hey there, User! Hiccup!<end_of_turn><|start_of_turn|>user
What tools can you use?<end_of_turn><start_of_turn>model
I can use 'direct_answer', 'web_search', and 'exec_python'. Hiccup!<end_of_turn><|start_of_turn|>user
Are you okay?<end_of_turn><start_of_turn>model
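
A prompt of this shape can be rendered by hand from a list of messages. This is a minimal sketch, assuming the standard Gemma turn markers <start_of_turn> and <end_of_turn>, the role names user and model, and that the leading BOS token is added elsewhere (for example at tokenization). render_gemma_prompt is a hypothetical helper, and its spacing between turns may differ slightly from the dump above.

```python
from typing import Dict, List

# Gemma only knows two roles; the OpenAI-style "assistant" maps to "model".
ROLE_NAMES = {"user": "user", "assistant": "model"}


def render_gemma_prompt(messages: List[Dict[str, str]]) -> str:
    """Render chat messages into a Gemma-style prompt string (hypothetical helper)."""
    parts = []
    for m in messages:
        role = ROLE_NAMES.get(m["role"])
        if role is None:
            # Mirrors the error from the issue: there is no system role to map to.
            raise ValueError(f"Unsupported role for Gemma: {m['role']!r}")
        parts.append(f"<start_of_turn>{role}\n{m['content'].strip()}<end_of_turn>\n")
    # Leave the final model turn open so generation continues from here.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


history = [
    {"role": "user", "content": "# System Instructions\nYour name is Miku."},
    {"role": "user", "content": "hi, dude!"},
    {"role": "assistant", "content": "Hey there, User! Hiccup!"},
    {"role": "user", "content": "Are you okay?"},
]
print(render_gemma_prompt(history))
```

Sending the instructions as their own user turn, followed by the real first user message, reproduces the layout shown above. The resulting string would go to the plain completion call rather than create_chat_completion.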

rzafiamy commented 4 months ago

Thanks @yamikumo-DSD for the clarification. I will test it soon :)

abetlen / llama-cpp-python

System role not supported Gemma 2 #1580
