Open rzafiamy opened 4 months ago
You can solve it by adding `chat_format="gemma"`:

self.llama = Llama(model_path=model_path, n_gpu_layers=n_gpu_layers, n_ctx=n_ctx, echo=echo, n_batch=n_batch, flash_attn=True, chat_format="gemma")
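For reference, a minimal sketch of that initialization. The model file name and parameter values below are placeholders, not taken from the original report, and depending on your llama-cpp-python version the built-in Gemma handler may still ignore a system message, so the prepend-to-user workaround described further down can remain useful.

```python
from llama_cpp import Llama

# Placeholder path and values, for illustration only.
llm = Llama(
    model_path="gemma-2-9b-it-Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=8192,
    flash_attn=True,
    chat_format="gemma",  # force the Gemma chat template instead of the one embedded in the GGUF
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a short story for kids"}],
    max_tokens=256,
)
print(output["choices"][0]["message"]["content"])
```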
I think it should finally be reworked. Stop tokens are not handled well, and the system role is not supported at all.
The model itself lacks support for a SYSTEM role.
For both supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), the models were not trained with system instructions. As a result, the only relevant formatting tokens for Gemma are `<start_of_turn>`, `<end_of_turn>`, `user`, and `model`.
Nevertheless, you can prepend your SYSTEM instruction to the user prompt, for example:
<start_of_turn>user
You are a helpful 2nd-grade teacher. Help a 2nd grader to answer questions in a short and clear manner.
Explain why the sky is blue<end_of_turn>
<start_of_turn>model
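In llama-cpp-python you can do the same at the message level by folding the system message into the first user turn before calling create_chat_completion. A minimal sketch; the helper name merge_system_into_user is mine, not part of the library:

```python
def merge_system_into_user(messages):
    """Fold a leading system message into the first user turn,
    since Gemma's template has no system role."""
    if not messages or messages[0]["role"] != "system":
        return messages
    system, rest = messages[0], list(messages[1:])
    if rest and rest[0]["role"] == "user":
        rest[0] = {
            "role": "user",
            "content": system["content"] + "\n\n" + rest[0]["content"],
        }
        return rest
    # No user turn to merge into: demote the system message to a user turn.
    return [{"role": "user", "content": system["content"]}] + rest


# Usage (llm is a llama_cpp.Llama instance):
# output = llm.create_chat_completion(merge_system_into_user(messages))
```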
As mentioned above, Gemma 2 lacks a system prompt by design. I read that someone tried to feed
<start_of_turn>system
{system_prompt}<end_of_turn>
, which resulted in severely degraded responses, especially with the 27B model.
But in practice, simply putting the system prompt inside the <start_of_turn>user turn works as expected without any degradation, even if it's very complicated.
My rendered prompt looks like this:
<start_of_turn>user
# System Instructions
## Basic Rules
Your role is to respond when the user sends messages.
For real-time, up-to-date, or uncommon information, you MUST use the search tool, accurately following the grammar described below.
When using code blocks or indenting, it's generally recommended to use four half-width spaces.
Your output must fit the following XML format:
<tool>tool_name</tool><tool_input>input</tool_input>
, where "tool_name" and "input" must be replaced by those of actual tools.
After tool use, the results will be given in following format;
<tool>tool_name</tool><tool_output>results</tool_output>
## User's Rules
Your name is Miku, and user's nickname is User.
You must "hiccup" at the end of every response.
## Time
2024-07-09 Tue 22:36(JST)
## Tools Usage
You can always use the tools ['direct_answer', 'web_search', 'exec_python'] by inserting the following commands to provide a high quality response to the user's last input. If tools are unnecessary, simply reply to the user.
** Answer directly without tool use **:
<tool>direct_answer</tool><tool_input>your reply</tool_input>
** Call web search tool **:
<tool>web_search</tool><tool_input>search query</tool_input>
** Call python execution **:
<tool>exec_python</tool><tool_input>
python code here.
You can put multiple lines of code.
</tool_input>
## Detailed Tool Descriptions
direct_answer: Use this tool when you want to answer directly, or to answer after you have received a tool's result.
web_search: Tool for web search with a query as its argument. You (the assistant) must use this to obtain real-time or technical information. Calling web_search gives you several document snippets. You can answer the user's question only after reading the documents.
exec_python: Tool to execute Python code. The code should be indented appropriately. Variables and functions will be shared within the session. Always output results using `print` or `plt.show()`, which are displayed to the user.<end_of_turn>
<start_of_turn>user
hi, dude!<end_of_turn><start_of_turn>model
Hey there, User! Hiccup!<end_of_turn><start_of_turn>user
What tools can you use?<end_of_turn><start_of_turn>model
I can use 'direct_answer', 'web_search', and 'exec_python'. Hiccup!<end_of_turn><start_of_turn>user
Are you okay?<end_of_turn><start_of_turn>model
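If you render the prompt yourself like this, you can bypass the chat template entirely and feed the raw string through create_completion. A minimal sketch with a placeholder model path and a heavily shortened prompt; the choice of stop token is an assumption on my side:

```python
from llama_cpp import Llama

llm = Llama(model_path="gemma-2-27b-it-Q4_K_M.gguf", n_ctx=8192, n_gpu_layers=-1)

# "System" instructions live inside the first user turn, as in the example above.
prompt = (
    "<start_of_turn>user\n"
    "# System Instructions\n"
    "Your name is Miku. You must \"hiccup\" at the end of every response.\n\n"
    "hi, dude!<end_of_turn>\n"
    "<start_of_turn>model\n"
)

out = llm.create_completion(
    prompt,
    max_tokens=256,
    stop=["<end_of_turn>"],  # stop when the model finishes its turn
)
print(out["choices"][0]["text"])
```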
Thanks @yamikumo-DSD for the clarification. I will test it soon :)
Expected Behavior
I tried a Gemma 2 GGUF from Hugging Face. It works fine with the llama.cpp CLI on my PC. However, I got issues while running it through the llama-cpp-python package, as follows:
self.llama = Llama(model_path=model_path, n_gpu_layers=n_gpu_layers, n_ctx=n_ctx, echo=echo, n_batch=n_batch, flash_attn=True)
messages = [
    {"role": "system", "content": "You are a helpful assistant who perfectly replies to user requests on the topic of the Mysterious Dragon"},
    {"role": "user", "content": "Write a beautiful story for kids"},
]
output = self.llama.create_chat_completion(messages, temperature=temperature, max_tokens=max_tokens, top_p=1, frequency_penalty=1.5, presence_penalty=1.5)
Current Behavior
I got the following error:
......................python3.11/site-packages/llama_cpp/llama_chat_format.py", line 213, in raise_exception
    raise ValueError(message)
ValueError: System role not supported
Environment and Context
I use Python 3.11 on Ubuntu Linux with CUDA enabled, and more than 24 GB of GPU RAM.