NousResearch / Hermes-Function-Calling


Add support for Hermes Pro function calling to llama-cpp-python #5

Open Benjoyo opened 4 months ago

Benjoyo commented 4 months ago

Hey, thank you so much for the great model and this repo!

Would you be willing to add support for this chat format to llama-cpp-python, so that we can use function calling (and JSON mode) with its OpenAI-compatible server?

Right now, llama-cpp-python offers the only OpenAI-compatible server with constrained/grammar-based sampling that runs on CPU, as far as I am aware. It has been very convenient to use with the functionary models, as it is plug & play with the openai client and very reliable thanks to the grammar sampling.
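
For context, this is roughly how we use it with functionary today (just a sketch; model path, port and the tool schema are placeholders):

from openai import OpenAI

# llama-cpp-python server started with something like:
#   python -m llama_cpp.server --model ./functionary.gguf --chat_format functionary
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)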

Besides functionary, there is already support for a format called chatml-function-calling, which might be similar enough to the Hermes format that it could simply be adapted instead of writing something from scratch:

https://github.com/abetlen/llama-cpp-python/blob/6eb25231e4dafeb792ffc9597c27330344c970b1/llama_cpp/llama_chat_format.py#L2045

All that would need to be added to the library is a handler like that.

Thanks!

teknium1 commented 4 months ago

I don't quite understand how it works. The only difference in format is that, after you get the result JSON from your function call, you have to return it through a tool role instead of user or assistant; a standard ChatML-compatible format should be able to handle that.

Unless I'm misunderstanding?
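
i.e. the flow would just look something like this (a sketch; function name and values are made up):

messages = [
    {"role": "user", "content": "What's the weather in Berlin?"},
    # the assistant decided to call a function
    {
        "role": "assistant",
        "content": '<tool_call>\n{"arguments": {"city": "Berlin"}, "name": "get_weather"}\n</tool_call>',
    },
    # the result JSON goes back through the "tool" role, not "user" or "assistant"
    {"role": "tool", "content": '{"temperature_c": 21, "condition": "sunny"}'},
]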

Benjoyo commented 4 months ago

> I don't quite understand how it works. The only difference in format is that, after you get the result JSON from your function call, you have to return it through a tool role instead of user or assistant; a standard ChatML-compatible format should be able to handle that.
>
> Unless I'm misunderstanding?

Ok, so using the chatml-function-calling format (yes, a ChatML-compatible format) with Hermes Pro does actually work. But that is only because the model is so good and because the handler injects a system prompt explaining its own function-calling format, which Hermes Pro then follows (or at least the grammar sampling makes it do so), even though the model was trained on a different format.
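
This is roughly how I'm invoking it (a sketch; the model path is a placeholder and the tool schema is a shortened version of the one shown in the prompt below):

from llama_cpp import Llama

llm = Llama(
    model_path="./Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf",  # placeholder path
    chat_format="chatml-function-calling",
    n_ctx=4096,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Extract all orders from the user input as a JSON list."},
        {"role": "user", "content": "Hello it's Mike. Let me get the #3 and the #12. For my friend Jeff I would like the #2."},
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "store_orders",
            # shortened version of the schema shown in the prompt below
            "parameters": {
                "type": "object",
                "properties": {"orders": {"type": "array"}},
                "required": ["orders"],
            },
        },
    }],
    tool_choice={"type": "function", "function": {"name": "store_orders"}},
)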

Here is the formatted chatml-function-calling prompt for reference:

<|im_start|>system
Extract all orders from the user input as a JSON list.

You have access to the following functions:

functions.store_orders:
{"properties": {"orders": {"description": "the orders", "items": {"properties": {"customer_name": {"description": "the name of the customer that this order is for", "type": "string"}, "number": {"description": "the order number", "type": "number"}}, "required": ["number", "customer_name"], "type": "object"}, "type": "array"}}, "required": ["orders"], "type": "object"}

You can respond to users messages with either a single message or one or more function calls.

To respond with a message begin the message with 'message:', use the following format:

message:
<message>

To respond with one or more function calls begin the message with 'functions.<function_name>:', use the following format:

functions.<function_name>:
{ "arg1": "value1", "arg2": "value2" }
functions.<function_name>:
{ "arg1": "value1", "arg2": "value2" }<|im_end|>
<|im_start|>user
Hello it's Mike. Let me get the #3 and the #12. For my friend Jeff I would like the #2.<|im_end|>
<|im_start|>assistant

Now, to get the best performance one would want to use the correct Hermes Pro format, which does not work with this handler: it forces pure JSON output (without the XML tags) for a fixed tool_choice, and it stops at a ':' for auto tool_choice, following the other format.
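
For comparison, the native Hermes Pro prompt would look roughly like this (paraphrased from this repo's examples, with the schema abbreviated):

<|im_start|>system
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
<tools> [{"type": "function", "function": {"name": "store_orders", "parameters": { ...same schema as above... }}}] </tools>
For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{"arguments": <args-dict>, "name": <function-name>}
</tool_call><|im_end|>
<|im_start|>user
Hello it's Mike. Let me get the #3 and the #12. For my friend Jeff I would like the #2.<|im_end|>
<|im_start|>assistant
<tool_call>
{"arguments": {"orders": [{"number": 3, "customer_name": "Mike"}, {"number": 12, "customer_name": "Mike"}, {"number": 2, "customer_name": "Jeff"}]}, "name": "store_orders"}
</tool_call><|im_end|>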

But looking at it more closely just now, I might be able to adapt this myself (or have Claude Opus do it lol).

interstellarninja commented 3 months ago

Hi @Benjoyo, I just had a look at the chatml-function-calling handler, and while I understand how tools are passed into the prompt as function signatures, like @teknium1 I don't understand how exactly the function call parsing is done.

For chatml-function-calling you could change the system prompt in the Jinja template, but the method for adding function signatures to the system prompt would remain the same.

# System message
        "{% if message.role == 'system' %}"
        "{{ message.content }}"
        "{% if tool_calls %}"
        "\nYou are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions."
        "\nHere are the available tools:"
        "\n<tools> {{ tools }} </tools>"
        "\n\nYou can respond to users messages with either a single message or one or more function calls."
        "\n\nTo respond with a message begin the message with 'message:', use the following format:"
        "\n\nmessage:"
        "\n<message>"
        "\nFor each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:"
        "\n<tool_call>"
        '\n{"arguments": <args-dict>, "name": <function-name>}'
        "\n</tool_call>"
        "{% endif %}"
        "<|im_end|>\n"
        "{% endif %}"

For parsing function calls, you'd have to add a parsing function such as the one below:

import ast
import json
import xml.etree.ElementTree as ET

from pydantic import BaseModel


class FunctionCall(BaseModel):
    arguments: dict
    name: str


def parse_function_calls(completion):
    function_calls = []
    # wrap the completion in a dummy root so the XML is well-formed
    root = ET.fromstring(f"<root>{completion}</root>")

    for tool_call in root.findall(".//tool_call"):
        try:
            function_call_json = json.loads(tool_call.text.strip())
            function_call = FunctionCall(**function_call_json)
            function_calls.append(function_call)
        except json.JSONDecodeError:
            try:
                # Try parsing with ast.literal_eval if json.loads fails
                function_call_json = ast.literal_eval(tool_call.text.strip())
                function_call = FunctionCall(**function_call_json)
                function_calls.append(function_call)
            except (ValueError, SyntaxError):
                pass

    return function_calls
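
For example, on a made-up Hermes-style completion it would give you:

completion = (
    '<tool_call>\n'
    '{"arguments": {"city": "Berlin"}, "name": "get_weather"}\n'
    '</tool_call>'
)
calls = parse_function_calls(completion)
print(calls[0].name)       # get_weather
print(calls[0].arguments)  # {'city': 'Berlin'}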

It looks like this is how they are parsing function calls in the functionary chat handler, which you could modify by using the function above:

    if function_call is None or (
        isinstance(function_call, str) and function_call == "auto"
    ):
        stop = "\n"
        completion: llama_types.Completion = llama.create_completion(
            prompt=prompt, stop=stop, stream=False
        )  # type: ignore
        completion_text = completion["choices"][0]["text"]
        # strip " to=functions." and ending ":"
        function_call = completion_text.split(".")[-1][:-1]
        new_prompt = prompt + completion_text + stop

https://github.com/abetlen/llama-cpp-python/blob/6eb25231e4dafeb792ffc9597c27330344c970b1/llama_cpp/llama_chat_format.py#L1287C9-L1288C60

But for the chatml-function-calling handler I couldn't understand exactly how they are parsing function calls.

Hope this helps 🙂

Benjoyo commented 3 months ago

Thanks @interstellarninja for the hints. I will have a closer look and try to make a PR soon.

interstellarninja commented 3 months ago

On another look, my guess is that they are relying entirely on grammars for parsing the function calls in both handlers.

Here's how they are defining grammars from the function signatures in the chatml-function-calling handler:

    # One or more function calls
    tool_name = text[len("functions.") :]
    tool = next((tool for tool in tools if tool["function"]["name"] == tool_name), None)
    if not stream:
        completions = []
        completions_tool_name = []
        while tool is not None:
            prompt += f"functions.{tool_name}:\n"
            try:
                grammar = llama_grammar.LlamaGrammar.from_json_schema(
                    json.dumps(tool["function"]["parameters"]), verbose=llama.verbose
                )
            except Exception as e:
                grammar = llama_grammar.LlamaGrammar.from_string(
                    llama_grammar.JSON_GBNF, verbose=llama.verbose
                )
                if llama.verbose:
                    print(
                        "Failed to parse function body as JSON schema, falling back to default grammar"
                    )
                    print(e)

So perhaps changing the system prompt and a few other things might just work, even if the function calls are wrapped in XML tags.

Benjoyo commented 3 months ago

Yeah, I think one needs to change the stop sequence to the opening tag to detect the model wanting to generate a call, then use the grammar, and then stop at the closing tag. Those stop criteria are different in the other format; the rest could be very similar or the same.
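
Something along these lines maybe (a very rough sketch, not the actual handler; it just reuses the llama-cpp-python grammar APIs from the snippets above, and the function/variable names are made up):

import json

from llama_cpp import Llama, llama_grammar


def complete_hermes_tool_call(llama: Llama, prompt: str, tool: dict) -> dict:
    # the opening tag has been detected (or forced via tool_choice),
    # so it becomes part of the prompt
    prompt += "<tool_call>\n"

    # constrain the body to the Hermes call object {"arguments": ..., "name": ...}
    call_schema = {
        "type": "object",
        "properties": {
            "arguments": tool["function"]["parameters"],
            "name": {"type": "string"},
        },
        "required": ["arguments", "name"],
    }
    try:
        grammar = llama_grammar.LlamaGrammar.from_json_schema(json.dumps(call_schema))
    except Exception:
        # same fallback as the chatml-function-calling handler
        grammar = llama_grammar.LlamaGrammar.from_string(llama_grammar.JSON_GBNF)

    completion = llama.create_completion(
        prompt=prompt,
        grammar=grammar,
        stop=["</tool_call>"],  # the closing tag is the stop criterion here
        max_tokens=512,
    )
    return json.loads(completion["choices"][0]["text"])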

interstellarninja commented 3 months ago

Hey, @adrienbrault has implemented the OpenAI function calling format for Ollama; maybe this serves as a guide even though it is written in Go. https://github.com/ollama/ollama/pull/3303