Maximilian-Winter / llama-cpp-agent

The llama-cpp-agent framework is a tool designed for easy interaction with Large Language Models (LLMs). It allows users to chat with LLMs, execute structured function calls, and get structured output. It also works with models that are not fine-tuned for JSON output and function calls.

Large portion of time spent on sample time #36

Closed. this-josh closed this issue 2 months ago.

this-josh commented 2 months ago

I'm running Llama 3 with two A40s and am finding that llama-cpp-agent has a high sample time. Using the book example, I find that the sample time for creating an object is an order of magnitude slower than calling the model directly. (I've removed the output text below.)

Is this an unavoidable consequence of this output formatting?

>>> print(structured_output_agent.create_object(Book, text))
llama_print_timings:        load time =     208.15 ms
llama_print_timings:      sample time =   11775.43 ms /   151 runs   (   77.98 ms per token,    12.82 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    3479.97 ms /   151 runs   (   23.05 ms per token,    43.39 tokens per second)
llama_print_timings:       total time =   17029.76 ms /   152 tokens
>>> print(main_model(text))

llama_print_timings:        load time =     208.15 ms
llama_print_timings:      sample time =      11.85 ms /    16 runs   (    0.74 ms per token,  1350.44 tokens per second)
llama_print_timings: prompt eval time =     155.84 ms /    83 tokens (    1.88 ms per token,   532.59 tokens per second)
llama_print_timings:        eval time =     334.33 ms /    15 runs   (   22.29 ms per token,    44.87 tokens per second)
llama_print_timings:       total time =     597.83 ms /    98 tokens
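(From these timings, sampling is 77.98 ms per token for create_object versus 0.74 ms per token for the plain call, roughly 100x, while eval time per token is essentially unchanged at about 22-23 ms.)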

Full script (I'm using the main branch of llama-cpp-agent):

from enum import Enum

from llama_cpp import Llama
from pydantic import BaseModel, Field

from llama_cpp_agent.messages_formatter import MessagesFormatterType
from llama_cpp_agent.structured_output_agent import StructuredOutputAgent

main_model = Llama(
    "./models/gguf/Meta-Llama-3-8B.Q4_K_M.gguf",
    n_gpu_layers=-1,
    use_mlock=False,
    embedding=False,
    n_threads=48,
    n_batch=2048,
    n_ctx=2048,
    last_n_tokens_size=1024,
    verbose=True,
    seed=42,
    predefined_messages_formatter_type= MessagesFormatterType.LLAMA_3,
    stream=True
)

# Example enum for our output model
class Category(Enum):
    Fiction = "Fiction"
    NonFiction = "Non-Fiction"

# Example output model
class Book(BaseModel):
    """
    Represents an entry about a book.
    """
    title: str = Field(..., description="Title of the book.")
    author: str = Field(..., description="Author of the book.")
    published_year: int = Field(..., description="Publishing year of the book.")
    keywords: list[str] = Field(..., description="A list of keywords.")
    category: Category = Field(..., description="Category of the book.")
    summary: str = Field(..., description="Summary of the book.")

structured_output_agent = StructuredOutputAgent(main_model, debug_output=True)

text = """The Feynman Lectures on Physics is a physics textbook based on some lectures by Richard Feynman, a Nobel laureate who has sometimes been called "The Great Explainer". The lectures were presented before undergraduate students at the California Institute of Technology (Caltech), during 1961–1963. The book's co-authors are Feynman, Robert B. Leighton, and Matthew Sands."""
print(structured_output_agent.create_object(Book, text))
print(main_model(text))
Maximilian-Winter commented 2 months ago

I'm not sure how much grammar-based sampling is affecting the performance, but it seems to have a huge impact. Can you send me the generated grammar itself? I just want to make sure I didn't mess up the generation of the grammar.

this-josh commented 2 months ago

Hi, I think this is what you mean:

>>> from llama_cpp_agent.gbnf_grammar_generator.gbnf_grammar_from_pydantic_models import generate_gbnf_grammar_from_pydantic_models
>>> generate_gbnf_grammar_from_pydantic_models([Book])
'root ::= (" "| "\\n") grammar-models\ngrammar-models ::= book\nbook ::= "{"  ws "\\"title\\"" ": " string ","  ws "\\"author\\"" ": " string ","  ws "\\"published_year\\"" ": " number ","  ws "\\"keywords\\"" ": " book-keywords ","  ws "\\"category\\"" ": " book-category ","  ws "\\"summary\\"" ": " string ws "}"\nbook-keywords ::= "[" ws string ("," ws string)* ws "]" \nbook-category ::= "\\"Fiction\\"" | "\\"Non-Fiction\\""'
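For readability, that is the following grammar once the newlines and escapes are expanded (the ws, string and number rules it references are not included in this snippet):

root ::= (" "| "\n") grammar-models
grammar-models ::= book
book ::= "{"  ws "\"title\"" ": " string ","  ws "\"author\"" ": " string ","  ws "\"published_year\"" ": " number ","  ws "\"keywords\"" ": " book-keywords ","  ws "\"category\"" ": " book-category ","  ws "\"summary\"" ": " string ws "}"
book-keywords ::= "[" ws string ("," ws string)* ws "]"
book-category ::= "\"Fiction\"" | "\"Non-Fiction\""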
Maximilian-Winter commented 2 months ago

@this-josh I ran some tests using the same prompt for generation with and without grammar. My results show that it is about 12 times faster without grammar. I still have to do some additional tests.

With grammar:
llama_print_timings:        load time =     307.76 ms
llama_print_timings:      sample time =   11026.43 ms /   141 runs   (   78.20 ms per token,    12.79 tokens per second)
llama_print_timings: prompt eval time =     307.34 ms /   248 tokens (    1.24 ms per token,   806.92 tokens per second)
llama_print_timings:        eval time =    5111.64 ms /   140 runs   (   36.51 ms per token,    27.39 tokens per second)
llama_print_timings:       total time =   17923.20 ms /   388 tokens

Without grammar:
llama_print_timings:        load time =     307.76 ms
llama_print_timings:      sample time =     844.00 ms /   138 runs   (    6.12 ms per token,   163.51 tokens per second)
llama_print_timings: prompt eval time =     280.42 ms /   218 tokens (    1.29 ms per token,   777.39 tokens per second)
llama_print_timings:        eval time =    3609.99 ms /   137 runs   (   26.35 ms per token,    37.95 tokens per second)
llama_print_timings:       total time =    5686.03 ms /   355 tokens
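(That is 78.20 ms versus 6.12 ms of sample time per token, about 12.8x, or equivalently 12.79 versus 163.51 tokens per second during sampling.)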

This is my full test code:

from enum import Enum

from llama_cpp import Llama
from pydantic import BaseModel, Field

from llama_cpp_agent.gbnf_grammar_generator.gbnf_grammar_from_pydantic_models import \
    generate_gbnf_grammar_and_documentation
from llama_cpp_agent.llm_prompt_template import PromptTemplate
from llama_cpp_agent.llm_settings import LlamaLLMGenerationSettings
from llama_cpp_agent.messages_formatter import MessagesFormatterType, get_predefined_messages_formatter
from llama_cpp_agent.structured_output_agent import StructuredOutputAgent

settings = LlamaLLMGenerationSettings(stream=False)
main_model = Llama(
    "../gguf-models/Meta-Llama-3-8B-Instruct.Q5_k_m_with_temp_stop_token_fix.gguf",
    n_gpu_layers=-1,
    use_mlock=False,
    embedding=False,
    n_threads=12,
    n_batch=2048,
    n_ctx=2048,
    last_n_tokens_size=1024,
    verbose=True,
    seed=42,
    stream=True
)
# Example enum for our output model
class Category(Enum):
    Fiction = "Fiction"
    NonFiction = "Non-Fiction"

# Example output model
class Book(BaseModel):
    """
    Represents an entry about a book.
    """
    title: str = Field(..., description="Title of the book.")
    author: str = Field(..., description="Author of the book.")
    published_year: int = Field(..., description="Publishing year of the book.")
    keywords: list[str] = Field(..., description="A list of keywords.")
    category: Category = Field(..., description="Category of the book.")
    summary: str = Field(..., description="Summary of the book.")

structured_output_agent = StructuredOutputAgent(main_model, llama_generation_settings=settings,
                                                messages_formatter_type=MessagesFormatterType.LLAMA_3,
                                                debug_output=False)

text = """The Feynman Lectures on Physics is a physics textbook based on some lectures by Richard Feynman, a Nobel laureate who has sometimes been called "The Great Explainer". The lectures were presented before undergraduate students at the California Institute of Technology (Caltech), during 1961–1963. The book's co-authors are Feynman, Robert B. Leighton, and Matthew Sands."""
print(structured_output_agent.create_object(Book, text))

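# Generate the GBNF grammar and the field documentation for the Book model. Only the
# documentation is used below; the grammar is not passed to the model, so sampling is unconstrained.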
grammar, documentation = generate_gbnf_grammar_and_documentation(
    [Book],
    model_prefix="Response Model",
    fields_prefix="Response Model Field",
)

sys_prompt_template = PromptTemplate.from_string(
    "You are an advanced AI agent. You are tasked to assist the user by creating structured output in JSON format.\n\n{documentation}"
)
creation_prompt_template = PromptTemplate.from_string(
    "Create an JSON response based on the following input.\n\nInput:\n\n{user_input}"
)
sys_prompt = sys_prompt_template.generate_prompt({"documentation": documentation})
msg = creation_prompt_template.generate_prompt({"user_input": text})

sys_msg = {"role": "system", "content": sys_prompt}
user_msg = {"role": "user", "content": msg}

msg_list = [sys_msg, user_msg]
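# Format the same messages with the LLAMA_3 template and run a plain completion
# (no grammar-constrained sampling) for comparison with the agent run above.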
formatter = get_predefined_messages_formatter(MessagesFormatterType.LLAMA_3)
prompt, role = formatter.format_messages(msg_list, "assistant")
settings_dic = settings.as_dict()
settings_dic["stop"] = settings_dic["stop_sequences"]
settings_dic.pop("stop_sequences")
settings_dic.pop("print_output")
main_model.reset()
print(main_model.create_completion(prompt, **settings_dic))
Maximilian-Winter commented 2 months ago

After trying out different forms of grammar, I can say it always takes about 12.5 ms per token.

this-josh commented 2 months ago

Hi @Maximilian-Winter,

Thanks for looking into this. Unfortunately, I'm a little confused by your analysis and do not see this issue as resolved. I've updated to the latest version of the package and am now using what I believe is the same model, Meta-Llama-3-8B.Q5_K_M.gguf.

You show sample times above of around 78 ms and 6 ms per token with and without grammar respectively, and I can reproduce this roughly 10x ratio using your code. But in the message after that you say about 12.5 ms per token; I'm not sure where this figure comes from.

So now we have three sample-time values, none of which is 12.5 ms per token (which would be a comparatively high figure).

So my question remains: why is the create_object sample time roughly 100x slower than just providing the text to the model?

Perhaps you could expand on your tests.

Maximilian-Winter commented 2 months ago

Sorry, I meant that I tried different versions of the grammar, optimized for performance, and the result is always around 12.5 tokens per second, not 12.5 ms per token. The reason is that grammar-based sampling takes a lot of time. The reason your main_model(text) call is so fast is that it only generates 16 tokens and only uses the text as the prompt. My code (main_model.create_completion) uses the same prompt as the structured output agent, but without grammar sampling, and it generates around 140 tokens. Generation slows down the longer the output gets. Without grammar I get 163.51 tokens per second.
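Grammar-constrained sampling has to filter each step's candidate tokens against the grammar, which is why the per-token sample time rises so sharply. A minimal sketch for isolating that overhead outside the agent, reusing main_model and prompt from the code above and using a small hand-written stand-in grammar (not the generated Book grammar), could look like this:

from llama_cpp import LlamaGrammar

# Small, self-contained stand-in grammar: a JSON object with a single "title" string field.
grammar_text = r'''
root ::= "{" ws "\"title\"" ws ":" ws string ws "}"
string ::= "\"" ([^"\\] | "\\" ["\\/bfnrt])* "\""
ws ::= [ \t\n]*
'''
grammar = LlamaGrammar.from_string(grammar_text)

# Same prompt with grammar-constrained sampling...
main_model.reset()
main_model.create_completion(prompt, max_tokens=256, grammar=grammar)

# ...and without it.
main_model.reset()
main_model.create_completion(prompt, max_tokens=256)

# With verbose=True, llama.cpp prints a sample-time line after each call,
# so the two runs can be compared directly.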

this-josh commented 2 months ago

Ah understood.

So is it expected that grammar can increase the sampling time per token by an order of magnitude?

Maximilian-Winter commented 2 months ago

I would say it depends on the complexity of the grammar.
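One way to quantify that, reusing main_model, prompt, and the LlamaGrammar import from the sketch above, is to time the same prompt under grammars of different complexity (both are hand-written stand-ins, not the generated Book grammar):

# Trivial grammar: a single quoted string.
simple_grammar = r'''
root ::= "\"" [A-Za-z0-9 .,]* "\""
'''

# Richer grammar: a multi-field JSON object with an enum-like alternative.
richer_grammar = r'''
root ::= "{" ws "\"title\"" ws ":" ws string "," ws "\"category\"" ws ":" ws category "," ws "\"summary\"" ws ":" ws string ws "}"
category ::= "\"Fiction\"" | "\"Non-Fiction\""
string ::= "\"" ([^"\\] | "\\" ["\\/bfnrt])* "\""
ws ::= [ \t\n]*
'''

for grammar_text in (simple_grammar, richer_grammar):
    main_model.reset()
    main_model.create_completion(prompt, max_tokens=256,
                                 grammar=LlamaGrammar.from_string(grammar_text))
    # Compare the "sample time ... ms per token" line printed after each run.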