cheshire-cat-ai / core

Production-ready AI agent framework
https://cheshirecat.ai
GNU General Public License v3.0

Store intermediate prompts, replies, and token count for each message - and send back to the client #864

Open pieroit opened 6 days ago

pieroit commented 6 days ago

For every message submitted to the Cat, we can store in working_memory the intermediate prompts, replies, and token count of every model interaction.

This info can be sent back to the client in the why field. This would allow for easier debugging and better estimates of resource usage.

Just a proposal: interactions could be accumulated in a list, something like:


# init (before cat reads message)
cat.working_memory.model_interactions = []

# at each LLM usage
cat.working_memory.model_interactions.append(
  ModelInteraction(
    model_type="llm"
    source="ProceduresAgent"
    prompt="some prompt",
    reply="llm output",
    input_tokens=340,
    output_tokens=100
  )
)

# at each embedder usage
cat.working_memory.model_interactions.append(
  ModelInteraction(
    model_type="embedder"
    source="recall"
    prompt="some prompt",
    reply=[0.3, 0.1, 0.87],
    input_tokens=340,
    output_tokens=0
  )
)

# when sending the response back to the client
CatMessage.why.model_interactions = cat.working_memory.model_interactions
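
ModelInteraction does not exist yet, so here is a minimal sketch of what it could look like, assuming a Pydantic model; the field names simply mirror the snippet above and are open for discussion:

from typing import List, Literal, Union
from pydantic import BaseModel

class ModelInteraction(BaseModel):
    # hypothetical record of a single LLM or embedder call
    model_type: Literal["llm", "embedder"]
    source: str                      # which agent/step triggered the call, e.g. "ProceduresAgent"
    prompt: str
    reply: Union[str, List[float]]   # text for LLMs, embedding vector for the embedder
    input_tokens: int = 0
    output_tokens: int = 0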

Token count

We could count how many tokens have been used for an interaction, provided the counting method works for all LLMs. No OpenAI-only solutions!

Input tokens

For input tokens, we could use tiktoken, which is already a dependency in core. The problem is that tiktoken is OpenAI-only, so we can count the tokens in a prompt, but the estimate may not hold for non-OpenAI LLMs. Probably not much of a difference?
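
As a rough sketch of the tiktoken approach (the cl100k_base encoding is an assumption, and for non-OpenAI models the result is only an approximation):

import tiktoken

# cl100k_base is the encoding used by recent OpenAI models;
# for other vendors this only approximates the real tokenizer.
_encoding = tiktoken.get_encoding("cl100k_base")

def estimate_input_tokens(prompt: str) -> int:
    return len(_encoding.encode(prompt))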

Output tokens

For output tokens, we could add a LangChain callback*, which would be vendor-independent but only work for streaming LLMs. Or count with tiktoken once the LLM reply is back.


*We already have a NewTokenHandler callback; we can add a TokenCounter, so that counting stays disjoint from sending each token to the client.
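
A rough sketch of what such a TokenCounter could look like (assuming LangChain's BaseCallbackHandler; the import path depends on the LangChain version, and the count is per streamed chunk):

from langchain.callbacks.base import BaseCallbackHandler

class TokenCounter(BaseCallbackHandler):
    # hypothetical callback that only counts streamed tokens,
    # kept separate from NewTokenHandler (which forwards them to the client)
    def __init__(self):
        self.output_tokens = 0

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # called once per streamed token, so this only works for streaming LLMs
        self.output_tokens += 1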

zAlweNy26 commented 6 days ago

Amazing idea 💯