cheshire-cat-ai / core

Production-ready AI agent framework
https://cheshirecat.ai
GNU General Public License v3.0

Store intermediate prompts, replies, and token count for each message - and send back to the client #864

Open pieroit opened 6 days ago

pieroit commented 6 days ago

For every message submitted to the Cat, we can store in working_memory the intermediate prompts, replies, and token count of every model interaction.

This info can be sent back to the client in the why field. This would allow for easier debugging and better estimates of resource usage.

Just a proposal: interactions could be accumulated in a list, something like:


# init (before cat reads message)
cat.working_memory.model_interactions = []

# at each LLM usage
cat.working_memory.model_interactions.append(
  ModelInteraction(
    model_type="llm"
    source="ProceduresAgent"
    prompt="some prompt",
    reply="llm output",
    input_tokens=340,
    output_tokens=100
  )
)

# at each embedder usage
cat.working_memory.model_interactions.append(
  ModelInteraction(
    model_type="embedder"
    source="recall"
    prompt="some prompt",
    reply=[0.3, 0.1, 0.87],
    input_tokens=340,
    output_tokens=0
  )
)

# when sending the response back to the client
CatMessage.why.model_interactions = cat.working_memory.model_interactions
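
ModelInteraction does not exist yet, so here is a minimal sketch of what it could look like, assuming a Pydantic model; the field names simply mirror the snippet above and are open for discussion:

from typing import List, Literal, Union
from pydantic import BaseModel

class ModelInteraction(BaseModel):
    # hypothetical record of a single LLM or embedder call
    model_type: Literal["llm", "embedder"]
    source: str                      # which agent/step triggered the call, e.g. "ProceduresAgent"
    prompt: str
    reply: Union[str, List[float]]   # text for LLMs, embedding vector for the embedder
    input_tokens: int = 0
    output_tokens: int = 0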

Token count

We could count how many tokens have been used for an interaction, provided the counting method works for all LLMs. No OpenAI-only solutions!

Input tokens

For input tokens, we could use tiktoken, which is already a dependency in core. The problem is that tiktoken is OpenAI-only, so we can count the tokens in a prompt, but the estimate may not hold for non-OpenAI LLMs. Probably not much of a difference?
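
As a rough sketch of the tiktoken approach (the cl100k_base encoding is an assumption, and for non-OpenAI models the result is only an approximation):

import tiktoken

# cl100k_base is the encoding used by recent OpenAI models;
# for other vendors this only approximates the real tokenizer.
_encoding = tiktoken.get_encoding("cl100k_base")

def estimate_input_tokens(prompt: str) -> int:
    return len(_encoding.encode(prompt))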

Output tokens

For output tokens, we could add a LangChain callback*, which would be vendor-independent but only work for streaming LLMs. Or count with tiktoken once the LLM reply is back.


*We already have a NewTokenHandler callback; we can add a TokenCounter, so that counting stays disjoint from sending each token to the client.
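
A rough sketch of what such a TokenCounter could look like (assuming LangChain's BaseCallbackHandler; the import path depends on the LangChain version, and the count is per streamed chunk):

from langchain.callbacks.base import BaseCallbackHandler

class TokenCounter(BaseCallbackHandler):
    # hypothetical callback that only counts streamed tokens,
    # kept separate from NewTokenHandler (which forwards them to the client)
    def __init__(self):
        self.output_tokens = 0

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # called once per streamed token, so this only works for streaming LLMs
        self.output_tokens += 1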

zAlweNy26 commented 6 days ago

Amazing idea 💯