For every message submitted to the Cat, we can store in `working_memory`:

- tokens used (input and output)
- prompts used
- replies for each prompt

This info can be sent back to the client in the `why`. It would allow for easier debugging and better estimates of resource usage.
Just a proposal: interactions could be accumulated in a list, something like:

```python
# init (before the Cat reads the message)
cat.working_memory.model_interactions = []

# at each LLM usage
cat.working_memory.model_interactions.append(
    ModelInteraction(
        model_type="llm",
        source="ProceduresAgent",
        prompt="some prompt",
        reply="llm output",
        input_tokens=340,
        output_tokens=100,
    )
)

# at each embedder usage
cat.working_memory.model_interactions.append(
    ModelInteraction(
        model_type="embedder",
        source="recall",
        prompt="some prompt",
        reply=[0.3, 0.1, 0.87],
        input_tokens=340,
        output_tokens=0,
    )
)

# when sending the response back to the client
CatMessage.why.model_interactions = cat.working_memory.model_interactions
```
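For reference, `ModelInteraction` could be a small data model. A minimal sketch using Pydantic, with field names and types read off the example above (this class is an assumption of the proposal, not existing core code):

```python
from typing import List, Literal, Union

from pydantic import BaseModel


class ModelInteraction(BaseModel):
    """One LLM or embedder call (proposed fields, mirroring the example above)."""

    model_type: Literal["llm", "embedder"]
    source: str                      # component that triggered the call, e.g. "ProceduresAgent"
    prompt: str
    reply: Union[str, List[float]]   # text for an LLM, an embedding vector for an embedder
    input_tokens: int
    output_tokens: int = 0
```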
## Token count

We could count how many tokens are used in an interaction, as long as the counting method works for all LLMs. No OpenAI-only solutions!
### Input tokens

For input tokens, we could use `tiktoken`, which is already a dependency in core. The problem is that `tiktoken` implements OpenAI's tokenizers, so we can count the tokens in a prompt, but the estimate may not hold for non-OpenAI LLMs. Probably not much of a difference, though?
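A minimal sketch of such a count, assuming the `cl100k_base` encoding (for non-OpenAI models this would only be an approximation):

```python
import tiktoken


def count_input_tokens(prompt: str, encoding_name: str = "cl100k_base") -> int:
    """Approximate the prompt's token count with an OpenAI tokenizer."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(prompt))


input_tokens = count_input_tokens("some prompt")
```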
### Output tokens

For output tokens, add a LangChain callback*, which would be vendor-independent but only works for streaming LLMs. Alternatively, count with `tiktoken` once the LLM reply is back.

*We already have a `NewTokenHandler` callback; we could add a `TokenCounter`, so that counting stays disjoint from sending each token to the client.
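A minimal sketch of such a callback, assuming LangChain's `BaseCallbackHandler` interface; the `TokenCounter` name is the proposal's, and the wiring into `ModelInteraction` is left out:

```python
from langchain.callbacks.base import BaseCallbackHandler


class TokenCounter(BaseCallbackHandler):
    """Count streamed output tokens, independently of forwarding them to the client."""

    def __init__(self) -> None:
        self.output_tokens = 0

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # called once per streamed token, for any vendor that supports streaming
        self.output_tokens += 1
```

The counter would be passed in the callbacks list alongside `NewTokenHandler` when invoking the LLM, and its `output_tokens` read afterwards to fill the `ModelInteraction`.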