Closed nick-youngblut closed 6 months ago
This is a good idea. It also benefits from the fact that LlamaIndex already has a TokenCountingHandler callback. In your view, what would this look like? Just some extra attributes?
Something like `token_counter.total_prompt_cost`?
One note of caution: Tokenizing, especially with tiktoken, is actually quite slow. Where you can help it, it's generally better to have a separate service handle tokenizing + counting. The main use case I've seen is people want to track their total token usage after making all their requests rather than estimating before.
For AgentOps, we've built out a basic callback handler for LangChain agents. I'm going to put LlamaIndex on our radar as well.
I'm mainly using the SQLTableRetrieverQueryEngine query engine in llama-index, so I can't simply use the user's prompt and the model response to calculate costs with tokencost, given that the query engine utilizes multiple prompt templates for the sql query and summarization.
I found the TokenCountingHandler docs, but this callback tokenizes instead of just providing the text for the entire prompt (& response).
It would be great if tokencost included a callback method that was fully compatible with llama-index, but didn't require tokenization with tiktoken, given it is slow (as you point out). Maybe just a modified version of the Aim Callback could be used?
> I'm mainly using the SQLTableRetrieverQueryEngine query engine in llama-index, so I can't simply use the user's prompt and the model response to calculate costs with tokencost, given that the query engine utilizes multiple prompt templates for the sql query and summarization.
Yep, same issue with LangChain. Llama and LC do a lot of heavy lifting to abstract away the prompt manipulation, but it comes at the cost of users not knowing what the exact inputs + outputs are. LangChain was too difficult for me to pull the prompts out of, so I just relied on the `prompt_tokens` and `completion_tokens` they provide in the generations response (I think this is provided by OpenAI). To calculate the cost, I just multiply them by the cost lookup values in the tokencost `TOKEN_COSTS` dict.
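The arithmetic behind that approach is just a per-token multiply. A minimal sketch, assuming a lookup dict keyed by model name (the dict shape and price values here are illustrative, not tokencost's actual `TOKEN_COSTS` schema):

```python
# Hypothetical cost lookup keyed by model name; prices are illustrative only
# and expressed in USD per token (list prices are usually quoted per 1K tokens).
TOKEN_COSTS = {
    "gpt-3.5-turbo": {"prompt": 0.0015 / 1000, "completion": 0.002 / 1000},
}

def cost_from_usage(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Compute request cost from the token counts the API already reports."""
    rates = TOKEN_COSTS[model]
    return prompt_tokens * rates["prompt"] + completion_tokens * rates["completion"]

print(round(cost_from_usage("gpt-3.5-turbo", 1200, 300), 6))
```

This sidesteps client-side tokenization entirely: the provider's reported usage numbers are authoritative, and the client only does the multiplication.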
I don't know LlamaIndex well enough to know how to extract the prompts. Do they offer something analogous?
If you have a specific fix in mind, happy to merge it in.
I've built off of the SimpleLLMHandler to create a token cost calculator callback:
```python
from typing import Any, Dict, List, Optional, cast

from llama_index.callbacks.base_handler import BaseCallbackHandler
from llama_index.callbacks.schema import CBEventType, EventPayload
from tokencost import calculate_prompt_cost, calculate_completion_cost, USD_PER_TPU


class TokenCostHandler(BaseCallbackHandler):
    """Callback handler for printing LLM prompt/completion costs."""

    def __init__(self, model) -> None:
        super().__init__(event_starts_to_ignore=[], event_ends_to_ignore=[])
        self.model = model

    def start_trace(self, trace_id: Optional[str] = None) -> None:
        return

    def end_trace(
        self,
        trace_id: Optional[str] = None,
        trace_map: Optional[Dict[str, List[str]]] = None,
    ) -> None:
        return

    def _calc_llm_event_cost(self, payload: dict) -> None:
        from llama_index.llms import ChatMessage

        if EventPayload.PROMPT in payload:
            # Completion-style event: plain prompt + completion strings
            prompt = str(payload.get(EventPayload.PROMPT))
            completion = str(payload.get(EventPayload.COMPLETION))
            prompt_cost = calculate_prompt_cost(prompt, self.model) / USD_PER_TPU
            completion_cost = calculate_completion_cost(completion, self.model) / USD_PER_TPU
        elif EventPayload.MESSAGES in payload:
            # Chat-style event: join the messages into a single string
            messages = cast(List[ChatMessage], payload.get(EventPayload.MESSAGES, []))
            messages_str = "\n".join([str(x) for x in messages])
            prompt_cost = calculate_prompt_cost(messages_str, self.model) / USD_PER_TPU
            response = str(payload.get(EventPayload.RESPONSE))
            completion_cost = calculate_completion_cost(response, self.model) / USD_PER_TPU
        else:
            # Neither payload shape we know how to cost
            return

        print(f"# Prompt cost: {prompt_cost}")
        print(f"# Completion cost: {completion_cost}")
        print("\n")

    def on_event_start(
        self,
        event_type: CBEventType,
        payload: Optional[Dict[str, Any]] = None,
        event_id: str = "",
        parent_id: str = "",
        **kwargs: Any,
    ) -> str:
        return event_id

    def on_event_end(
        self,
        event_type: CBEventType,
        payload: Optional[Dict[str, Any]] = None,
        event_id: str = "",
        **kwargs: Any,
    ) -> None:
        """Cost the LLM events as needed."""
        if event_type == CBEventType.LLM and payload is not None:
            self._calc_llm_event_cost(payload)
```
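The branch logic in `_calc_llm_event_cost` can be sanity-checked without llama-index by using plain string keys as stand-ins for the `EventPayload` enum, and a crude whitespace-split "tokenizer" at a flat price as a stand-in for tokencost's calculators (all of these stand-ins are assumptions for illustration only):

```python
# String keys standing in for llama-index's EventPayload enum values.
PROMPT, COMPLETION, MESSAGES, RESPONSE = "prompt", "completion", "messages", "response"
PRICE_PER_TOKEN = 0.000002  # hypothetical flat USD rate

def rough_cost(text: str) -> float:
    # Crude stand-in for calculate_prompt_cost / calculate_completion_cost:
    # count whitespace-split "tokens" and multiply by a flat price.
    return len(text.split()) * PRICE_PER_TOKEN

def calc_llm_event_cost(payload: dict) -> tuple:
    """Mirror of the handler's branch logic: returns (prompt_cost, completion_cost)."""
    if PROMPT in payload:  # completion-style event
        return rough_cost(str(payload[PROMPT])), rough_cost(str(payload[COMPLETION]))
    if MESSAGES in payload:  # chat-style event: join messages into one string
        messages_str = "\n".join(str(m) for m in payload.get(MESSAGES, []))
        return rough_cost(messages_str), rough_cost(str(payload[RESPONSE]))
    raise ValueError("unrecognized payload shape")

print(calc_llm_event_cost({"prompt": "two words", "completion": "one"}))
```

The real handler differs in that it uses enum keys and tokencost's model-aware pricing, but the control flow is the same.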
I'm not sure if I should be calculating for `messages_str` and `response`.
This is awesome, thanks! By the looks of it, this seems like the correct implementation. The only thing I'd caution is coming up with the correct name when setting `self.model = model`. There isn't really a standard dictionary of model names yet. That's what I'd hope to achieve with tokencost, and LlamaIndex might have different names.
I don't know the best way to fit this in the repo, perhaps adding a callbacks README section. Any suggestions?
> I'm not sure if I should be calculating for `messages_str` and `response`.
Looks correct to me. Is LlamaIndex returning messages as a messages dict (i.e. what OpenAI suggests using) or as a plain string? `calculate_prompt_cost` is designed to handle either, but there might be a slight over/underestimate unless we know exactly what format of string is being sent to OpenAI.
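The potential over/underestimate comes from the gap between the two shapes a prompt can take. A quick sketch contrasting them (the role names follow OpenAI's chat format; the flattening convention is an arbitrary choice for illustration, and the real chat format also adds per-message structural tokens that a flattened string doesn't capture):

```python
# The same conversation as OpenAI-style message dicts vs. one flattened string.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the sales table."},
]

# One possible flattening; a different convention would yield a different
# token count, which is exactly the estimation ambiguity discussed above.
flattened = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
print(flattened)
```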
> The only thing I'd caution is coming up with the correct name when setting `self.model = model`

I believe llama-index uses the model naming specified by OpenAI (https://platform.openai.com/docs/models). So as long as tokencost supports all of those names, there shouldn't be an issue.
> I don't know the best way to fit this in the repo, perhaps adding a callbacks README section. Any suggestions?

You could include the callback class in tokencost, but llama-index then becomes a dependency.
Created a PR with the callback handler. I imagine some folks using LangChain will want something similar, so it makes sense to add. I was able to add llama-index as an optional dependency: simply `pip install tokencost[llama-index]`, and the callback handler will be available.
Let me know if this helps; if so, I'll merge
> Let me know if this helps; if so, I'll merge
That's awesome! It will be very helpful to have the callback handler in tokencost versus adding the same code to each of my projects. Thanks!
Now available in version 0.0.5 :)
For using tokencost with llama-index, it would be helpful to include info in the tokencost docs on how to use tokencost with a llama-index callback manager, as described in https://docs.llamaindex.ai/en/stable/examples/callbacks/TokenCountingHandler.html